Flume, mail # user - ExecSource copy does not match original. Thoughts please?


Re: ExecSource copy does not match original. Thoughts please?
Juhani Connolly 2013-08-09, 03:30
Hi Chris,

Flume as a whole doesn't guarantee that events won't be duplicated. If a
batch of events in the HDFS sink fails, Flume will retry the full batch
even if some of it was already written to HDFS. Strategies for
mitigating this include reducing the HDFS sink batch size or simply
post-processing the HDFS data to remove duplicate logs.
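
As a rough illustration of the post-processing option, here is a minimal
Python sketch that drops repeated lines from a stream while keeping
first-seen order. The function name is made up for this example, and it
assumes every log line is unique in the source (for instance because it
carries a timestamp plus a transaction id); if the application can
legitimately write identical lines, this would remove those too.

```python
import hashlib
from typing import Iterable, Iterator


def drop_duplicate_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, preserving first-seen order.

    Only a 20-byte digest per distinct line is kept in memory, which
    still adds up for multi-million-line files.
    """
    seen = set()
    for line in lines:
        digest = hashlib.sha1(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield line
```

One way to use it would be to feed it the output of "hadoop fs -text"
for a day's files and compare the resulting line count with the source
file's.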

Also, you should keep in mind that exec source does not guarantee
delivery, as it cannot resend data (so if something fails to enter the
channel, it does not get generated again by the source).

Various solutions exist for this, a popular one being the spooling
directory source. In our case, we put together a simple program that
works sort of like tail: it sends data to Flume over the Scribe
protocol and resends failed events.
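
To make the "tail that resends" idea concrete, here is a minimal Python
sketch. It is not our actual program: ship_new_lines and send_batch are
hypothetical names, and send_batch stands in for whatever transport you
use (it should return True only once the receiver has accepted the
batch). The point is that the saved file offset only moves forward past
batches that were acknowledged, so a failed batch is read and sent again
on the next call.

```python
from typing import Callable, List


def ship_new_lines(
    path: str,
    offset: int,
    send_batch: Callable[[List[bytes]], bool],
    batch_size: int = 100,
) -> int:
    """Send complete lines found after byte `offset`; return the new offset.

    The returned offset covers only batches that send_batch accepted.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        batch: List[bytes] = []
        batch_end = offset
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a half-written line; retry later
            batch.append(line)
            batch_end += len(line)
            if len(batch) >= batch_size:
                if not send_batch(batch):
                    return offset  # not acknowledged: resend next time
                offset = batch_end
                batch = []
        # Flush whatever is left in a final, smaller batch.
        if batch and send_batch(batch):
            offset = batch_end
    return offset
```

A caller would persist the returned offset and call this in a loop with
a short sleep. Note this is at-least-once delivery: if a batch is
accepted but the acknowledgement is lost, it gets sent again, so
duplicates are still possible downstream. Log rotation is not handled
here.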

On 08/01/2013 03:48 AM, Chris Neal wrote:
> Hi all.
>
> I have an ExecSource doing a tail -F on a log4J log file for an app,
> copying it into HDFS.  I get no errors/warnings/exceptions from the
> Flume nodes, but when I went to make sure that indeed the contents of
> the files matched, I found that they did not. :(  I tested several
> days worth of files, and none matched.  I'm not sure where to even
> start looking at this discrepancy. Does anyone have any thoughts?
>
> If I would have come across some errors somewhere, I would understand
> some differences, but for everything to appear to work fine, and then
> not match up, that concerns me.
>
> Thank you very much for any input.
> Chris
>
> In HDFS from Flume, file size in lines:
> [root@hadoopnn01 ~]# time sudo -u hdfs hadoop fs -text
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.*
> | wc -l
>
> 2812850
>
> Actual source file size in lines:
> cneal@pegslog14[504]:/pegs/logcabin01/udprodae01/pegs/logs/udprodae01/d1c1_udprodae01/UD>
> time wc -l UDXMLTrans.log.2013-07-27
>
>  2812843 UDXMLTrans.log.2013-07-27
>
> The source file:
> cneal@pegslog14[505]:/pegs/logcabin01/udprodae01/pegs/logs/udprodae01/d1c1_udprodae01/UD>
> ls -l UDXMLTrans.log.2013-07-27
> -rw-r--r--   1 logger   other    19228787343 Jul 28 00:00
> UDXMLTrans.log.2013-07-27
>
> The files in HDFS:
> [root@hadoopnn01 ~]# time sudo -u hdfs hadoop fs -ls
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.*
> Found 1 items
> -rw-r--r--   3 flume supergroup  200021549 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_1.1374883211499.gz
> Found 1 items
> -rw-r--r--   3 flume supergroup  195398211 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_10.1374883210982.gz
> Found 1 items
> -rw-r--r--   3 root  supergroup  193557330 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_13.1374883212709.gz
> Found 1 items
> -rw-r--r--   3 root  supergroup  194163091 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_14.1374883212712.gz
> Found 1 items
> -rw-r--r--   3 flume supergroup  192546288 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_2.1374883211446.gz
> Found 1 items
> -rw-r--r--   3 root  supergroup  191863735 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_5.1374883208056.gz
> Found 1 items
> -rw-r--r--   3 root  supergroup  196733297 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_6.1374883208056.gz
> Found 1 items
> -rw-r--r--   3 flume supergroup  193451845 2013-07-28 00:00
> /pegs/logs/udprodae01/d1c1_udprodae01/UD/UDTrans/2013-07-27/UDXMLTrans.log.2013-07-27_9.1374883210989.gz
>