Flume >> mail # user >> Question about gzip compression when using Flume Ng


Thread:
- Sagar Mehta 2013-01-14, 19:18
- Connor Woodson 2013-01-14, 22:25
- Sagar Mehta 2013-01-14, 22:34
- Connor Woodson 2013-01-14, 22:52
- Sagar Mehta 2013-01-14, 23:12
- Brock Noland 2013-01-14, 23:16
- Sagar Mehta 2013-01-14, 23:24
- Sagar Mehta 2013-01-14, 23:27
Re: Question about gzip compression when using Flume Ng
Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old;
I would upgrade to CDH3u5 or CDH 4.1.2.

On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <[EMAIL PROTECTED]> wrote:
> About the bz2 suggestion, we have a ton of downstream jobs that assume gzip
> compressed files - so it is better to stick to gzip.
>
> The plan B for us is to have a Oozie step to gzip compress the logs before
> proceeding with downstream Hadoop jobs - but that looks like a hack to me!!
>
> Sagar
>
>
> On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <[EMAIL PROTECTED]> wrote:
>>
>> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat
>> collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
>>
>> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression
>> OK, trailing garbage ignored
>> 100
>>
>> This should be about 50,000 events for the 5 min window!!
>>
>> Sagar
>>
>> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>>
>>> Can you try:  zcat file > output
>>>
>>> I think what is occurring is that, because of the flush, the output
>>> file is actually several concatenated gz files.
>>>
>>> Brock
>>>
>>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <[EMAIL PROTECTED]>
>>> wrote:
>>> > Yeah, I had tried the text write format in vain before, but
>>> > nevertheless gave it a try again! Below is the latest file - still
>>> > the same thing.
>>> >
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
>>> > Mon Jan 14 23:02:07 UTC 2013
>>> >
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls
>>> >
>>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> > Found 1 items
>>> > -rw-r--r--   3 hadoop supergroup    4798117 2013-01-14 22:55
>>> >
>>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget
>>> >
>>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> > .
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip
>>> > collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >
>>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz:
>>> > decompression
>>> > OK, trailing garbage ignored
>>> >
>>> > Interestingly enough, the gzip page says it is a harmless warning -
>>> > http://www.gzip.org/#faq8
>>> >
>>> > However, I'm losing events on decompression, so I cannot afford to
>>> > ignore this warning. The gzip page gives an example about magnetic
>>> > tape - there is an analogy to HDFS blocks here, since the file is
>>> > initially stored in HDFS before I pull it out to the local
>>> > filesystem.
>>> >
>>> > Sagar
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson
>>> > <[EMAIL PROTECTED]>
>>> > wrote:
>>> >>
>>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce -
>>> http://incubator.apache.org/mrunit/
>>
>>
>

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
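Brock's diagnosis above can be reproduced locally: if the sink reopens the gzip stream on each flush, the resulting file is several gzip members back to back. A minimal sketch (plain Python, not Flume code; the "event" payloads are made up for illustration) showing that a single-member decompressor stops at the first member's trailer - the "trailing garbage ignored" case - while a multi-member reader, like `zcat file > output`, recovers everything:

```python
import gzip
import io
import zlib

# Two gzip members back to back, mimicking an HDFS sink file that was
# flushed mid-stream (assumption: this is what triggers the
# "trailing garbage ignored" warning seen in the transcripts above).
raw = gzip.compress(b"event-1\n") + gzip.compress(b"event-2\n")

# A single-member decompressor stops at the first gzip trailer and
# leaves the rest as "unused" - those are the lost events.
d = zlib.decompressobj(wbits=31)   # wbits=31 -> expect a gzip header
first = d.decompress(raw)
print(first)                       # b'event-1\n'
print(len(d.unused_data))          # bytes of the second member, ignored

# gzip.GzipFile walks every member (like `zcat file > output`),
# so nothing is dropped.
everything = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
print(everything)                  # b'event-1\nevent-2\n'
```

This matches the symptom in the thread: `gunzip`/`zcat` piped into `wc -l` report only the first member's lines, while redirecting `zcat` to a file concatenates all members.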
- Sagar Mehta 2013-01-15, 00:43
- Brock Noland 2013-01-15, 00:54
- Sagar Mehta 2013-01-15, 01:03
- Connor Woodson 2013-01-15, 01:17
- Sagar Mehta 2013-01-15, 01:52
- Bhaskar V. Karambelkar 2013-01-15, 01:25
- Connor Woodson 2013-01-15, 01:26
- Sagar Mehta 2013-01-15, 02:36
- Connor Woodson 2013-01-14, 23:17