Re: Flume bz2 issue while processing by a map reduce job

The same thing happens even with gzip.

Regards,
Jagadish

On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
> Hi
>
> I have a very peculiar scenario.
>
> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
> decompress and read it, and it has 0.2 million records.
> 2. When I give that file to a map-reduce job (Hadoop 1.0.3),
> surprisingly it reads only the first 100 records.
> 3. I then decompress the same file on the local file system, recompress
> it with the Linux bzip2 command, and copy it back to HDFS.
> 4. Now when I run the map-reduce job, it correctly processes all the
> records.
>
> I think the Flume agent writes compressed data to the HDFS file in
> batches, and somehow the bzip2 codec used by Hadoop reads only the
> first batch.
>
> So bz2 files generated by Flume can't be processed directly by a
> map-reduce job. Is there any solution to this?
>
> Any input on other compression formats?
>
> P.S.
> Versions:
>
> Flume 1.2.0 (Raw version; downloaded from
> http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
> Hadoop 1.0.3
>
> Regards,
> Jagadish
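
The symptom described above is consistent with the output file containing several concatenated compressed streams (one per flushed batch), read by a decoder that stops after the first stream. This is an assumption about the cause, not confirmed by the thread, but the effect is easy to reproduce with gzip in Python: a single-member decoder sees only the first batch, while a multi-member-aware reader recovers everything.

```python
import gzip
import zlib

# Simulate a sink that writes each flushed batch as its own gzip
# member, concatenated into one file (assumed Flume-like behavior).
data = gzip.compress(b"batch-1\n") + gzip.compress(b"batch-2\n")

# A decoder that handles only a single gzip member stops at the end
# of the first stream -- analogous to a codec reading only the first
# part of the file. The rest of the bytes end up in unused_data.
d = zlib.decompressobj(wbits=31)  # wbits=31: expect gzip framing
first = d.decompress(data)
print(first)                # b'batch-1\n' -- second batch is missing
print(len(d.unused_data))   # leftover bytes of the second member

# A multi-member-aware reader decodes all concatenated streams.
print(gzip.decompress(data))  # b'batch-1\nbatch-2\n'
```

The command-line `bzip2`/`gunzip` tools also decode all concatenated members, which would explain why recompressing the file locally (producing a single stream) made the job process every record.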