Re: Flume bz2 issue while processing by a map reduce job

The same thing happens even with gzip.

Regards,
Jagadish

On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
> Hi
>
> I have a very peculiar scenario.
>
> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
> decompress and read it. It has 0.2 million records.
> 2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and,
> surprisingly, it reads only the first 100 records (see the sketch
> after this list).
> 3. I then decompress the same file on the local file system, compress
> it again with the Linux bzip2 command, and copy it back to HDFS.
> 4. Now I run the map-reduce job, and this time it correctly processes
> all the records.
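
A quick way to reproduce the count from step 2 without running a full job is to read the file through the same codec the job uses. A rough sketch, assuming Hadoop 1.x's BZip2Codec is on the classpath and the bz2 file has been copied to the local file system (the class name and argument handling are illustrative):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.compress.BZip2Codec;

// Counts newline-delimited records in a local .bz2 file through Hadoop's
// BZip2Codec. If the codec stops after the first compressed stream, the
// count here should match the ~100 records the MR job sees.
public class CodecRecordCount {
    public static void main(String[] args) throws IOException {
        BZip2Codec codec = new BZip2Codec();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                codec.createInputStream(new FileInputStream(args[0]))));
        long records = 0;
        while (reader.readLine() != null) {
            records++;
        }
        reader.close();
        System.out.println(records + " records read via BZip2Codec");
    }
}

If this also stops around 100 records, it is the codec, not the job setup, where the records disappear.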
>
> I think the Flume agent writes compressed data to the HDFS file in
> batches, so the file ends up containing several concatenated bzip2
> streams, and the bzip2 codec used by Hadoop somehow reads only the
> first one.
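
That hypothesis is easy to test outside Flume and Hadoop. Below is a minimal sketch, assuming Apache Commons Compress (not something the thread itself uses; its BZip2CompressorInputStream takes an explicit decompressConcatenated flag). It writes two back-to-back bzip2 streams into one file, the way batch-wise flushes would, then reads the file once as a single stream and once as a concatenated one:

import java.io.*;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class ConcatBz2Demo {
    public static void main(String[] args) throws IOException {
        File file = new File("concat.bz2");

        // Simulate two batch flushes: each one opens and finishes its own
        // bzip2 stream, so the file holds two back-to-back bzip2 streams.
        OutputStream raw = new FileOutputStream(file);
        for (String batch : new String[] {"batch-1\n", "batch-2\n"}) {
            BZip2CompressorOutputStream bz2 = new BZip2CompressorOutputStream(raw);
            bz2.write(batch.getBytes("UTF-8"));
            bz2.finish(); // ends this bzip2 stream but leaves 'raw' open
        }
        raw.close();

        // A single-stream reader stops after batch-1 -- the suspected MR behavior.
        System.out.print(readAll(file, false));
        // A concatenation-aware reader sees both batches.
        System.out.print(readAll(file, true));
    }

    private static String readAll(File file, boolean concatenated) throws IOException {
        BZip2CompressorInputStream in = new BZip2CompressorInputStream(
                new BufferedInputStream(new FileInputStream(file)), concatenated);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toString("UTF-8");
    }
}

If the single-stream read returns only the first batch while the concatenated read returns both, the file is not corrupt; it is simply a sequence of bzip2 streams, and only readers that handle concatenation see all of it. That would also explain why recompressing with the command-line bzip2 tool, which produces a single stream, makes the job read every record.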
>
> This means that bz2 files generated by Flume, if used directly, can't
> be processed by a map-reduce job.
> Is there any solution to this?
>
> Any input on other compression formats?
>
> P.S.
> Versions:
>
> Flume 1.2.0 (Raw version; downloaded from
> http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
> Hadoop 1.0.3
>
> Regards,
> Jagadish