Jagadish Bihani 2012-10-26, 11:00
The same thing happens even with gzip.
On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
> I have a very peculiar scenario:
> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
> decompress and read it. It has 0.2 million records.
> 2. Now I feed that file to a map-reduce job (Hadoop 1.0.3), and
> surprisingly it reads only the first 100 records.
> 3. I then decompress the same file on the local file system, compress
> it again with the Linux bzip2 command, and copy it back to HDFS.
> 4. Now I run the map-reduce job, and this time it correctly processes
> all the records.
> I think the Flume agent writes compressed data to the HDFS file in
> batches, and somehow the bzip2 codec used by Hadoop reads only the
> first part of it.
> This means bz2 files generated by Flume can't be processed directly by
> a map-reduce job.
> Is there any solution to this?
> Any inputs on other compression formats?
> Flume 1.2.0 (Raw version; downloaded from
> Hadoop 1.0.3
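
A plausible explanation for the symptoms above, though not confirmed in this
message: each batch flush by the HDFS sink may open and close its own
compression stream, so the resulting file is several complete bzip2 streams
concatenated back to back. The command-line bzip2 tool decompresses all of
them, which is why the locally recompressed copy works, but a decoder that
stops at the end of the first stream sees only the first batch. The sketch
below is illustrative only; it uses Apache Commons Compress rather than
anything from Flume or Hadoop, and the class and string names are made up.
It shows the difference between a single-stream reader and a
concatenation-aware one:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class MultiStreamBz2Demo {

    // Compress a string into its own complete bzip2 stream.
    static byte[] bz2(String s) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        BZip2CompressorOutputStream out = new BZip2CompressorOutputStream(buf);
        out.write(s.getBytes("UTF-8"));
        out.close();
        return buf.toByteArray();
    }

    // Drain a bzip2 input stream and return the decompressed text.
    static String readAll(BZip2CompressorInputStream in) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            buf.write(b);
        }
        in.close();
        return new String(buf.toByteArray(), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Simulate two batch flushes: two complete bzip2 streams
        // written back to back into one "file".
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(bz2("batch-1\n"));
        file.write(bz2("batch-2\n"));
        byte[] data = file.toByteArray();

        // A single-stream reader stops at the end of the first stream...
        System.out.print(readAll(new BZip2CompressorInputStream(
                new ByteArrayInputStream(data), false)));  // prints batch-1 only

        // ...while a concatenation-aware reader sees everything.
        System.out.print(readAll(new BZip2CompressorInputStream(
                new ByteArrayInputStream(data), true)));   // prints batch-1 and batch-2
    }
}

If this is indeed what is happening, recompressing on the local file system
"fixes" the file because bzip2 writes it back out as one single stream, which
any decoder reads in full.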