Re: Flume compression peculiar behaviour while processing compressed files by a map reduce job
Does anyone have any inputs about why the below mentioned behaviour might
be happening?
On 10/26/2012 06:32 PM, Jagadish Bihani wrote:
> Same thing happens even for gzip.
> On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>> I have a very peculiar scenario.
>> 1. My HDFS sink creates a bz2 file. The file is perfectly fine; I can
>> decompress it and read it. It has 0.2 million records.
>> 2. Now I give that file to a MapReduce job (Hadoop 1.0.3) and,
>> surprisingly, it reads only the first 100 records.
>> 3. I then decompress the same file on the local file system, compress it
>> again with the Linux bzip2 command, and copy it back to HDFS.
>> 4. Now I run the MapReduce job again, and this time it correctly
>> processes all the records.
>> I think the Flume agent writes compressed data to the HDFS file in
>> batches, and somehow the bzip2 codec used by Hadoop reads only the first
>> part of it. This means that bz2 files generated by Flume, if used
>> directly, cannot be processed by a MapReduce job.
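To make the suspicion above concrete: if the sink really finishes a bzip2
stream on every batch flush, the resulting file is a concatenation of
several bzip2 streams, and a reader that expects a single stream stops at
the first one. A rough standalone check for that, using Apache Commons
Compress (only a sketch of the idea; the default path below is just a
placeholder), would be:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class Bzip2StreamCheck {

    // Counts newline-terminated records in a bz2 file, once treating it as
    // a single bzip2 stream and once as a concatenation of bzip2 streams.
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "flume-output.bz2"; // placeholder
        System.out.println("single stream       : " + countRecords(path, false));
        System.out.println("concatenated streams: " + countRecords(path, true));
    }

    private static long countRecords(String path, boolean concatenated) throws IOException {
        long records = 0;
        try (InputStream raw = new FileInputStream(path);
             InputStream in = new BZip2CompressorInputStream(raw, concatenated)) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    records++;
                }
            }
        }
        return records;
    }
}

If the single-stream count matches what the MapReduce job sees while the
concatenated count matches the real number of records, that would point at
a multi-stream file rather than a corrupt one.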
>> Is there any solution to it?
>> Any inputs about other compression formats?
>> Flume 1.2.0 (Raw version; downloaded from
>> Hadoop 1.0.3
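In case the sink configuration is relevant: a minimal HDFS sink of the kind
described above would be configured roughly as below (property names as
documented for the Flume 1.2 HDFS sink; the agent/channel names and the
values here are only illustrative, not my exact settings):

agent.sinks = hdfsSink
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memChannel
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:9000/flume/events
agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
agent.sinks.hdfsSink.hdfs.codeC = bzip2
agent.sinks.hdfsSink.hdfs.batchSize = 1000
agent.sinks.hdfsSink.hdfs.rollInterval = 300

hdfs.batchSize is the setting behind the batching suspicion: events are
flushed to the open file in batches of this size, so if each flush ends one
compressed stream and starts another, a single output file can end up
holding many bzip2 streams.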