Re: Flume bz2 issue while processing by a map reduce job
Jagadish Bihani 2012-10-30, 17:01
A few updates on that:
-- It looks like some kind of header issue.
-- When I copyToLocal the file and then copy it back to HDFS,
the map reduce job processes the file correctly.
Is it something related to
On 10/30/2012 09:15 PM, Brock Noland wrote:
> What kind of files is your sink writing out? Text, Sequence, etc?
> On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
> <[EMAIL PROTECTED]> wrote:
>> The same thing happens even with gzip.
>> On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>>> I have a very peculiar scenario.
>>> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
>>> decompress it and read it. It has 0.2 million records.
>>> 2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and, surprisingly,
>>> it reads only the first 100 records.
>>> 3. I then decompress the same file on the local file system, use the Linux
>>> bzip2 command to compress it again, and copy it to HDFS.
>>> 4. Now I run the map-reduce job and this time it correctly processes all
>>> the records.
>>> I think the Flume agent writes compressed data to the HDFS file in batches,
>>> and the bzip2 codec used by Hadoop reads only the first part of it.
>>> As a result, bz2 files generated by Flume, if used directly, can't be
>>> processed by a map-reduce job.
>>> Is there any solution to this?
>>> Any input on other compression formats?
>>> Flume 1.2.0 (Raw version; downloaded from
>>> Hadoop 1.0.3
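
The hypothesis in the original message (compressed data written in batches, with the reader stopping after the first part) can be illustrated outside Hadoop. The following is a minimal Python sketch, not a reproduction of Flume's HDFS sink or Hadoop's bzip2 codec: the record contents, the batch size of 100, and the helper names read_first_stream_only and read_all_streams are all assumptions made for illustration. It builds a file out of several independently compressed bz2 streams and compares a reader that decodes only the first stream with one that continues through every stream.

```python
import bz2

# Hypothetical stand-in for a batch-written file: each batch of records is
# compressed on its own and the resulting bz2 streams are appended together.
batches = [
    b"".join(b"record-%d\n" % i for i in range(start, start + 100))
    for start in range(0, 300, 100)
]
multi_stream_file = b"".join(bz2.compress(batch) for batch in batches)


def read_first_stream_only(data: bytes) -> bytes:
    """Decode a single bz2 stream and ignore any bytes that follow it."""
    decoder = bz2.BZ2Decompressor()
    # Bytes after the end of the first stream land in decoder.unused_data
    # and are simply dropped here.
    return decoder.decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decode every concatenated bz2 stream, one after another."""
    chunks = []
    while data:
        decoder = bz2.BZ2Decompressor()
        chunks.append(decoder.decompress(data))
        data = decoder.unused_data  # remaining streams, if any
    return b"".join(chunks)


first_only = read_first_stream_only(multi_stream_file).splitlines()
everything = read_all_streams(multi_stream_file).splitlines()
print(len(first_only), len(everything))
```

In this sketch, decompressing everything and compressing it again in one pass yields a single stream, which even the first-stream-only reader decodes in full. That matches steps 3 and 4 of the report, although it does not by itself confirm that this is what happens inside the Hadoop job.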