Re: Flume bz2 issue while processing by a map reduce job
Brock Noland 2012-10-30, 15:45
What kind of files is your sink writing out? Text, Sequence, etc?

On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
<[EMAIL PROTECTED]> wrote:
>
> The same thing happens even with gzip.
>
> Regards,
> Jagadish
>
>
> On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>>
>> Hi
>>
>> I have a very peculiar scenario.
>>
>> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
>> decompress and read it, and it has 0.2 million records.
>> 2. Now I give that file to a MapReduce job (Hadoop 1.0.3) and, surprisingly,
>> it only reads the first 100 records.
>> 3. I then decompress the same file on the local file system, recompress it
>> with the Linux bzip2 command, and copy it back to HDFS.
>> 4. Now I run the MapReduce job again, and this time it correctly processes
>> all the records.
>>
>> I think the Flume agent writes compressed data to the HDFS file in batches,
>> and somehow the bzip2 codec used by Hadoop reads only the first part of it.
>>
>> This means that bz2 files generated by Flume, if used directly, can't be
>> fully processed by a MapReduce job.
>> Is there any solution to this?
>>
>> Any input on other compression formats?
>>
>> P.S.
>> Versions:
>>
>> Flume 1.2.0 (Raw version; downloaded from
>> http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
>> Hadoop 1.0.3
>>
>> Regards,
>> Jagadish
>
>
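
If the batching theory above is right, each flushed batch would be written as
its own complete bzip2 stream, so the file is really several concatenated
streams, and a reader that expects a single stream stops at the first
end-of-stream marker. Here is a minimal sketch of that failure mode using
Python's bz2 module (just a stand-in to illustrate concatenated-stream
behavior, not Hadoop's actual codec):

    import bz2

    # Simulate batched writes: each flushed batch is a complete bzip2
    # stream, appended to the same file (concatenated members).
    batch1 = bz2.compress(b"record 1\nrecord 2\n")
    batch2 = bz2.compress(b"record 3\nrecord 4\n")
    data = batch1 + batch2

    # A single-stream decompressor stops at the first end-of-stream
    # marker; the second batch lands in unused_data and is silently lost.
    d = bz2.BZ2Decompressor()
    print(d.decompress(data))    # b'record 1\nrecord 2\n' -- batch 2 lost
    print(len(d.unused_data))    # leftover compressed bytes from batch 2

    # A concatenation-aware reader recovers every member.
    print(bz2.decompress(data))  # all four records

The command-line bzip2 tool does handle concatenated members on
decompression, which may be why decompressing and recompressing locally
worked: the recompressed file is a single stream.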

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/