Re: Flume compression peculiar behaviour while processing compressed files by a map reduce job
Jagadish Bihani 2012-10-30, 03:06
Does anyone have any input on why the behaviour described below might
have happened?

On 10/26/2012 06:32 PM, Jagadish Bihani wrote:
>
> Same thing happens even for gzip.
>
> Regards,
> Jagadish
>
> On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>> Hi
>>
>> I have a very peculiar scenario.
>>
>> 1. My HDFS sink creates a bz2 file. The file is perfectly fine; I can
>>    decompress it and read it. It has 0.2 million records.
>> 2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and,
>>    surprisingly, it only reads the first 100 records.
>> 3. I then decompress the same file on the local file system, use the
>>    Linux bzip2 command to compress it again, and copy it to HDFS.
>> 4. Now I run the map-reduce job and this time it correctly processes
>>    all the records.
>>
>> I think the Flume agent writes compressed data to the HDFS file in
>> batches, and somehow the bzip2 codec used by Hadoop uses only the
>> first part of it.
>>
>> This way, bz2 files generated by Flume, if used directly, can't be
>> processed by a map-reduce job.
>> Is there any solution to it?
>>
>> Any inputs about other compression formats?
>>
>> P.S.
>> Versions:
>>
>> Flume 1.2.0 (Raw version; downloaded from
>> http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
>> Hadoop 1.0.3
>>
>> Regards,
>> Jagadish
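
The hypothesis in the quoted message (compressed data written in batches, with the reader using only the first part) can be reproduced outside Hadoop. The sketch below is Python, not the Flume or Hadoop code path, and it makes no claim about what either project actually does internally. It only shows that several independently compressed bzip2 streams concatenated into one file still form a valid file, and that a decoder which stops at the first end-of-stream marker sees only the first batch. The batch size of 100 and the record contents are made-up values chosen to mirror the symptom.

```python
import bz2

# Hypothetical data: three batches of 100 records, each compressed on its own,
# as a writer that flushes compressed output per batch might produce.
batches = [
    b"".join(b"record-%d\n" % i for i in range(start, start + 100))
    for start in (0, 100, 200)
]

# Concatenating the compressed batches gives one multi-stream bz2 file.
multi_stream = b"".join(bz2.compress(batch) for batch in batches)

# A single-stream decoder stops at the first end-of-stream marker and
# leaves the remaining bytes untouched in unused_data.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multi_stream)
first_count = first_only.count(b"\n")
leftover = len(single.unused_data)

# bz2.decompress() (Python 3.3+) keeps going across concatenated streams.
all_count = bz2.decompress(multi_stream).count(b"\n")

print("single-stream reader saw:", first_count, "records")
print("multi-stream reader saw: ", all_count, "records")
print("bytes ignored by single-stream reader:", leftover)
```

If this is what is happening, it would also be consistent with step 3 of the quoted message: decompressing and recompressing with the `bzip2` command produces a single stream, which a single-stream reader handles completely.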