Jagadish Bihani 2012-10-26, 11:00
Jagadish Bihani 2012-10-26, 13:02
Brock Noland 2012-10-30, 15:45
Jagadish Bihani 2012-10-30, 17:01
Jagadish Bihani 2012-11-02, 07:50
Mike Percy 2012-11-02, 09:46
Re: Flume bz2 issue while processing by a map reduce job
Jagadish Bihani 2012-11-03, 11:32
Hi Mike
Thanks for the valuable inputs. That was driving us crazy.
However, I have tested that this issue does not occur with the lzo/lzop
compression formats (tested on Hadoop 1.0.3).

Regards,
Jagadish

On 11/02/2012 03:16 PM, Mike Percy wrote:
> Hi Jagadish,
> My understanding based on investigating this issue over the last
> couple of days is that MapReduce jobs will only read the first section
> of a concatenated bzip2 file. I believe you are correct that
> https://issues.apache.org/jira/browse/HADOOP-6852 is the only way to
> solve this issue, and that would only be for the Hadoop 2.0 line, I
> believe. I think that the Hadoop 1.x line would need to backport other
> patches from the 0.22 line, including
> https://issues.apache.org/jira/browse/HADOOP-6835, which may also be
> needed (my understanding is that that patch is already included in the
> 2.x line).
>
> I am aware of folks interested in trying to fix HADOOP-6852, however I
> have no ETA to give.
>
> From Flume's perspective, I know of no other way of ensuring
> durability using the hadoop-common APIs except for calling finalize in
> order to flush the compression buffer at each transaction/batch
> boundary, in order to call hflush()/hsync() with the fully written
> data. This results in concatenated compressed plain text files in the
> case of CompressedStream.
>
> Current workarounds include not using compression, reprocessing the
> compressed file as you mention, using a SequenceFile as a container,
> or using an Avro file as a container. The latter two are splittable
> and properly handle several compression codecs, including Snappy,
> which is a great way to go if you can do it.
>
> Regards,
> Mike
>
> On Fri, Nov 2, 2012 at 12:50 AM, Jagadish Bihani
> <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> Any inputs on this?
> It looks like a basic thing which, I guess, must have been handled
> in Flume.
>
> On 10/30/2012 10:31 PM, Jagadish Bihani wrote:
>> Text.
>>
>> Few updates on that:
>> -- It looks like some header issue.
>> -- When I copyToLocal the file and then copy it back to HDFS, the
>> map reduce job then processes the file correctly.
>> Is it something related to
>> https://issues.apache.org/jira/browse/HADOOP-6852?
>>
>> Regards,
>> Jagadish
>>
>>
>> On 10/30/2012 09:15 PM, Brock Noland wrote:
>>> What kind of files is your sink writing out? Text, Sequence, etc?
>>>
>>> On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
>>> <[EMAIL PROTECTED]> wrote:
>>>> Same thing happens even for gzip.
>>>>
>>>> Regards,
>>>> Jagadish
>>>>
>>>>
>>>> On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>>>>> Hi
>>>>>
>>>>> I have a very peculiar scenario.
>>>>>
>>>>> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
>>>>> decompress it and read it. It has 0.2 million records.
>>>>> 2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and,
>>>>> surprisingly, it only reads the first 100 records.
>>>>> 3. I then decompress the same file on the local file system, use the
>>>>> Linux bzip2 command to compress it again, and copy it to HDFS.
>>>>> 4. Now I run the map-reduce job and this time it correctly processes
>>>>> all the records.
>>>>>
>>>>> I think the Flume agent writes compressed data to the HDFS file in
>>>>> batches, and somehow the bzip2 codec used by Hadoop uses only the
>>>>> first part of it.
>>>>>
>>>>> This way, bz2 files generated by Flume, if used directly, can't be
>>>>> processed by a map-reduce job.
>>>>> Is there any solution to it?
>>>>>
>>>>> Any inputs about other compression formats?
>>>>>
>>>>> P.S.
>>>>> Versions:
>>>>>
>>>>> Flume 1.2.0 (Raw version; downloaded from
>>>>> http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
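
The behaviour discussed in this thread (a reader that stops after the first section of a concatenated bzip2 file, and the decompress-and-recompress workaround) can be reproduced outside Hadoop. The following is only a rough sketch using Python's standard bz2 module, not Hadoop's codec or Flume's writer. The sample records and the names batches, concatenated and decompress_all are made up for illustration. It assumes that each Flume batch ends up as one complete bzip2 stream appended to the file, which is how Mike describes the CompressedStream case.

```python
import bz2

# Simulate a writer that finishes the compressor at every batch boundary:
# each batch becomes one complete bzip2 stream, appended back to back.
batches = [
    b"record-1\nrecord-2\n",
    b"record-3\nrecord-4\n",
    b"record-5\n",
]
concatenated = b"".join(bz2.compress(batch) for batch in batches)

# A single-stream decoder stops at the end of the first stream.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)
leftover = single.unused_data  # remaining streams, never decoded


def decompress_all(data):
    """Decode every concatenated bzip2 stream in data, one after another."""
    parts = []
    while data:
        decoder = bz2.BZ2Decompressor()
        parts.append(decoder.decompress(data))
        data = decoder.unused_data
    return b"".join(parts)


everything = decompress_all(concatenated)

# Workaround from the thread: recompress the full content as ONE stream,
# which a single-stream reader can then consume completely.
recompressed = bz2.compress(everything)

print(len(first_only.splitlines()), "records seen by single-stream reader")
print(len(everything.splitlines()), "records seen by multi-stream reader")
```

Here the single-stream decoder returns only the first batch and leaves the rest in unused_data, which matches the symptom of a job seeing only the first few records. Recompressing everything as one stream corresponds to step 3 of the original report.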