Re: Flume bz2 issue while processing by a map reduce job
Hi Mike

Thanks for the valuable inputs. That was driving us crazy.
But I had tested that this issue doesn't happen with the lzo/lzop
compression formats (tested on Hadoop 1.0.3).


On 11/02/2012 03:16 PM, Mike Percy wrote:
> Hi Jagadish,
> My understanding, based on investigating this issue over the last
> couple of days, is that MapReduce jobs will only read the first section
> of a concatenated bzip2 file. I believe you are correct that
> https://issues.apache.org/jira/browse/HADOOP-6852 is the only way to
> solve this issue, and that fix would only be for the Hadoop 2.0 line.
> The Hadoop 1.x line would also need to backport other patches from
> the 0.22 line, including
> https://issues.apache.org/jira/browse/HADOOP-6835 (my understanding
> is that that patch is already included in the 2.x line).
> I am aware of folks interested in trying to fix HADOOP-6852, however I
> have no ETA to give.
> From Flume's perspective, I know of no other way of ensuring
> durability using the hadoop-common APIs except to call finish() to
> flush the compression buffer at each transaction/batch boundary, so
> that hflush()/hsync() can be called with the fully written data. This
> results in concatenated compressed plain text files in the case of
> CompressedStream.
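A rough sketch of that write pattern, using the Hadoop 1.x
hadoop-common APIs (the path and batch contents are assumptions, and
this is a simplification, not the actual Flume BucketWriter code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;

    public class BatchCompressedWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream fsOut = fs.create(new Path("/flume/events.bz2"));
            CompressionOutputStream cmpOut = new BZip2Codec().createOutputStream(fsOut);

            for (String batch : new String[] {"batch-1\n", "batch-2\n"}) {
                cmpOut.write(batch.getBytes("UTF-8"));
                // Durability: finish the compressor so every buffered
                // byte reaches the HDFS stream before syncing it...
                cmpOut.finish();
                fsOut.sync();  // Hadoop 1.x; hflush()/hsync() on 2.x
                // ...then reset so the next batch starts a fresh bzip2
                // stream. This is what yields a concatenated file.
                cmpOut.resetState();
            }
            cmpOut.close();
        }
    }

Each finish()/resetState() pair closes one bzip2 stream and opens
another, so a file that was synced N times contains N back-to-back
streams.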
> Current workarounds include not using compression, reprocessing the
> compressed file as you mention, using a SequenceFile as a container,
> or using an Avro file as a container. The latter two are splittable
> and properly handle several compression codecs, including Snappy,
> which is a great way to go if you can do it.
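For the SequenceFile route, a minimal sketch of writing a
block-compressed SequenceFile with Snappy (Hadoop 1.x API; the
key/value types and path are arbitrary choices, and the native Snappy
library must be available on the writing host):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SeqFileSnappyDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // BLOCK compression compresses batches of records inside the
            // container, so the file stays splittable for MapReduce even
            // though raw Snappy output itself is not splittable.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/flume/events.seq"),
                    LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, new SnappyCodec());
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("example event body"));
            writer.close();
        }
    }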
> Regards,
> Mike
> On Fri, Nov 2, 2012 at 12:50 AM, Jagadish Bihani wrote:
>     Hi
>     Any inputs on this?
>     It looks like a basic thing which, I guess, must already be
>     handled in Flume.
>     On 10/30/2012 10:31 PM, Jagadish Bihani wrote:
>>     Text.
>>     Few updates on that:
>>     -- It looks like some header issue.
>>     -- When I copyToLocal the file and then copy it back to HDFS, the
>>     map reduce job then processes the file correctly.
>>     Is it something related to
>>     https://issues.apache.org/jira/browse/HADOOP-6852?
>>     Regards,
>>     Jagadish
>>     On 10/30/2012 09:15 PM, Brock Noland wrote:
>>>     What kind of files is your sink writing out? Text, Sequence, etc?
>>>     On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
>>>     <[EMAIL PROTECTED]> wrote:
>>>>     Same thing happens even for gzip.
>>>>     Regards,
>>>>     Jagadish
>>>>     On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>>>>>     Hi
>>>>>     I have a very peculiar scenario.
>>>>>     1. My HDFS sink creates a bz2 file. The file is perfectly fine:
>>>>>     I can decompress it and read it, and it has 0.2 million records.
>>>>>     2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and,
>>>>>     surprisingly, it reads only the first 100 records.
>>>>>     3. I then decompress the same file on the local file system,
>>>>>     compress it again with the Linux bzip2 command, and copy it back
>>>>>     to HDFS.
>>>>>     4. Now I run the map-reduce job and this time it correctly
>>>>>     processes all the records.
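Steps 3 and 4 can also be done in one pass. A minimal sketch, assuming
Apache Commons Compress is available and using made-up file names:
reading with decompressConcatenated=true consumes every bzip2 stream
in the Flume output, and rewriting it produces the single-stream file
that the Hadoop 1.0.3 bzip2 reader processes fully:

    import java.io.*;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class RewriteBz2SingleStream {
        public static void main(String[] args) throws IOException {
            try (BZip2CompressorInputStream in = new BZip2CompressorInputStream(
                     new FileInputStream("flume-output.bz2"), true); // read ALL streams
                 BZip2CompressorOutputStream out = new BZip2CompressorOutputStream(
                     new FileOutputStream("single-stream.bz2"))) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);  // written back as ONE bzip2 stream
                }
            }
        }
    }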
>>>>>     I think the Flume agent writes compressed data to the HDFS file
>>>>>     in batches, and somehow the bzip2 codec used by Hadoop reads
>>>>>     only the first part of it. As a result, bz2 files generated by
>>>>>     Flume can't be fully processed by a map-reduce job if used
>>>>>     directly.
>>>>>     Is there any solution to it?
>>>>>     Any inputs about other compression formats?
>>>>>     P.S.
>>>>>     Versions:
>>>>>     Flume 1.2.0 (Raw version; downloaded from
>>>>>     http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)