Hi Mike
Thanks for the valuable inputs. That was driving us crazy.
However, in my tests this issue did not occur with the lzo/lzop
compression format (tested on Hadoop 1.0.3).
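
The concatenated-stream behaviour discussed in the quoted messages below
can be reproduced outside Hadoop. This is a minimal sketch, assuming only
Python's standard bz2 module; it is not Flume or Hadoop code, and the
function names are made up for illustration. It compresses two batches
separately, concatenates them, and compares a reader that stops after the
first stream with one that keeps going:

```python
import bz2

# Two independently compressed batches joined into one blob, analogous to
# a writer that finishes the compressor at every batch boundary.
batch1 = b"record-1\nrecord-2\n"
batch2 = b"record-3\nrecord-4\n"
concatenated = bz2.compress(batch1) + bz2.compress(batch2)


def read_first_stream_only(data: bytes) -> bytes:
    """Decode one bzip2 stream and ignore whatever follows it."""
    decompressor = bz2.BZ2Decompressor()
    # Bytes after the end of the first stream land in unused_data.
    return decompressor.decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decode every bzip2 stream in the blob, one after another."""
    chunks = []
    while data:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        data = decompressor.unused_data  # start of the next stream, if any
    return b"".join(chunks)


if __name__ == "__main__":
    print(len(read_first_stream_only(concatenated)), "bytes from first stream only")
    print(len(read_all_streams(concatenated)), "bytes from all streams")
```

The first reader returns only the records of the first batch, which matches
the symptom of a job seeing just the first part of the file.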
Regards,
Jagadish
On 11/02/2012 03:16 PM, Mike Percy wrote:
> Hi Jagadish,
> My understanding based on investigating this issue over the last
> couple of days is that MapReduce jobs will only read the first section
> of a concatenated bzip2 file. I believe you are correct that
> https://issues.apache.org/jira/browse/HADOOP-6852 is the only way to
> solve this issue, and that would only be for the Hadoop 2.0 line, I
> believe. I think that the Hadoop 1.x line would need to backport other
> patches from the 0.22 line, including
> https://issues.apache.org/jira/browse/HADOOP-6835, which may also be
> needed (my understanding is that that patch is already included in the
> 2.x line).
>
> I am aware of folks interested in trying to fix HADOOP-6852, however I
> have no ETA to give.
>
> From Flume's perspective, I know of no other way of ensuring
> durability using the hadoop-common APIs except for calling finalize in
> order to flush the compression buffer at each transaction/batch
> boundary, in order to call hflush()/hsync() with the fully written
> data. This results in concatenated compressed plain text files in the
> case of CompressedStream.
>
> Current workarounds include not using compression, reprocessing the
> compressed file as you mention, using a SequenceFile as a container,
> or using an Avro file as a container. The latter two are splittable
> and properly handle several compression codecs, including Snappy,
> which is a great way to go if you can do it.
>
> Regards,
> Mike
>
> On Fri, Nov 2, 2012 at 12:50 AM, Jagadish Bihani
> <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>>
> wrote:
>
> Hi
>
> Any inputs on this?
> It looks like a basic thing which, I guess, must have been handled
> in Flume.
>
>
>
> On 10/30/2012 10:31 PM, Jagadish Bihani wrote:
>> Text.
>>
>> A few updates on that:
>> -- It looks like some header issue.
>> -- When I copyToLocal the file and then copy it back to HDFS again,
>> the MapReduce job processes the file correctly.
>> Is it something related to
>> https://issues.apache.org/jira/browse/HADOOP-6852?
>>
>> Regards,
>> Jagadish
>>
>>
>> On 10/30/2012 09:15 PM, Brock Noland wrote:
>>> What kind of files is your sink writing out? Text, Sequence, etc?
>>>
>>> On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
>>> <[EMAIL PROTECTED]> <mailto:[EMAIL PROTECTED]> wrote:
>>>> Same thing happens even for gzip.
>>>>
>>>> Regards,
>>>> Jagadish
>>>>
>>>>
>>>> On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
>>>>> Hi
>>>>>
>>>>> I have a very peculiar scenario.
>>>>>
>>>>> 1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
>>>>> decompress it and read it. It has 0.2 million records.
>>>>> 2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and,
>>>>> surprisingly, it reads only the first 100 records.
>>>>> 3. I then decompress the same file on the local file system, use the
>>>>> Linux bzip2 command to compress it again, and copy it to HDFS.
>>>>> 4. Now I run the map-reduce job and this time it correctly processes
>>>>> all the records.
>>>>>
>>>>> I think the Flume agent writes compressed data to the HDFS file in
>>>>> batches, and somehow the bzip2 codec used by Hadoop reads only the
>>>>> first part of it.
>>>>>
>>>>> As a result, bz2 files generated by Flume, if used directly, can't be
>>>>> processed by a map-reduce job.
>>>>> Is there any solution to this?
>>>>>
>>>>> Any inputs about other compression formats?
>>>>>
>>>>> P.S.
>>>>> Versions:
>>>>>
>>>>> Flume 1.2.0 (Raw version; downloaded from
>>>>> http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)