Thanks for the response, Abdul.

   So, the bzip2 file in question is _kind of_ a concatenation of
multiple bzip2 files.  It's not concatenated using cat a.bz2 b.bz2 >
yourFile.bz2, but it was created using pbzip2 (pbzip2 v1.0.2 running on
CentOS 5.2, installed from the EPEL repository).  My understanding is
that pbzip2 does roughly what you're describing and concatenates in some
fashion.

   I created a simple test case that reproduces the behavior.  I created
a file using the following perl script:

for ($i = 0; $i < 20000000; $i++) {
  print "Line $i\n";
}

    I then created two different bzip2 files, one with bzip2 and one
with pbzip2.  They do have different sizes:

21994233 simple.bzip2.txt.bz2
21999416 simple.pbzip2.txt.bz2

    They do decompress to give the same output file:
bunzip2 -c simple.bzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38  -

bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38  -
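
A matching md5sum only shows that the decompressed bytes agree; it says nothing about how each file is laid out internally.  A minimal Python sketch of one way to check that is below.  It assumes Python 3's standard bz2 module, the helper name count_streams is hypothetical, and small in-memory data stands in for the real files.  It holds everything in memory, so it is only meant for small inputs.

```python
import bz2


def count_streams(data: bytes) -> int:
    """Count back-to-back bzip2 streams in `data` (hypothetical helper)."""
    streams = 0
    while data:
        decoder = bz2.BZ2Decompressor()
        decoder.decompress(data)      # decodes up to the end of one stream
        if not decoder.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decoder.unused_data    # bytes after the end-of-stream marker
    return streams


# One stream, as a single compression pass would produce.
single = bz2.compress(b"Line 0\nLine 1\n")
# Stand-in for a multi-stream file: two independently compressed pieces.
multi = bz2.compress(b"Line 0\n") + bz2.compress(b"Line 1\n")

print(count_streams(single), count_streams(multi))
```

If the pbzip2 file really is several streams written back to back, a check like this would report more than one stream for it even though the decompressed output is identical.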

   Running both through the simple line count MapReduce job, I get the
same behavior as before: the bzip2 file correctly yields 20,000,000
records, but the pbzip2-generated file only processes the first block
(82,829 records).

   So, what you're saying about multiple end of stream markers makes
sense.  I will say it would be very beneficial to be able to use
pbzip2-generated files as compressed Hadoop input files.  Using pbzip2
can greatly reduce the time required to bzip2-compress files, and it
seems to generate a valid bzip2 file (at least bunzip2 decompresses it
correctly).
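
The end-of-stream explanation can be reproduced outside Hadoop.  The sketch below is an illustration only, not Hadoop's decompressor code: it assumes Python 3.3 or later and uses three small, independently compressed pieces as a stand-in for what pbzip2 is assumed to write.

```python
import bz2

# Stand-in data: six lines compressed as three separate streams, two lines
# each, then written back to back.
lines = [f"Line {i}\n".encode() for i in range(6)]
multi = b"".join(
    bz2.compress(b"".join(lines[i:i + 2])) for i in range(0, 6, 2)
)

# A single decompressor object stops at the first end-of-stream marker.
first_only = bz2.BZ2Decompressor().decompress(multi)

# bz2.decompress() continues across stream boundaries (Python 3.3+).
everything = bz2.decompress(multi)

print(first_only.count(b"\n"), everything.count(b"\n"))
```

Here the first count covers only the two lines in the first stream, while the second covers all six.  That is the same pattern as the 82,829 versus 20,000,000 records above.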

   Thank you


-----Original Message-----
From: Abdul Qadeer [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 04, 2008 12:07 PM
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0


As you said, you suspect that only one bzip2 block is being decompressed
and used; is your bzip2 file the concatenation of multiple bzip2 files
(are you doing something like cat a.bz2 b.bz2 c.bz2 > yourFile.bz2 ?)  In
such a case, there will be many bzip2 end of stream markers in a single file
and the bzip2 decompressor will stop on encountering the first end of block
marker when in fact, the stream has more data in it.

If this is not the case, then bzip2 should work as gzip or plaintext are
Currently only one mapper gets the whole file (just like gzip and
for bzip is being added in HADOOP-4012, as Alex mentioned).  The
get the uncompressed data and does rest of the things same as in the
of gzip or plaintext.  So can you provide your bzip2 compressed file?
uploading it somewhere and sending in the link)  I will look into this
Abdul Qadeer

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins
> tests  between a .gz file which seems to do what I expect, and the
> file .bz2 which doesn't behave as I expect.
>    I have the same file stored in hadoop compressed as both bzip2 and
> gz formats.  The uncompressed file size is 660,841,894 bytes.
> the files they both seem to be valid archives of the exact same file.
> /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
> 2c82901170f44245fb04d24ad4746e38  -
> /usr/local/hadoop/bin/hadoop dfs -cat
> | gunzip -c | md5sum
> 2c82901170f44245fb04d24ad4746e38  -
>    Given the md5 sums match it seems like the files are the same and
> uncompress correctly.
>    Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>  Expected Results:
>  /usr/local/hadoop/bin/hadoop dfs -cat
