Please note that I applied patch 4012 version 4 to release 0.19.0 and
re-ran the tests, with mixed results.  My simple test (20 million simple
records) generated the same correct results for both pbzip2 and bzip2,
which is great.  However, a larger test case (described in more detail
below) showed a discrepancy in the results when compared to the gzip and
plain text files.  bzip2/gzip/text had all produced the same results
pre-patch.  The bzip2 run had 3 additional records compared to the
text/gzip runs post-patch.

   The following are timings and results for a sample dataset running a
simple MapReduce job ( MapReduce version of unix 'wc' ).  Note the
dataset consists of 11 files that are a total of 27G uncompressed, 4.5G
gzip compressed and 3.1G bzip2 compressed.  All 3 datasets are identical
and produce the same md5sum.  Also the bzip2 files in the test were
compressed using bzip2, not pbzip2.

Release 0.19.0 Pre patch:
   Type    Timing   MapReduce Result  
   Gzip  - 4m55s    323,234,098
   Bzip2 - 16m14s   323,234,098
   Txt   - 6m23s    323,234,098

Release 0.19.0 Post patch 4012 Version 4 (w/ results):
   Type    Timing   MapReduce Result
   Gzip  - 5m14s    332,234,098
   Bzip2 - 9m36s    332,234,101
   Txt   - 6m28s    332,234,098

   Both Gzip/Txt timings were about the same between runs.  Bzip2
elapsed time was reduced significantly.

   So, generally positive, although it looks like there might be an
edge case causing slightly different results.  I'll work on putting
together a test case of manageable size that reproduces the result.
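
   One way to shrink the problem before involving Hadoop is to confirm
locally that the text, gzip and bzip2 copies of a candidate file hold
the same number of records.  The following is only a sketch under
assumptions: it uses Python's standard gzip and bz2 modules, picks the
decoder from the file extension, treats each line as one record, and
the function names count_records and compare_counts are made up for
illustration.

```python
import bz2
import gzip


def count_records(path):
    """Count lines (one record per line) in a plain, .gz, or .bz2 file."""
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".bz2"):
        opener = bz2.open
    else:
        opener = open
    count = 0
    # Binary mode avoids any decoding differences between the formats.
    with opener(path, "rb") as stream:
        for _ in stream:
            count += 1
    return count


def compare_counts(paths):
    """Return ({path: line count}, True if every count is identical)."""
    counts = {path: count_records(path) for path in paths}
    return counts, len(set(counts.values())) == 1
```

   If the local counts agree but the MapReduce counts do not, the
extra records are more likely to come from how the splits are read than
from the files themselves.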

   Thanks again for the help.


-----Original Message-----
From: Andy Sautins [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 04, 2008 2:29 PM
Subject: RE: Strange behavior with bzip2 input files w/release 0.19.0
   Thanks Abdul.  Very exciting that hadoop will soon be able to handle
not only pbzip2 files but also be able to split bzip2 files.  

   I will apply the patch and report back.

   Thank you


-----Original Message-----
From: Abdul Qadeer [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 04, 2008 1:49 PM
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0


As was mentioned earlier, splitting support is being added for bzip2,
and the patch is actually under review now.  I think pbzip2-generated
files work fine with it because the split algorithm finds the next
start-of-block marker and does not use the end-of-stream marker.  We
instead use the physical end of file to know when the stream ends.
So if you look at https://issues.apache.org/jira/browse/HADOOP-4012
you can download the version 4 patch, apply it to the Hadoop code, and
see if it works for you, or you can wait for the review process to
finish so that the code becomes part of standard Hadoop.  You can add
yourself as a watcher on JIRA 4012 so that you know when it's done.
Please let me know if pbzip2-generated files do not work even with that
code.
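
The marker search described above can be illustrated with a small
sketch.  This is not the HADOOP-4012 code, only an illustration under
assumptions: it relies on the bzip2 format's 48-bit block header magic
(0x314159265359), scans bit by bit because blocks are not byte-aligned,
and stops at the physical end of the data rather than at an
end-of-stream marker.  The function name find_block_starts is
hypothetical.

```python
BLOCK_MAGIC = 0x314159265359  # 48-bit magic that begins every bzip2 block
MASK_48 = (1 << 48) - 1


def find_block_starts(data, start_bit=0):
    """Yield the bit offset of each block magic found at or after start_bit.

    Scanning simply runs to the end of `data`; end-of-stream markers are
    ignored, so concatenated streams are handled the same way as a single one.
    """
    window = 0      # the most recent 48 bits read
    bits_seen = 0   # number of bits shifted into the window so far
    for byte_index in range(start_bit // 8, len(data)):
        byte = data[byte_index]
        for shift in range(7, -1, -1):  # bzip2 packs bits MSB first
            pos = byte_index * 8 + (7 - shift)
            if pos < start_bit:
                continue
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                yield pos - 47  # offset of the first bit of the magic
```

A reader for a split would call this with start_bit set to the first
bit of its split and begin decoding at the first offset returned.  The
same 48-bit pattern can in principle also occur inside compressed data,
so a real implementation has to tolerate a false match.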

Thank you,
Abdul Qadeer
On Thu, Dec 4, 2008 at 11:46 AM, Andy Sautins