Hadoop >> mail # user >> Losing Records with Block Compressed Sequence File

Re: Losing Records with Block Compressed Sequence File

2011/1/21 David Sinclair <[EMAIL PROTECTED]>:
> Hi, I am seeing an odd problem when writing block compressed sequence files.
> If I write 400,000 records into a sequence file w/o compression, all 400K
> end up in the file. If I write with block, regardless if it is bz2 or
> deflate, I start losing records. Not a ton, but a couple hundred.

How big is the output file?
How many splits are created?

> Here are the exact numbers
> bz2      399,734
> deflate  399,770
> none     400,000
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
> anyone ever see this behavior?
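With block compression, the SequenceFile writer buffers appended records in memory until `io.seqfile.compress.blocksize` worth of data has accumulated, then compresses and writes the whole block at once. A common way to lose "a couple hundred" records is for the writer not to be closed cleanly, since the final, partially filled block only reaches the file on `close()`. The following is a minimal sketch of a block-compressed write loop (the key/value types, output path, and record payload are hypothetical, not taken from the original report), using the `SequenceFile.createWriter` overload current at the time:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedWriteDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 4096);              // 4K, as in the report
        conf.setInt("io.seqfile.compress.blocksize", 1 << 20); // 1MB, as in the report

        FileSystem fs = FileSystem.getLocal(conf);
        Path out = new Path("/tmp/demo.seq");                  // hypothetical output path

        // CompressionType.BLOCK buffers records and compresses them a block
        // at a time; DefaultCodec is the deflate-based codec.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
            IntWritable key = new IntWritable();
            Text value = new Text("some record payload");
            for (int i = 0; i < 400000; i++) {
                key.set(i);
                writer.append(key, value);                     // buffered, not yet on disk
            }
        } finally {
            // Without this, the records still sitting in the last unfinished
            // block are silently dropped.
            writer.close();
        }
    }
}
```

If the writer in the original setup is closed correctly and records still go missing, the problem is more likely on the read side, e.g. at split boundaries as discussed below.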

I've been working on HADOOP-7076 which makes Gzip Splittable (feature
is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all the "seams" are handled correctly (no missing records
and no duplicated records).
A few days ago I ran my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.

Perhaps my unit test is buggy, or perhaps you and I have independently
found something that should be reported as a bug.
Best regards,

Niels Basjes