Re: Losing Records with Block Compressed Sequence File
Niels Basjes 2011-01-22, 12:45
2011/1/21 David Sinclair <[EMAIL PROTECTED]>:
> Hi, I am seeing an odd problem when writing block compressed sequence files.
> If I write 400,000 records into a sequence file w/o compression, all 400K
> end up in the file. If I write with block compression, regardless of whether
> it is bz2 or deflate, I start losing records. Not a ton, but a couple hundred.
How big is the output file?
How many splits are created?
> Here are the exact numbers:
> bz2: 399,734
> deflate: 399,770
> none: 400,000
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
> Anyone ever seen this behavior?
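The write-then-count check David describes can be sketched generically. The snippet below is plain Python over a gzip stream, not Hadoop's SequenceFile API; the record format and names are illustrative only. The point is the invariant being tested: every record written through a compressed stream must come back out when the stream is read.

```python
# Hypothetical stand-in for the reported check: write N records through a
# compressed stream, read them back, and count. Plain Python gzip, NOT
# Hadoop's SequenceFile API; a mismatch here would indicate records lost
# between writer flushes and compression-block boundaries.
import gzip
import io

N_RECORDS = 400_000

# Write: one newline-terminated record per line, compressed.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for i in range(N_RECORDS):
        gz.write(f"record-{i}\n".encode())

# Read back and count.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    count = sum(1 for _ in gz)

print(count)  # expected: 400000
```

With a correct writer and reader the count matches exactly; the numbers above (399,734 and 399,770 out of 400,000) are the signature of records being dropped at block or flush boundaries rather than random corruption.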
I've been working on HADOOP-7076, which makes Gzip splittable (the
feature is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all "seams" are accurate (no missing records and no duplicates).
A few days ago I ran my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.
Perhaps my unit test is buggy, perhaps you and I have independently
found something that should be reported as a bug.
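The seam check Niels describes can be sketched in miniature: split a record stream at every possible byte offset and verify each record is read exactly once across the two halves. The sketch below is plain Python over newline-delimited text, not Hadoop's input-split code; the skip-the-partial-first-record convention mirrors how text splits are conventionally handled (a record belongs to the split in which it begins).

```python
# Sketch of a seam check: for every split point, read the left and right
# splits independently and verify the record counts add up. Illustrative
# only; this is NOT Hadoop's RecordReader, just the same boundary rule.
data = b"".join(f"rec{i}\n".encode() for i in range(50))

def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return records whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Skip the partial record: if byte start-1 is not a newline, the
        # record containing `start` began in the previous split.
        nl = data.find(b"\n", start - 1)
        pos = nl + 1 if nl != -1 else len(data)
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        records.append(data[pos:nl])  # read past `end` to finish the record
        pos = nl + 1
    return records

total = data.count(b"\n")
for split_point in range(1, len(data)):
    left = read_split(data, 0, split_point)
    right = read_split(data, split_point, len(data))
    # No seam may lose or duplicate a record.
    assert len(left) + len(right) == total, split_point
```

A buggy seam handler fails this for the split points that land inside a compression block, which is consistent with both reports: only a few hundred records out of 400,000 vanish, i.e. roughly one per boundary.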