Hadoop user mailing list - Losing Records with Block Compressed Sequence File


Re: Losing Records with Block Compressed Sequence File
Niels Basjes 2011-01-22, 12:45
Hi,

2011/1/21 David Sinclair <[EMAIL PROTECTED]>:
> Hi, I am seeing an odd problem when writing block-compressed sequence files.
> If I write 400,000 records into a sequence file without compression, all 400K
> end up in the file. If I write with block compression, regardless of whether
> it is bz2 or deflate, I start losing records. Not a ton, but a couple hundred.

How big is the output file?
How many splits are created?

> Here are the exact numbers
>
> bz2      399,734
> deflate  399,770
> none     400,000
>
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
>
> Has anyone ever seen this behavior?
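
For context, a minimal sketch of the kind of block-compressed SequenceFile
writer setup described above, using the settings quoted; the output path,
key/value types, and record contents are illustrative assumptions, not taken
from the original report:

    // Sketch only: write 400,000 records with CompressionType.BLOCK.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class BlockCompressedWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 4096);              // 4K, as in the report
        conf.setInt("io.seqfile.compress.blocksize", 1048576); // 1MB, as in the report

        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/block-compressed.seq");      // assumed output path

        // DefaultCodec is deflate; use new BZip2Codec() for the bz2 case.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class,
            CompressionType.BLOCK, new DefaultCodec());
        try {
          for (long i = 0; i < 400000; i++) {
            writer.append(new LongWritable(i), new Text("record-" + i));
          }
        } finally {
          // Block compression buffers records in memory until the block fills;
          // a missing close()/sync() can silently drop the last buffered block.
          writer.close();
        }
      }
    }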

I've been working on HADOOP-7076, which makes Gzip splittable (the
feature is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all "seams" are accurate (no missing records and no
duplicate records).
A few days ago I tried my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.
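
A rough sketch of that kind of seam check, assuming line-based text input and
the old mapred API; the input path, split count, and the use of line contents
as record identity are assumptions for illustration, not the actual
HADOOP-7076 test:

    // Sketch only: read every split and look for missing or duplicated records.
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SeamCheckSketch {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        FileInputFormat.setInputPaths(job, new Path("/tmp/input.txt.bz2")); // assumed input

        TextInputFormat inputFormat = new TextInputFormat();
        inputFormat.configure(job);

        Set<String> seen = new HashSet<String>(); // assumes every record/line is unique
        long total = 0;
        // Ask for several splits so the seams between them get exercised.
        for (InputSplit split : inputFormat.getSplits(job, 10)) {
          RecordReader<LongWritable, Text> reader =
              inputFormat.getRecordReader(split, job, Reporter.NULL);
          LongWritable key = reader.createKey();
          Text value = reader.createValue();
          while (reader.next(key, value)) {
            total++;
            if (!seen.add(value.toString())) {
              System.err.println("Duplicate record at a seam: " + value);
            }
          }
          reader.close();
        }
        System.out.println("Records read across all splits: " + total);
      }
    }

Comparing the total and the distinct count against the number of records
originally written flags both missing and duplicated records at the split
boundaries.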

Perhaps my unit test is buggy, or perhaps you and I have independently
found something that should be reported as a bug.
--
Best regards,

Niels Basjes