Re: Losing Records with Block Compressed Sequence File
Hi,

2011/1/21 David Sinclair <[EMAIL PROTECTED]>:
> Hi, I am seeing an odd problem when writing block compressed sequence files.
> If I write 400,000 records into a sequence file w/o compression, all 400K
> end up in the file. If I write with block compression, regardless of whether
> it is bz2 or deflate, I start losing records. Not a ton, but a couple hundred.

How big is the output file?
How many splits are created?

> Here are the exact numbers
>
> bz2      399,734
> deflate  399,770
> none     400,000
>
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
>
> anyone ever see this behavior?
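
For reference, here is a minimal sketch of the kind of writer setup I would
expect with those settings (the path, key/value types, and record contents
are made up for illustration; substitute your actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 4 * 1024);              // 4K
    conf.setInt("io.seqfile.compress.blocksize", 1024 * 1024); // 1MB

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/block-compressed.seq");         // hypothetical path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path,
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK,
        new DefaultCodec());                                   // deflate; use BZip2Codec for bz2
    try {
      for (long i = 0; i < 400000; i++) {
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      // close() flushes the last buffered compressed block; records still
      // sitting in that block are lost if the writer is never closed.
      writer.close();
    }
  }
}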

I've been working on HADOOP-7076, which makes gzip splittable (the
feature is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all "seams" are accurate (no missing records and no
duplicate records).
A few days ago I tried my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.

Perhaps my unit test is buggy, or perhaps you and I have independently
found something that should be reported as a bug.
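
As a simpler cross-check (a rough sketch, not the actual HADOOP-7076
test), you could read the file back sequentially and count the records,
so a missing couple of hundred shows up immediately:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CountSequenceFileRecords {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);   // path of the sequence file to check

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    LongWritable key = new LongWritable();
    Text value = new Text();
    long count = 0;
    try {
      while (reader.next(key, value)) {
        count++;
      }
    } finally {
      reader.close();
    }
    // Compare this against the 400,000 records that were written.
    System.out.println("records read: " + count);
  }
}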
--
Best regards,

Niels Basjes