HDFS >> mail # dev >> Is Hadoop SequenceFile binary safe?


Re: Is Hadoop SequenceFile binary safe?
You're not missing anything, but the probability of random bytes
colliding with a 16-byte (I thought it was 20?) marker is vanishingly small. -C
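
To put a rough number on "vanishingly small", here is a back-of-the-envelope sketch in Java. It assumes the record data is uniformly random and independent of the marker, which real data is not, so treat the result as an order-of-magnitude estimate only. The class and method names are made up for illustration. (The quoted code skips a 4-byte escape before the 16-byte hash, which may be where the 20 comes from; the estimate below counts only the 16 hash bytes.)

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

// Hypothetical helper, not part of Hadoop.
final class SyncCollisionEstimate {
    // Marker length in bytes, as stated in the thread.
    static final int MARKER_BYTES = 16;

    /**
     * Expected number of accidental marker matches in dataBytes of
     * uniformly random data: about one candidate window per byte offset,
     * each matching with probability 2^-128.
     */
    static BigDecimal expectedFalseMatches(long dataBytes) {
        BigInteger possibleMarkers = BigInteger.ONE.shiftLeft(8 * MARKER_BYTES); // 2^128
        return new BigDecimal(dataBytes)
                .divide(new BigDecimal(possibleMarkers), MathContext.DECIMAL64);
    }

    public static void main(String[] args) {
        long onePebibyte = 1L << 50; // 2^50 bytes
        System.out.println(expectedFalseMatches(onePebibyte));
    }
}
```

For 2^50 bytes of data this works out to 2^-78, roughly 3e-24 expected accidental matches.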

On Sat, Apr 27, 2013 at 4:30 AM, Hs <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am learning Hadoop. I read SequenceFile.java in the hadoop-1.0.4 source
> code and found the sync(long position) method, which is used to find a
> "sync marker" (a 16-byte MD5 hash generated at file creation time) in a
> SequenceFile when the file is divided into splits in MapReduce.
>
> /** Seek to the next sync mark past a given position.*/
> public synchronized void sync(long position) throws IOException {
>   if (position+SYNC_SIZE >= end) {
>     seek(end);
>     return;
>   }
>
>   try {
>     seek(position+4);                         // skip escape
>     in.readFully(syncCheck);
>     int syncLen = sync.length;
>     for (int i = 0; in.getPos() < end; i++) {
>       int j = 0;
>       for (; j < syncLen; j++) {
>         if (sync[j] != syncCheck[(i+j)%syncLen])
>           break;
>       }
>       if (j == syncLen) {
>         in.seek(in.getPos() - SYNC_SIZE);     // position before sync
>         return;
>       }
>       syncCheck[i%syncLen] = in.readByte();
>     }
>   } catch (ChecksumException e) {             // checksum failure
>     handleChecksumException(e);
>   }
> }
>
> As I understand it, this code simply looks for a byte sequence that is
> identical to the "sync marker".
>
> My question:
> Suppose the data in a SequenceFile happens to contain a 16-byte sequence
> identical to the "sync marker". Won't the code above mistakenly treat
> those 16 bytes as a "sync marker", so that the SequenceFile is not
> parsed correctly?
>
> I can't find any "escape" operation applied to the data or the sync marker.
> So, how can SequenceFile be binary safe? Am I missing something? Please
> correct me if I am wrong.
>
> Thanks!
>
> Shawn
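
For anyone who wants to try the scenario from the question, here is a minimal, self-contained sketch of the same rolling-window scan over a plain byte array. It is not Hadoop's code: the class name, the method name, and the array-based input are assumptions made for illustration, and the end-of-data handling is simplified. It shows that a payload which happens to contain the marker bytes is reported as a match, since the scan only compares bytes.

```java
import java.util.Arrays;

// Hypothetical stand-in for the scan in the quoted sync() method.
final class SyncScanSketch {
    /**
     * Returns the offset of the first occurrence of marker in data at or
     * after from, or -1 if there is none. The window array is used as a
     * circular buffer, as syncCheck is in the quoted code.
     */
    static int findMarker(byte[] data, byte[] marker, int from) {
        int n = marker.length;
        if (from < 0 || from + n > data.length) {
            return -1;
        }
        byte[] window = Arrays.copyOfRange(data, from, from + n); // initial fill
        int pos = from + n;                                       // next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            for (; j < n; j++) {
                // Logical window starts at slot i % n and wraps around.
                if (marker[j] != window[(i + j) % n]) {
                    break;
                }
            }
            if (j == n) {
                return pos - n; // offset where the matching bytes begin
            }
            if (pos >= data.length) {
                return -1;      // ran out of data without a match
            }
            window[i % n] = data[pos++]; // overwrite the oldest byte
        }
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];
        for (int k = 0; k < marker.length; k++) {
            marker[k] = (byte) (k + 1);
        }
        // 10 zero bytes, then "payload" that equals the marker, then 5 zero bytes.
        byte[] data = new byte[31];
        System.arraycopy(marker, 0, data, 10, marker.length);
        System.out.println(findMarker(data, marker, 0)); // prints 10
    }
}
```

The scan cannot tell a real marker from identical payload bytes, so correctness rests on such a collision being improbable rather than impossible.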