HDFS >> mail # dev >> Is Hadoop SequenceFile binary safe?


Re: Is Hadoop SequenceFile binary safe?
You're not missing anything, but the probability of a 16-byte (I thought
it was 20?) collision with random bytes is vanishingly small. -C
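
For a rough sense of scale, here is a back-of-the-envelope estimate in Java. It assumes the payload bytes are uniformly random and independent of the marker, which is an assumption: data that copies or deliberately embeds the marker breaks it. The class name, the method name, and the 1 PiB figure are illustrative only and are not part of Hadoop.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class SyncCollisionEstimate {
    /**
     * Expected number of accidental marker matches in dataBytes of uniformly
     * random data, for a marker that is markerBytes long (hypothetical helper).
     */
    static BigDecimal expectedFalseMatches(long dataBytes, int markerBytes) {
        // A single window matches with probability 2^-(8 * markerBytes).
        BigInteger outcomes = BigInteger.ONE.shiftLeft(8 * markerBytes);
        // There are roughly dataBytes window positions; the last
        // markerBytes - 1 positions are ignored for simplicity.
        return new BigDecimal(dataBytes)
                .divide(new BigDecimal(outcomes), MathContext.DECIMAL64);
    }

    public static void main(String[] args) {
        long onePiB = 1L << 50; // illustrative data volume
        System.out.println(expectedFalseMatches(onePiB, 16));
    }
}
```

Under these assumptions, 1 PiB of random data gives about 2^50 / 2^128 = 2^-78, or roughly 3.3e-24 expected false matches. Counting 20 bytes instead of 16 would make the figure smaller still.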

On Sat, Apr 27, 2013 at 4:30 AM, Hs <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am learning Hadoop. I read SequenceFile.java in the hadoop-1.0.4 source
> code and found the sync(long position) method, which is used to find a
> "sync marker" (a 16-byte MD5 hash generated at file creation time) in a
> SequenceFile when it is divided into splits for MapReduce.
>
> /** Seek to the next sync mark past a given position. */
> public synchronized void sync(long position) throws IOException {
>   if (position+SYNC_SIZE >= end) {
>     seek(end);
>     return;
>   }
>
>   try {
>     seek(position+4);                         // skip escape
>     in.readFully(syncCheck);
>     int syncLen = sync.length;
>     for (int i = 0; in.getPos() < end; i++) {
>       int j = 0;
>       for (; j < syncLen; j++) {
>         if (sync[j] != syncCheck[(i+j)%syncLen])
>           break;
>       }
>       if (j == syncLen) {
>         in.seek(in.getPos() - SYNC_SIZE);     // position before sync
>         return;
>       }
>       syncCheck[i%syncLen] = in.readByte();
>     }
>   } catch (ChecksumException e) {             // checksum failure
>     handleChecksumException(e);
>   }
> }
>
> As I understand it, this code simply looks for a byte sequence that is
> identical to the sync marker.
>
> My question:
> Consider a situation where the data in a SequenceFile happens to contain a
> 16-byte sequence identical to the sync marker. Will the code above
> mistakenly treat those 16 bytes as a sync marker, so that the
> SequenceFile is not parsed correctly?
>
> I can't find any "escape" operation applied to the data or the sync marker.
> So how can SequenceFile be binary safe? Am I missing something? Please
> correct me if I am wrong.
>
> Thanks!
>
> Shawn
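
The scenario in the quoted question can be reproduced on an in-memory buffer. The following is a minimal sketch, not Hadoop code: it mirrors only the ring-buffer comparison from the quoted sync() method and leaves out stream I/O, checksums, and the 4-byte escape that the method skips. The class name, method name, marker bytes, and offsets are all made up for illustration.

```java
import java.util.Arrays;

public class SyncScanSketch {
    /**
     * Returns the offset of the first occurrence of marker in data at or
     * after from, or -1 if there is none (hypothetical helper).
     */
    static int findMarker(byte[] data, byte[] marker, int from) {
        int n = marker.length;
        if (from < 0 || from + n > data.length) {
            return -1;
        }
        // Initial window, analogous to in.readFully(syncCheck).
        byte[] window = Arrays.copyOfRange(data, from, from + n);
        int pos = from + n; // index of the next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            // window is a ring buffer whose oldest byte sits at index i % n.
            for (; j < n; j++) {
                if (marker[j] != window[(i + j) % n]) {
                    break;
                }
            }
            if (j == n) {
                return pos - n; // start of the matching bytes
            }
            if (pos >= data.length) {
                return -1;
            }
            window[i % n] = data[pos++]; // overwrite the oldest byte
        }
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];
        for (int k = 0; k < marker.length; k++) {
            marker[k] = (byte) (0xA0 + k); // arbitrary illustrative marker
        }
        byte[] data = new byte[64];
        System.arraycopy(marker, 0, data, 20, 16); // payload that happens to equal the marker
        System.arraycopy(marker, 0, data, 48, 16); // stand-in for a genuine sync point
        System.out.println(findMarker(data, marker, 0)); // prints 20
    }
}
```

The scan stops at offset 20, the payload bytes, rather than at the intended sync point at offset 48. A pure byte comparison cannot tell the two apart, which is why the argument for safety rests on how unlikely such a match is rather than on escaping.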