HDFS >> mail # dev >> Is Hadoop SequenceFile binary safe?


Is Hadoop SequenceFile binary safe?
Hi,

I am learning Hadoop. I have been reading SequenceFile.java in the
hadoop-1.0.4 source code, and I found the sync(long position) method, which
locates a "sync marker" (a 16-byte MD5 hash generated when the file is
created) in a SequenceFile when MapReduce divides the file into splits.

/** Seek to the next sync mark past a given position. */
public synchronized void sync(long position) throws IOException {
  if (position+SYNC_SIZE >= end) {
    seek(end);
    return;
  }

  try {
    seek(position+4);                         // skip escape
    in.readFully(syncCheck);
    int syncLen = sync.length;
    for (int i = 0; in.getPos() < end; i++) {
      int j = 0;
      for (; j < syncLen; j++) {
        if (sync[j] != syncCheck[(i+j)%syncLen])
          break;
      }
      if (j == syncLen) {
        in.seek(in.getPos() - SYNC_SIZE);     // position before sync
        return;
      }
      syncCheck[i%syncLen] = in.readByte();
    }
  } catch (ChecksumException e) {             // checksum failure
    handleChecksumException(e);
  }
}

As I understand it, this code simply scans forward for a byte sequence that
is identical to the "sync marker".
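
To make my reading of the loop concrete, here is a minimal sketch of the same
circular-buffer comparison over a plain byte array. It is my own
simplification, not the Hadoop code: the class name SyncScanSketch and the
method findMarker are hypothetical, and stream handling, the 4-byte escape,
and checksum errors are left out.

```java
import java.util.Arrays;

public class SyncScanSketch {
    /**
     * Returns the offset of the first occurrence of marker in data at or
     * after start, or -1 if there is none. The window array plays the role
     * of syncCheck: it is overwritten one byte at a time and compared
     * against the marker starting from a rotating index.
     */
    static int findMarker(byte[] data, byte[] marker, int start) {
        int n = marker.length;
        if (start < 0 || start + n > data.length) {
            return -1;
        }
        byte[] window = Arrays.copyOfRange(data, start, start + n);
        int pos = start + n;                  // index of the next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            for (; j < n; j++) {
                if (marker[j] != window[(i + j) % n]) {
                    break;
                }
            }
            if (j == n) {
                return pos - n;               // offset where the match begins
            }
            if (pos >= data.length) {
                return -1;                    // ran out of input
            }
            window[i % n] = data[pos++];      // replace the oldest byte
        }
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];
        for (int i = 0; i < marker.length; i++) {
            marker[i] = (byte) (i + 1);       // arbitrary 16-byte marker
        }
        byte[] data = new byte[8 + 16 + 4];   // 8 filler bytes, marker, 4 filler bytes
        System.arraycopy(marker, 0, data, 8, marker.length);
        System.out.println(findMarker(data, marker, 0)); // prints 8
    }
}
```

Nothing in this comparison looks at what kind of bytes it is reading; it only
checks whether 16 consecutive bytes equal the marker.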

My question:
Suppose the data in a SequenceFile happens to contain a 16-byte sequence
identical to the "sync marker". Will the code above mistakenly treat those
16 bytes as a "sync marker", so that the SequenceFile is no longer parsed
correctly?
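
The situation I have in mind can be reproduced with a small sketch. This is a
toy byte layout that I made up for illustration, not the real SequenceFile
format: the class FalseSyncSketch, the methods buildStream and indexOf, and
the marker bytes are all hypothetical, and I assume the 4 bytes that sync()
skips as the "escape" are 0xFFFFFFFF.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class FalseSyncSketch {
    /** Naive scan: first offset at or after start where marker occurs, or -1. */
    static int indexOf(byte[] data, byte[] marker, int start) {
        for (int off = Math.max(start, 0); off + marker.length <= data.length; off++) {
            byte[] candidate = Arrays.copyOfRange(data, off, off + marker.length);
            if (Arrays.equals(candidate, marker)) {
                return off;
            }
        }
        return -1;
    }

    /** Toy stream: payload that embeds the marker bytes, then a genuine sync entry. */
    static byte[] buildStream(byte[] marker) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[10], 0, 10);               // offsets 0..9: ordinary payload
        out.write(marker, 0, marker.length);          // offsets 10..25: payload equal to the marker
        out.write(new byte[6], 0, 6);                 // offsets 26..31: more payload
        out.write(new byte[] {-1, -1, -1, -1}, 0, 4); // offsets 32..35: assumed escape bytes
        out.write(marker, 0, marker.length);          // offsets 36..51: the genuine marker
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];
        for (int i = 0; i < marker.length; i++) {
            marker[i] = (byte) (0xA0 + i);            // arbitrary stand-in for the MD5 bytes
        }
        byte[] stream = buildStream(marker);
        System.out.println(indexOf(stream, marker, 0));  // prints 10 (inside the payload)
        System.out.println(indexOf(stream, marker, 11)); // prints 36 (the genuine marker)
    }
}
```

In this toy layout, a scan that compares only the 16 marker bytes stops at
offset 10, inside the payload, instead of at the genuine marker at offset 36.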

I cannot find any "escape" operation applied to either the data or the sync
marker. So how can SequenceFile be binary safe? Am I missing something?
Please correct me if I am wrong.

Thanks!

Shawn
Chris Douglas 2013-04-29, 16:50
Hs 2013-05-02, 10:26
Colin McCabe 2013-05-02, 18:29
Doug Cutting 2013-05-09, 16:16