Avro >> mail # user >> Synchronization Markers


Josh Spiegel 2013-01-23, 21:09
Martin Kleppmann 2013-01-24, 12:47
Re: Synchronization Markers
Ok, makes sense.  Thanks for the answer.
On Thu, Jan 24, 2013 at 4:47 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:

> 1. Because if it was predictable, it would inevitably appear in the
> actual data sometimes (e.g. imagine the Avro documentation, stating
> what the sync marker is, is downloaded by a web crawler and stored in
> an Avro data file; then the sync marker will appear in the actual
> data). Data may come from malicious sources; making the marker random
> makes it unfeasible to exploit.
>
> 2. Possibly, but extremely unlikely. The probability of a given random
> 16-byte string appearing in a petabyte of (uniformly distributed) data
> is about 10^-23. It's more likely that your data center is wiped out
> by a meteorite
> (http://preshing.com/20110504/hash-collision-probabilities).
>
> 3. If the sync marker appears in your data, it only breaks reading the
> file if you happen to also seek to that place in the file. If you just
> read over it sequentially, nothing happens.
>
> Martin
>
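The estimate in point 2 can be checked with a short calculation. This is a minimal sketch, assuming "a petabyte" means 10^15 bytes, that the data is uniformly random, and that a simple union bound over all starting positions is good enough; the function name and constants are illustrative and are not part of Avro.

```python
from fractions import Fraction

MARKER_BYTES = 16      # Avro sync markers are 16 bytes long
DATA_BYTES = 10**15    # assumption: "a petabyte" taken as 10^15 bytes


def marker_collision_probability(data_bytes: int,
                                 marker_bytes: int = MARKER_BYTES) -> float:
    """Union-bound estimate that one fixed marker appears somewhere
    in `data_bytes` bytes of uniformly random data."""
    # Number of byte offsets at which a full marker could start.
    positions = max(data_bytes - marker_bytes + 1, 0)
    # Chance that a given offset matches the marker exactly.
    per_position = Fraction(1, 256 ** marker_bytes)
    # The union bound can exceed 1 for absurdly large inputs, so cap it.
    return float(min(positions * per_position, 1))


if __name__ == "__main__":
    print(f"{marker_collision_probability(DATA_BYTES):.1e}")
```

Under these assumptions the result is a few times 10^-24, which is in the same ballpark as the figure quoted above. Real data is not uniformly random, so this is only an order-of-magnitude argument.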
> On 23 January 2013 21:09, Josh Spiegel <[EMAIL PROTECTED]> wrote:
> > As I understand it, Avro container files contain synchronization markers
> > every so often to support splitting the file.  See:
> >
> > https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
> >
> > (1) Why isn't the synchronization marker the same for every container
> > file? (i.e. what is the point of generating it randomly every time)
> >
> > (2) Is it possible, at least in theory, for naturally occurring data to
> > contain bytes that match the sync marker? If so, would this break
> > synchronization?
> >
> > Thanks,
> > Josh
>
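Point 3 of the reply (a marker inside the data only matters when a reader seeks and resynchronizes) can be illustrated with a toy example. This is a sketch only: the block layout used here (a 4-byte big-endian length, the payload, then the 16-byte marker) is a simplified stand-in and is NOT the actual Avro container format, and all function names are hypothetical. The marker is fixed for reproducibility, whereas Avro generates it randomly per file.

```python
import struct

MARKER_LEN = 16


def write_blocks(payloads, marker):
    """Toy layout (not Avro's): [4-byte length][payload][marker] per block."""
    out = bytearray()
    for payload in payloads:
        out += struct.pack(">I", len(payload)) + payload + marker
    return bytes(out)


def read_sequential(buf, marker, start=0):
    """Read blocks using the length prefix; the marker is only verified."""
    pos = start
    blocks = []
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        payload = buf[pos:pos + length]
        pos += length
        if buf[pos:pos + MARKER_LEN] != marker:
            raise ValueError(f"sync marker mismatch at offset {pos}")
        pos += MARKER_LEN
        blocks.append(payload)
    return blocks


def read_from_offset(buf, marker, offset):
    """Seek to an arbitrary offset, scan forward to the next marker,
    and resume block-by-block reading right after it."""
    hit = buf.find(marker, offset)
    if hit == -1:
        return []
    return read_sequential(buf, marker, hit + MARKER_LEN)


if __name__ == "__main__":
    marker = bytes(range(16))  # fixed here only to keep the demo reproducible
    payloads = [b"first", b"xx" + marker + b"yy", b"third"]
    buf = write_blocks(payloads, marker)

    # Sequential reading skips over the embedded marker via the length prefix.
    print(read_sequential(buf, marker) == payloads)

    # Seeking into the second block lands on the embedded marker first,
    # so the reader resumes at a misaligned position and fails.
    try:
        read_from_offset(buf, marker, 26)
    except ValueError as exc:
        print("resync failed:", exc)
```

In this toy format the sequential reader never searches for the marker, so the copy embedded in the second payload is harmless. Only the reader that seeks into the middle of that block mistakes the embedded copy for a block boundary.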