Avro, mail # user - Synchronization Markers


Josh Spiegel 2013-01-23, 21:09
Martin Kleppmann 2013-01-24, 12:47
Re: Synchronization Markers
Josh Spiegel 2013-01-24, 15:24
Ok, makes sense.  Thanks for the answer.
On Thu, Jan 24, 2013 at 4:47 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:

> 1. Because if it were predictable, it would inevitably appear in the
> actual data sometimes (e.g. imagine that the Avro documentation, which
> states what the sync marker is, is downloaded by a web crawler and
> stored in an Avro data file; the sync marker would then appear in the
> actual data). Data may come from malicious sources; making the marker
> random makes this infeasible to exploit.
>
> 2. Possibly, but extremely unlikely. The probability of a given random
> 16-byte string appearing in a petabyte of (uniformly distributed) data
> is about 10^-23. It's more likely that your data center is wiped out
> by a meteorite (http://preshing.com/20110504/hash-collision-probabilities
> ).
>
> 3. If the sync marker appears in your data, it only breaks reading the
> file if you happen to also seek to that place in the file. If you just
> read over it sequentially, nothing happens.
>
> Martin
>
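
The estimate in point 2 can be sanity-checked with a short calculation. This is a minimal sketch, assuming "a petabyte" means 10^15 bytes and that the data is uniformly random, as the message stipulates. The function name `expected_marker_hits` is made up for illustration. It computes the expected number of offsets at which a fixed 16-byte marker would match, which for values this small is also an upper bound on the probability of at least one match.

```python
from fractions import Fraction

MARKER_BYTES = 16        # Avro sync markers are 16 bytes long
DATA_BYTES = 10**15      # assumption: "a petabyte" taken as 10^15 bytes


def expected_marker_hits(data_bytes: int, marker_bytes: int = MARKER_BYTES) -> Fraction:
    """Expected number of offsets where a fixed marker matches uniformly random bytes."""
    positions = data_bytes - marker_bytes + 1     # possible start offsets for a match
    # Each offset matches with probability 1 / 256^marker_bytes.
    return Fraction(positions, 256 ** marker_bytes)


hits = expected_marker_hits(DATA_BYTES)
print(f"{float(hits):.2e}")   # prints 2.94e-24
```

Under these assumptions the result is roughly 3 x 10^-24, which is in the same ballpark as the figure quoted in point 2.
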
> On 23 January 2013 21:09, Josh Spiegel <[EMAIL PROTECTED]> wrote:
> > As I understand it, Avro container files contain synchronization markers
> > every so often to support splitting the file.  See:
> >
> > https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
> >
> > (1) Why isn't the synchronization marker the same for every container
> > file? (i.e. what is the point of generating it randomly every time)
> >
> > (2) Is it possible, at least in theory, for naturally occurring data to
> > contain bytes that match the sync marker? If so, would this break
> > synchronization?
> >
> > Thanks,
> > Josh
>
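
The difference between reading sequentially and seeking (question 2 and Martin's point 3) can be illustrated with a toy container format. This is only a sketch under stated assumptions: the framing here (a 4-byte length prefix, the payload, then the marker) is invented for the example and is not Avro's actual encoding, which stores an object count and a byte size as variable-length longs before each block. The helper names `write_blocks`, `read_sequential` and `seek_to_next_block` are hypothetical.

```python
import os
import struct

SYNC_SIZE = 16


def write_blocks(payloads, sync):
    """Toy container: 4-byte big-endian length, payload, then the sync marker."""
    out = bytearray()
    for p in payloads:
        out += struct.pack(">I", len(p)) + p + sync
    return bytes(out)


def read_sequential(data, sync):
    """Read every block via its length prefix; payload bytes are never scanned."""
    pos, payloads = 0, []
    while pos < len(data):
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        payloads.append(data[pos:pos + size])
        pos += size
        if data[pos:pos + SYNC_SIZE] != sync:
            raise ValueError("sync marker mismatch")
        pos += SYNC_SIZE
    return payloads


def seek_to_next_block(data, sync, start):
    """Scan forward from an arbitrary offset; return the offset just past the marker."""
    hit = data.find(sync, start)
    return -1 if hit < 0 else hit + SYNC_SIZE


sync = os.urandom(SYNC_SIZE)
# The second payload deliberately contains the marker bytes.
payloads = [b"first", b"xx" + sync + b"yy", b"third"]
data = write_blocks(payloads, sync)

# True block boundaries, computed from the known payload sizes.
starts, pos = [], 0
for p in payloads:
    starts.append(pos)
    pos += 4 + len(p) + SYNC_SIZE
starts.append(pos)

# Sequential reading is unaffected by the marker inside the payload.
assert read_sequential(data, sync) == payloads

good = seek_to_next_block(data, sync, 0)          # scan from the start of the file
bad = seek_to_next_block(data, sync, starts[1])   # scan from the start of block 2
print(good in starts, bad in starts)              # prints: True False
```

In this toy format, a reader that resumes at the second offset lands inside a payload and would interpret data bytes as a length prefix. The sequential reader never scans payload bytes, so it is not affected.
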