Josh Spiegel 2013-01-23, 21:09
Martin Kleppmann 2013-01-24, 12:47
Re: Synchronization Markers
Josh Spiegel 2013-01-24, 15:24
Ok, makes sense. Thanks for the answer.
On Thu, Jan 24, 2013 at 4:47 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
> 1. Because if it was predictable, it would inevitably appear in the
> actual data sometimes (e.g. imagine the Avro documentation, stating
> what the sync marker is, is downloaded by a web crawler and stored in
> an Avro data file; then the sync marker will appear in the actual
> data). Data may come from malicious sources; making the marker random
> makes it unfeasible to exploit.
> 2. Possibly, but extremely unlikely. The probability of a given random
> 16-byte string appearing in a petabyte of (uniformly distributed) data
> is about 10^-23. It's more likely that your data center is wiped out
> by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).
> 3. If the sync marker appears in your data, it only breaks reading the
> file if you happen to also seek to that place in the file. If you just
> read over it sequentially, nothing happens.
> On 23 January 2013 21:09, Josh Spiegel <[EMAIL PROTECTED]> wrote:
> > As I understand it, Avro container files contain synchronization markers
> > every so often to support splitting the file. See:
> > (1) Why isn't the synchronization marker the same for every container
> > (i.e. what is the point of generating it randomly every time)?
> > (2) Is it possible, at least in theory, for naturally occurring data to
> > contain bytes that match the sync marker? If so, would this break
> > synchronization?
> > Thanks,
> > Josh
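
For reference, the estimate in point 2 can be checked with a few lines of Python. This is only a rough sketch under stated assumptions: the data is treated as uniformly random bytes, every byte offset is treated as an independent chance to match, and a petabyte is taken as 10^15 bytes. The function name sync_collision_probability is made up for illustration and is not part of Avro.

```python
import math

MARKER_BYTES = 16  # Avro sync markers are 16 bytes long


def sync_collision_probability(data_bytes, marker_bytes=MARKER_BYTES):
    """Approximate chance that one fixed random marker appears somewhere
    in `data_bytes` of uniformly random data.

    Assumption: each starting offset is treated as an independent trial,
    which is an approximation (overlapping windows are not independent).
    """
    # Number of offsets at which a full marker could start.
    positions = max(data_bytes - marker_bytes + 1, 0)
    # Chance that a given offset matches all marker bits.
    per_position = 2.0 ** (-8 * marker_bytes)
    # 1 - (1 - p)^n, computed via log1p/expm1 to avoid rounding to zero.
    return -math.expm1(positions * math.log1p(-per_position))


if __name__ == "__main__":
    petabyte = 10 ** 15  # decimal petabyte, an assumption
    print(f"{sync_collision_probability(petabyte):.2e}")
```

Under these assumptions the result comes out at roughly 3 x 10^-24, which is within an order of magnitude of the figure quoted above. Real data is not uniformly distributed, so this should be read as a ballpark only.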