1. Because if it were predictable, it would inevitably appear in the
actual data sometimes (e.g. imagine the Avro documentation, which states
what the sync marker is, being downloaded by a web crawler and stored in
an Avro data file; the sync marker would then appear in the actual
data). Data may also come from malicious sources; making the marker
random makes this infeasible to exploit.
2. Possibly, but extremely unlikely. The probability of a given random
16-byte string appearing in a petabyte of (uniformly distributed) data
is roughly 3 x 10^-24 (about 10^15 possible positions out of 2^128
possible marker values). It's more likely that your data center is
wiped out by a meteorite
(http://preshing.com/20110504/hash-collision-probabilities).
3. If the sync marker appears in your data, it only breaks reading the
file if you happen to also seek to that place in the file. If you just
read over it sequentially, nothing happens.
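
The estimate in point 2 can be sanity-checked with a few lines of Python.
This is only a rough sketch: it assumes the data bytes are uniformly
random and independent of the marker, and it uses a simple union bound
(number of possible start positions divided by the number of possible
markers). The function name and the decimal definition of a petabyte
(10^15 bytes) are my own choices for illustration, not anything from Avro.

```python
def sync_marker_collision_bound(data_bytes: int, marker_bytes: int = 16) -> float:
    """Upper bound on the chance that one fixed random marker occurs
    somewhere in `data_bytes` of uniformly random data (union bound)."""
    # Number of byte offsets at which a full marker could start.
    positions = max(data_bytes - marker_bytes + 1, 0)
    # Number of distinct markers of this length (2^128 for 16 bytes).
    possible_markers = 2 ** (8 * marker_bytes)
    # Each position matches with probability 1 / possible_markers.
    return min(positions / possible_markers, 1.0)


if __name__ == "__main__":
    petabyte = 10 ** 15  # assumption: decimal petabyte
    print(f"{sync_marker_collision_bound(petabyte):.2e}")
```

For one petabyte and a 16-byte marker the bound comes out at a few times
10^-24. Real data is not uniformly random, which is exactly why point 1
matters: the marker is random even when the data is not.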
On 23 January 2013 21:09, Josh Spiegel <[EMAIL PROTECTED]> wrote:
> As I understand it, Avro container files contain synchronization markers
> every so often to support splitting the file. See:
> (1) Why isn't the synchronization marker the same for every container file?
> (i.e. what is the point of generating it randomly every time)
> (2) Is it possible, at least in theory, for naturally occurring data to
> contain bytes that match the sync marker? If so, would this break