Re: serialization stability when using Avro objects as 'headers' at the start of a longer stream
Unless you're certain that the header schema will never change and/or
that the reading and writing code will always have the same exact
version (i.e., data will not be persisted or transmitted over the
network), I would suggest that you also include some kind of version or
magic number at the start of the stream to permit you to evolve the
format of the header.

For example, a simple approach might be to have the initial four bytes
of MyFormat version 1 be something like ['M','F','0','1'].
Then, in your code, you might have a table like:

static final Schema[] SCHEMA_VERSIONS = { MyHeaderSchema };

When you read a stream you can find its schema in this table.

Then when you modify the header schema you can add the old schema to
the table.  This permits you to evolve the header schema.
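
To make this concrete, here is a rough Java sketch (not from the
original message) of the magic-bytes-plus-version framing described
above; the MyHeader schema, its field, and the "MF" magic value are
invented for illustration:

import java.io.*;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class HeaderFraming {
  static final Schema V1 = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"MyHeader\",\"fields\":["
    + "{\"name\":\"contentType\",\"type\":\"string\"}]}");

  // The index in this table doubles as the on-the-wire version number.
  static final Schema[] SCHEMA_VERSIONS = { V1 };

  static void writeHeader(OutputStream out, GenericRecord header) throws IOException {
    out.write(new byte[] {'M', 'F', '0', '1'});   // magic + version "01"
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(SCHEMA_VERSIONS[0]).write(header, enc);
    enc.flush();                                  // body bytes may follow on the same stream
  }

  static GenericRecord readHeader(InputStream in) throws IOException {
    byte[] magic = new byte[4];
    new DataInputStream(in).readFully(magic);     // "MF" + two-digit version
    if (magic[0] != 'M' || magic[1] != 'F') throw new IOException("bad magic");
    int version = (magic[2] - '0') * 10 + (magic[3] - '0');
    Schema writerSchema = SCHEMA_VERSIONS[version - 1];                // schema the data was written with
    Schema readerSchema = SCHEMA_VERSIONS[SCHEMA_VERSIONS.length - 1]; // newest schema this code knows
    // A *direct* decoder does not read ahead, so it stops exactly at the end of the header.
    BinaryDecoder dec = DecoderFactory.get().directBinaryDecoder(in, null);
    return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, dec);
  }
}

Passing both the writer's and the reader's schema to GenericDatumReader
is what lets Avro's schema resolution bridge old and new header
versions.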

There are lots of other ways to do this with various tradeoffs.  The
schemas could be stored in a database; you might use the Schema's
fingerprint instead of a version number; you could even put the entire
schema at the beginning of every stream.  Regardless, for any
non-ephemeral format, it's best to have the first few bytes identify
the format.
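
As a rough sketch of the fingerprint variant (the class and method
names here are illustrative, not from the thread), a reader could key
the header schemas it knows about by their 64-bit Avro fingerprint,
and the writer would emit that 8-byte fingerprint in place of a
version number:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class HeaderSchemaRegistry {
  private final Map<Long, Schema> knownSchemas = new HashMap<Long, Schema>();

  void register(Schema schema) {
    // CRC-64-AVRO fingerprint of the schema's parsing canonical form
    knownSchemas.put(SchemaNormalization.parsingFingerprint64(schema), schema);
  }

  Schema lookup(long fingerprint) throws IOException {
    Schema s = knownSchemas.get(fingerprint);
    if (s == null) {
      throw new IOException("Unknown header schema fingerprint: " + fingerprint);
    }
    return s;
  }
}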

Doug

On Wed, Jan 15, 2014 at 12:15 AM, Sid Shetye <[EMAIL PROTECTED]> wrote:
> From a deserialization stability perspective, how safe is it to have an Avro-serialized object at the start of a byte stream? Let's assume the rest of the stream, after this Avro-serialized object, is filled with application-layer data, which can be anything from zero bytes to a few hundred megabytes.
>
> Essentially, the Avro object is used as a header and the "body" is a byte stream. To illustrate via a made-up case:
>
> Offset - Data
> =============
> 0x0000 - 1st byte of MyAvroHeader (serialized by Avro)
> ...
> 0x001F - Last byte of MyAvroHeader (serialized by Avro)
> 0x0020 - 1st byte of MyAppStream
> ...              // (bytes/offsets continue till end-of-stream is reached)
>
> I did a quick and simple serialization-only test (no IPC/RPC) using the C# version of Avro, and this seems to work well. However, I wanted to hear from others whether there are any issues with this approach.
>
> Regards
> Sid
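
For reference, a minimal end-to-end Java sketch of the layout Sid
describes (the schema, its bodyLength field, and the payload are
invented for illustration; the original test was done with the C#
implementation):

import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class HeaderThenBody {
  static final Schema HEADER = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"MyAvroHeader\",\"fields\":["
    + "{\"name\":\"bodyLength\",\"type\":\"long\"}]}");

  public static void main(String[] args) throws IOException {
    byte[] body = "application payload".getBytes(StandardCharsets.UTF_8);

    // Write: the Avro-encoded header first, then the raw body bytes.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericRecord header = new GenericData.Record(HEADER);
    header.put("bodyLength", (long) body.length);
    BinaryEncoder enc = EncoderFactory.get().directBinaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(HEADER).write(header, enc);
    enc.flush();
    out.write(body);

    // Read: a direct decoder consumes exactly the header's bytes and no more,
    // so whatever follows on the stream is the application body.
    InputStream in = new ByteArrayInputStream(out.toByteArray());
    BinaryDecoder dec = DecoderFactory.get().directBinaryDecoder(in, null);
    GenericRecord readBack = new GenericDatumReader<GenericRecord>(HEADER).read(null, dec);
    byte[] rest = new byte[((Long) readBack.get("bodyLength")).intValue()];
    new DataInputStream(in).readFully(rest);
    System.out.println(new String(rest, StandardCharsets.UTF_8));
  }
}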