We are currently working on an integration project that has widely varying
usage/requirements around serialization formats. While we typically lean
towards Avro, we are now considering using multiple serialization formats
(Avro, Protobuf, and JSON) to deal with some of the challenges we are
facing. I am interested in hearing whether we have any bad assumptions
here, or whether others have considered similar tradeoffs.
In brief:

We are building an on-boarding solution that allows customers to upload
their data to our platform.  Our platform has many associated products,
each of which can subscribe to customer feeds through pipeline services.
In terms of technology, we are using HBase for storage and Kafka as the
means of brokering between the various producers & consumers, i.e. the
pipeline services.
We started with Avro because a) we had prior experience with it, b) due to
its tag-less nature it offers the most efficiency wrt storage, and c) it
has strong usage/community support with Kafka.  Furthermore, we have found
the reader/writer schema separation to be a strong fit for ingesting data
and pushing it into the pipeline.
However, there are several problem areas we hit with Avro.
1) The Registry Lookup - The reader/writer relationship, while useful, also
has a cost.  It requires that the schema either be stored with each event
sent in the pipeline, or that we pass along an identifier that can be used
to look up/cache the schema.  While we have no issue doing the latter
within the on-boarding component itself, because it is isolated, we are
more hesitant to place the same cost requirements on downstream components
in the pipeline.  The ideal here would be for Avro to have an option for
tagged output that isn't as heavy as placing the entire schema in the
message.
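For concreteness, the identifier approach we use amounts to framing each Kafka message with a small schema-id prefix (this is essentially the wire format popularized by Confluent's schema registry: one magic byte plus a 4-byte id). A minimal sketch in plain Java, with class and method names of our own choosing:

```java
import java.nio.ByteBuffer;

public class WireFormat {
    private static final byte MAGIC = 0x0;

    // Prefix an Avro-encoded payload with a 1-byte marker and a 4-byte
    // big-endian schema id, so consumers can fetch/cache the writer schema
    // from a registry instead of carrying the full schema per event.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(5 + avroPayload.length)
                .put(MAGIC)
                .putInt(schemaId)
                .put(avroPayload)
                .array();
    }

    // Recover the schema id from a framed message.
    static int schemaIdOf(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        if (buf.get() != MAGIC) throw new IllegalArgumentException("bad magic byte");
        return buf.getInt();
    }

    public static void main(String[] args) {
        byte[] msg = frame(42, new byte[] {1, 2, 3});
        System.out.println(schemaIdOf(msg)); // 42
        System.out.println(msg.length);      // 8 (5-byte header + 3-byte payload)
    }
}
```

The per-message overhead is only 5 bytes, but every downstream consumer still has to be able to reach the registry, which is exactly the coupling we are hesitant about.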
2) Variable Data Structures - A goal we have is to easily generate
cross-platform libraries to produce/consume messages in the pipeline.
 For structures that we know at build time (e.g. MessageHeader) we can
generate code.  However, we also have schemas with a high degree of
variability.  For example, we expect the schemas for customer datasources
to change often and to vary significantly between customers.  There doesn't
seem to be a way to use Specific Records and Generic Records together.  In
our case part of the record is Specific (i.e. well known at build time),
but it also contains Generic data (variable structures).  There is no way
to specify a schema that contains a typed generic field (we would like to
be able to generate Specific classes that have the ability to return
GenericRecords for a field).  As a result we needed to store the variable
structures as blobs and write platform-specific code to encode/decode them.
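For reference, our blob workaround looks roughly like the schema below: a Specific, code-generated header plus an opaque bytes field for the variable portion (record and field names here are illustrative, not our actual schema):

```json
{
  "type": "record",
  "name": "PipelineEvent",
  "fields": [
    {"name": "header", "type": {
      "type": "record",
      "name": "MessageHeader",
      "fields": [
        {"name": "eventId", "type": "string"},
        {"name": "customerId", "type": "string"}
      ]
    }},
    {"name": "body", "type": "bytes",
     "doc": "variable customer structure; encoded/decoded by platform-specific code"}
  ]
}
```

The cost is that the `body` field is invisible to Avro: no schema evolution, no resolution, and each platform needs its own encode/decode code for it.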
3) Avro's JSON - To deal with 2, we attempted to use Avro's JsonEncoder and
let clients parse the JSON themselves; however, we found the encoding to be
Avro-specific (it assumes an Avro JsonDecoder on the other end).  E.g.
unions resulted in an additional object wrapper with type info (even for
simple unions with null).  Also, default values are not encoded into the
JSON, since those values are derived from the schema.
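To illustrate the union wrapping: for a record field declared as a simple nullable string, Avro's JSON encoding wraps the non-null branch in an object keyed by the branch's type name, which is not what a plain JSON client expects (field name and value here are made up):

```
schema field:          {"name": "note", "type": ["null", "string"]}
Avro JsonEncoder:      {"note": {"string": "hi"}}   (null branch encodes as {"note": null})
plain JSON equivalent: {"note": "hi"}
```

The wrapper exists so an Avro JsonDecoder can tell which union branch was written, but it forces every non-Avro consumer to understand an Avro-specific convention.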
Our project is still in the early stages, and our preference would be to
use Avro for everything.  But due to 1-3, I believe we are headed for a
best-of-breed serialization solution, where we use Avro for
ingest/storage, Protobuf for the Pipeline/Kafka to eliminate the need for
registry lookups, and JSON for the variable data structures with our own
encoding/decoding logic.


Shone Sadler
