Re: Record extensions?


On 6/12/12 6:09 PM, "Christophe Taton" <[EMAIL PROTECTED]> wrote:

> On Tue, Jun 12, 2012 at 11:13 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <[EMAIL PROTECTED]>
>> wrote:
>>> I need my server to handle records with fields that can be "freely" extended
>>> by users, without requiring a recompile and restart of the server.
>>> The server itself does not need to know how to handle the content of this
>>> extensible field.
>>>
>>> One way to achieve this is to have a bytes field whose content is managed
>>> externally, but this is very inefficient in many ways.
>>> Is there another way to do this with Avro?
>>
>> You could use a very generic schema, like:
>>
>> {"type":"record", "name":"Value", fields: [
>>  {"name":"value", "type": ["int","float","boolean", ...
>> {"type":"map", "values":"Value"}}
>> ]}
>>
>> This is roughly equivalent to a binary encoding of JSON.  But using a map
>> forces the serialization of a field name with every field
>> value.  Not only does that make payloads bigger but it also makes them
>> slower to construct and parse.
>>
>> Another approach is to include the Avro schema for a value in the record,
>> e.g.:
>>
>> {"type":"record", "name":"Extensions", fields: [
>>  {"name":"schema", type: "string"},
>>  {"name":"values", "type": {"type":"array", "items":"bytes"}}
>> ]}
>>
>> This can make things more compact when there are a lot of values.  For
>> example, this might be used in a search application where each query
>> lists the fields it's interested in retrieving and each response
>> contains a list of records that match the query and contain just the
>> requested fields.  The field names are not included in each match, but
>> instead once for the entire set of matches, making this faster and more
>> compact.
>>
>> Finally, if you have a stateful connection then you can send a
>> schema in the first request then just send bytes encoding instances of
>> that schema in subsequent requests over that connection.  This again
>> avoids sending field names with each field value.
>
> Thanks for the detailed reply!
>
> In practice, I have a bunch of independent records, each of them carrying at
> most one "extension field".
>
> I was especially hoping there would be a way to avoid serializing an
> "extension" record twice (once from the record object into a bytes field, and
> then a second time as a bytes field into the destination output stream).
> Ideally, such an extension field should not require its content to be bytes,
> but should accept any record object, so that it is encoded only once.
> As I understand it, Avro does not allow me to do this right now. Is this
> correct?

If your extension field (or fields) were a union of the allowed types, its
type can be detected at runtime.  If the name is dynamic as well, it can be
a pair record with name and data.  If there are multiple types then an array
or map can be used.  Lastly, the option of encoding a blob as bytes and
nesting it can be done; this blob can be Avro or anything else.
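
A minimal sketch of that pair-record idea, assuming a made-up "Extension"
schema and placeholder union branches, might look like this with the generic
Java API (the writer picks the union branch from the value's runtime type, so
no recompile is needed to handle a different allowed type):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class ExtensionFieldSketch {
  // Hypothetical pair record: an extension name plus a union of allowed types.
  static final String EXTENSION_SCHEMA =
      "{\"type\":\"record\", \"name\":\"Extension\", \"fields\": ["
      + " {\"name\":\"name\", \"type\":\"string\"},"
      + " {\"name\":\"data\", \"type\": [\"null\", \"long\", \"double\", \"string\", \"bytes\"]}"
      + "]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(EXTENSION_SCHEMA);

    GenericRecord extension = new GenericData.Record(schema);
    extension.put("name", "user-supplied-extension");
    extension.put("data", 42L);  // could equally be a Double, String, or ByteBuffer

    // The union branch is resolved from the value's runtime type at write time.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(extension, encoder);
    encoder.flush();
    System.out.println("Encoded " + out.size() + " bytes");
  }
}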

I can imagine an Avro RPC server and client API that allowed for great
flexibility in registering and responding to custom RPC types, but both the
client and server in such a situation would have to be paired up to deal
with interpreting which schema variations map to some sort of schema
resolution versus a dynamic payload.
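
To illustrate the last suggestion quoted above (the schema sent once on the
connection, raw bytes afterwards), here is a rough decoder-side sketch with
made-up names, again using the generic API:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

// Sketch only: the first message on a connection carries the extension schema
// as a string; later messages carry only binary-encoded instances of it.
public class ExtensionDecoder {
  private final GenericDatumReader<GenericRecord> reader;

  public ExtensionDecoder(String schemaJsonFromFirstMessage) {
    Schema schema = new Schema.Parser().parse(schemaJsonFromFirstMessage);
    this.reader = new GenericDatumReader<GenericRecord>(schema);
  }

  // Decodes one extension payload received later on the same connection.
  public GenericRecord decode(byte[] payload) throws IOException {
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
    return reader.read(null, decoder);
  }
}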

>
> Thanks,
> Christophe