Avro >> mail # user >> Record extensions?

Re: Record extensions?

On 6/12/12 6:09 PM, "Christophe Taton" <[EMAIL PROTECTED]> wrote:

> On Tue, Jun 12, 2012 at 11:13 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <[EMAIL PROTECTED]>
>> wrote:
>>> > I need my server to handle records with fields that can be "freely" extended
>>> > by users, without requiring a recompile and restart of the server.
>>> > The server itself does not need to know how to handle the content of this
>>> > extensible field.
>>> >
>>> > One way to achieve this is to have a bytes field whose content is managed
>>> > externally, but this is very inefficient in many ways.
>>> > Is there another way to do this with Avro?
>> You could use a very generic schema, like:
>> {"type":"record", "name":"Value", "fields": [
>>  {"name":"value", "type": ["int","float","boolean", ...
>>   {"type":"map", "values":"Value"}]}
>> ]}
>> This is roughly equivalent to a binary encoding of JSON.  But by using
>> a map it forces the serialization of a field name with every field
>> value.  Not only does that make payloads bigger but it also makes them
>> slower to construct and parse.
>> Another approach is to include the Avro schema for a value in the record,
>> e.g.:
>> {"type":"record", "name":"Extensions", "fields": [
>>  {"name":"schema", "type": "string"},
>>  {"name":"values", "type": {"type":"array", "items":"bytes"}}
>> ]}
>> This can make things more compact when there are a lot of values.  For
>> example, this might be used in a search application where each query
>> lists the fields it's interested in retrieving and each response
>> contains a list of records that match the query and contain just the
>> requested fields.  The field names are not included in each match, but
>> instead once for the entire set of matches, making this faster and more
>> compact.
>> Finally, if you have a stateful connection then you can send a
>> schema in the first request then just send bytes encoding instances of
>> that schema in subsequent requests over that connection.  This again
>> avoids sending field names with each field value.
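
To put a rough number on the "field name with every field value" overhead described above, here is a minimal sketch. It does not use the Avro library: the helpers (encode_long, encode_string, encode_map_of_longs, encode_record_of_longs) are hypothetical, hand-rolled from the binary encoding rules in the Avro specification, and the map is simplified to long values only rather than the full union from the generic schema.

```python
def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding, as the Avro spec describes for int/long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small negative numbers stay small
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_map_of_longs(values: dict) -> bytes:
    """A map is one block (count, then key/value pairs) ended by a zero count."""
    if not values:
        return encode_long(0)
    out = bytearray(encode_long(len(values)))
    for key, value in values.items():
        out += encode_string(key)  # the field name travels with every value
        out += encode_long(value)
    out += encode_long(0)  # end-of-map marker
    return bytes(out)


def encode_record_of_longs(values: dict, field_order: list) -> bytes:
    """A record is just its field values in schema order; no names on the wire."""
    return b"".join(encode_long(values[name]) for name in field_order)


row = {"id": 1, "score": 42, "year": 2012}
as_map = encode_map_of_longs(row)
as_record = encode_record_of_longs(row, ["id", "score", "year"])
print(len(as_map), len(as_record))
```

With a record schema sent once (in the message or at the start of a connection), each row costs only the record bytes; with the map, the names are paid for again on every row.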
> Thanks for the detailed reply!
> In practice, I have a bunch of independent records, each of them carrying at
> most one "extension field".
> I was especially hoping there would be a way to avoid serializing an
> "extension" record twice (once from the record object into a bytes field, and
> then a second time as a bytes field into the destination output stream).
> Ideally, such an extension field should not require its content to be bytes,
> but should accept any record object, so that it is encoded only once.
> As I understand it, Avro does not allow me to do this right now. Is this
> correct?

If your extension field (or fields) is a union of the allowed types, its
type can be detected at runtime. If the name is dynamic as well, the field
can be a pair record with a name and data. If there are multiple types, an
array or map can be used. Lastly, you can encode a blob as bytes and nest
it; this blob can be Avro or anything else.
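
As a rough illustration of the difference between the union option and the nested-bytes option, here is a minimal sketch. It is hand-rolled from the binary encoding rules in the Avro specification and does not use the Avro library; write_long, write_string, write_extension_as_union, write_extension_as_bytes and write_payload are all hypothetical names. The point it shows: a union branch is an index followed by the value written straight to the output, while a bytes field needs the value fully encoded first so its length prefix is known.

```python
import io


def write_long(out, n: int) -> None:
    """Zigzag + variable-length long, per the Avro spec's binary encoding."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))


def write_string(out, s: str) -> None:
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)


def write_extension_as_union(out, branch_index: int, write_payload) -> None:
    """Union: zero-based branch index, then the value, in a single pass."""
    write_long(out, branch_index)
    write_payload(out)  # encoded directly into the destination stream


def write_extension_as_bytes(out, write_payload) -> None:
    """Bytes: the value is encoded to a temporary buffer, then copied."""
    tmp = io.BytesIO()
    write_payload(tmp)          # first pass: encode into a scratch buffer
    blob = tmp.getvalue()
    write_long(out, len(blob))  # bytes are length-prefixed
    out.write(blob)             # second pass: copy into the destination


def write_payload(out) -> None:
    # Hypothetical extension record with a long field and a string field.
    write_long(out, 7)
    write_string(out, "hi")


union_out = io.BytesIO()
write_extension_as_union(union_out, 1, write_payload)

bytes_out = io.BytesIO()
write_extension_as_bytes(bytes_out, write_payload)

print(union_out.getvalue(), bytes_out.getvalue())
```

Both forms are the same size here; what the union saves is the intermediate buffer and copy. The trade-off is that every allowed extension type has to be listed in the union schema known to both sides.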

I can imagine an Avro RPC server and Client API that allowed for great
flexibility in registering and responding to custom RPC types, but both the
client and server in such a situation would have to be paired up to deal
with interpreting which schema variations map to some sort of schema
resolution versus a dynamic payload.

> Thanks,
> Christophe