Avro, mail # user - Record extensions?


Re: Record extensions?
Christophe Taton 2012-06-13, 01:09
On Tue, Jun 12, 2012 at 11:13 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <[EMAIL PROTECTED]> wrote:
> > I need my server to handle records with fields that can be "freely"
> > extended by users, without requiring a recompile and restart of the server.
> > The server itself does not need to know how to handle the content of this
> > extensible field.
> >
> > One way to achieve this is to have a bytes field whose content is managed
> > externally, but this is very inefficient in many ways.
> > Is there another way to do this with Avro?
>
> You could use a very generic schema, like:
>
> {"type":"record", "name":"Value", "fields": [
>  {"name":"value", "type": ["int","float","boolean", ...
>   {"type":"map", "values":"Value"}]}
> ]}
>
> This is roughly equivalent to a binary encoding of JSON.  But using a
> map forces the serialization of a field name with every field value.
> Not only does that make payloads bigger, but it also makes them slower
> to construct and parse.
>
> Another approach is to include the Avro schema for a value in the record,
> e.g.:
>
> {"type":"record", "name":"Extensions", "fields": [
>  {"name":"schema", "type": "string"},
>  {"name":"values", "type": {"type":"array", "items":"bytes"}}
> ]}
>
> This can make things more compact when there are a lot of values.  For
> example, this might be used in a search application where each query
> lists the fields it is interested in retrieving and each response
> contains a list of records that match the query and contain just the
> requested fields.  The field names are not included in each match, but
> instead once for the entire set of matches, making this faster and more
> compact.
>
> Finally, if you have a stateful connection then you can send a
> schema in the first request, then just send bytes encoding instances of
> that schema in subsequent requests over that connection.  This again
> avoids sending field names with each field value.

Thanks for the detailed reply!
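
To make the size trade-off in the quoted reply concrete, here is a minimal
sketch comparing the map-style encoding (field names sent with every value)
with the "Extensions" layout (schema sent once, then an array of bytes
payloads). It does not use the Avro library: the encoder is hand-rolled and
only approximates Avro's binary encoding for long, string, bytes, map and
array. The `Match` schema, the `title` and `url` fields, and all helper names
are hypothetical, and every value is assumed to be a string.

```python
import json


def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding, as Avro uses for int/long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length-prefixed UTF-8."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_as_map(record: dict) -> bytes:
    """Map-style record: every value travels with its field name."""
    out = bytearray()
    if record:
        out += encode_long(len(record))  # one block holding all entries
        for name, value in record.items():
            out += encode_string(name) + encode_string(value)
    out += encode_long(0)  # end-of-map marker
    return bytes(out)


def encode_positional(field_names: list, record: dict) -> bytes:
    """Record-style: values only, in schema order, no field names."""
    return b"".join(encode_string(record[name]) for name in field_names)


def encode_extensions(schema: dict, records: list) -> bytes:
    """One schema string followed by an array of opaque bytes payloads."""
    field_names = [f["name"] for f in schema["fields"]]
    out = bytearray(encode_string(json.dumps(schema)))
    if records:
        out += encode_long(len(records))
        for record in records:
            payload = encode_positional(field_names, record)
            out += encode_long(len(payload)) + payload  # one "bytes" item
    out += encode_long(0)  # end-of-array marker
    return bytes(out)


# Hypothetical search-result schema and data (assumptions, not from the thread).
MATCH_SCHEMA = {
    "type": "record",
    "name": "Match",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "url", "type": "string"},
    ],
}
matches = [{"title": f"Doc {i}", "url": f"http://example.org/{i}"} for i in range(100)]

map_size = sum(len(encode_as_map(m)) for m in matches)
compact_size = len(encode_extensions(MATCH_SCHEMA, matches))
print("map-style bytes:", map_size, "schema-once bytes:", compact_size)
```

The schema string is a fixed cost, so the second layout only pays off once
there are enough values to amortize it.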

In practice, I have a bunch of independent records, each of them carrying
at most one "extension field".

I was especially hoping there would be a way to avoid serializing an
"extension" record twice (once from the record object into a bytes field,
and then a second time as a bytes field into the destination output
stream). Ideally, such an extension field should not require its content to
be bytes, but should accept any record object, so that it is encoded only
once.
As I understand it, Avro does not allow me to do this right now. Is this
correct?
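
The two passes described above can be sketched as follows. This is an
illustration only, not Avro's API: the encoder is hand-rolled, the record
layout (a long id plus one extension with `name` and `score` fields) is
hypothetical, and `write_record_inline` shows what a single pass would look
like under the assumption that the writer could handle the extension record
directly.

```python
import io


def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding, as Avro uses for int/long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length-prefixed UTF-8."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def write_extension(out, ext: dict) -> None:
    """Write the (hypothetical) extension record's fields in schema order."""
    out.write(encode_string(ext["name"]))
    out.write(encode_long(ext["score"]))


def write_record_with_bytes_field(out, record_id: int, ext: dict) -> None:
    """Extension carried as a bytes field: encoded once, then copied."""
    out.write(encode_long(record_id))
    tmp = io.BytesIO()
    write_extension(tmp, ext)             # pass 1: record -> temporary buffer
    payload = tmp.getvalue()
    out.write(encode_long(len(payload)))  # pass 2: buffer -> output stream,
    out.write(payload)                    # written as a length-prefixed bytes field


def write_record_inline(out, record_id: int, ext: dict) -> None:
    """Single pass: extension fields go straight to the output stream."""
    out.write(encode_long(record_id))
    write_extension(out, ext)


ext = {"name": "color", "score": 3}

wrapped = io.BytesIO()
write_record_with_bytes_field(wrapped, 7, ext)

inline = io.BytesIO()
write_record_inline(inline, 7, ext)

print(wrapped.getvalue().hex(), inline.getvalue().hex())
```

In this sketch the wire formats differ only by the length prefix of the bytes
field; the extra cost of the first variant is the intermediate buffer and the
copy.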

Thanks,
Christophe