Avro >> mail # user >> AvroStorage/Avro Schema Question


Re: AvroStorage/Avro Schema Question
In thinking about it more... it seems that unfortunately, the only thing I
can really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements.
I will try running this through my stack.
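
A minimal sketch of generating the wrapped fields, in Python with only the standard library. One caveat: Avro requires every record field to carry a "name", so the inner array gets one here. Reusing the singular name for it is my assumption, not something from the schema above, and wrap_array_field and ADDRESS_FIELDS are hypothetical names. Whether AvroStorage then drops ARRAY_ELEM is untested.

```python
import json

# The address fields named in this thread.
ADDRESS_FIELDS = ["from", "to", "cc", "bcc", "reply_to"]


def wrap_array_field(plural_name, element_name):
    """Build one Avro field: a nullable record wrapping a nullable string array."""
    nullable_array = [{"type": "array", "items": "string"}, "null"]
    wrapper = {
        "type": "record",
        "name": element_name,
        # Assumption: the inner field reuses the singular name, since Avro
        # record fields must be named.
        "fields": [{"name": element_name, "type": nullable_array}],
    }
    return {"name": plural_name, "type": [wrapper, "null"]}


# Pluralize each field name and name the wrapper record after the singular.
fields = [wrap_array_field(name + "s", name) for name in ADDRESS_FIELDS]
print(json.dumps(fields[0], indent=2))
```

Each wrapper record has a distinct name, which matters because Avro rejects a schema that defines the same named type twice.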
On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <[EMAIL PROTECTED]> wrote:

> It appears as though the Avro to PigStorage schema translation names (in
> pig) all arrays ARRAY_ELEM.  The nullable wrapper is 'visible' and the
> field name is not moved onto the bag name.
>
> About a year and a half ago I started
> https://issues.apache.org/jira/browse/AVRO-592
>
> but before finishing it AvroStorage was written elsewhere.  I don't recall
> exactly what I did with the schema translation there, but I recall the
> mapping from an Avro schema to pig tried to hide the nullable wrappers more.
>
>
> In Avro, arrays are unnamed types, so I see two things you could probably
> do without any code changes:
>
> * Add a line in the Pig script to project / rename the fields to what you
> want (unfortunate and clumsy, but I think it will work). I think you want
> "from::PIG_WRAPPER::ARRAY_ELEM as from" or
> "FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", or something like that.
> * Add a record wrapper to your schema (which may inject more messiness in
> the pig schema view):
> {
>     "namespace": "agile.data.avro",
>     "name": "Email",
>     "type": "record",
>     "fields": [
>         {"name":"message_id", "type": ["string", "null"]},
>         {"name":"from","type": [{"type":"record", "name":"From", "fields":
> [[{"type":"array", "items":"string"},"null"]], "null"]},
>        …
>     ]
> }
>
> But that is very awkward: it requires a named record for each field that is
> an unnamed type.
>
>
> Ideally PigStorage would treat any union of null and one other thing as a
> simple pig type with no wrapper, and project the name of a field or record
> into the name of the thing inside a bag.
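
The translation described above could be sketched roughly as follows, in Python with only the standard library. This is not how AvroStorage works; to_pig, unwrap_nullable and the PRIMITIVES table are hypothetical names, and the output only approximates what Pig's describe prints.

```python
# Partial mapping from Avro primitive types to Pig types (illustrative only).
PRIMITIVES = {
    "string": "chararray",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "bytes": "bytearray",
}


def unwrap_nullable(schema):
    """Return the non-null branch of a two-branch union with null, else the schema."""
    if isinstance(schema, list) and len(schema) == 2:
        non_null = [s for s in schema if s != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return schema


def to_pig(schema, name):
    """Render an Avro schema as a Pig-style schema string with no wrappers."""
    schema = unwrap_nullable(schema)
    if isinstance(schema, str):
        return f"{name}: {PRIMITIVES[schema]}"
    if isinstance(schema, dict) and schema["type"] == "array":
        # Project the field name onto the element inside the bag.
        return f"{name}: {{({to_pig(schema['items'], name)})}}"
    if isinstance(schema, dict) and schema["type"] == "record":
        inner = ",".join(to_pig(f["type"], f["name"]) for f in schema["fields"])
        return f"{name}: ({inner})"
    raise ValueError(f"unsupported schema for {name}: {schema!r}")


print(to_pig(["string", "null"], "message_id"))
print(to_pig([{"type": "array", "items": "string"}, "null"], "from"))
```

Unions of more than two branches, or without a null branch, are left unsupported here and raise ValueError.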
>
>
> -Scott
>
> On 3/29/12 6:05 PM, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
>
> Is it possible to name string elements in the schema of an array?
>  Specifically, below I want to name the email addresses in the
> from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
> Pig's AvroStorage.  I know I can probably fix this in Java in the Pig
> AvroStorage UDF, but I'm hoping I can also fix it more easily in the
> schema.  Last time I read Avro's array docs in this context, my hit-points
> dropped by a third, so pardon me if I've not rtfm this time :)
>
> Complete description of what I'm doing follows:
>
> Avro schema for my emails:
>
> {
>     "namespace": "agile.data.avro",
>     "name": "Email",
>     "type": "record",
>     "fields": [
>         {"name":"message_id", "type": ["string", "null"]},
>         {"name":"from","type": [{"type":"array", "items":"string"},
> "null"]},
>         {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
>         {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
>         {"name":"bcc","type": [{"type":"array", "items":"string"},
> "null"]},
>         {"name":"reply_to", "type": [{"type":"array", "items":"string"},
> "null"]},
>         {"name":"in_reply_to", "type": [{"type":"array",
> "items":"string"}, "null"]},
>         {"name":"subject", "type": ["string", "null"]},
>         {"name":"body", "type": ["string", "null"]},
>         {"name":"date", "type": ["string", "null"]}
>     ]
> }
>
>
> Pig to publish my Avros:
>
> grunt> emails = load '/me/tmp/emails' using AvroStorage();
> grunt> describe emails
>
> emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM:
> chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER:
> (ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM:
> chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to:
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com