Avro >> mail # user >> AvroStorage/Avro Schema Question


Re: AvroStorage/Avro Schema Question
In thinking about it more... it seems that unfortunately, the only thing I
can really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements.
I will try running this through my stack.
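
For reference, the Avro specification requires every entry in a record's "fields" list to be an object with its own "name" and "type", so the wrapper record above needs a named field to hold the array. A minimal sketch of one such field follows. The inner field name "addresses" is hypothetical (it does not appear in the original schema), and the result has not been run through AvroStorage:

```json
{"name": "froms",
 "type": [
   {"type": "record",
    "name": "from",
    "fields": [
      {"name": "addresses",
       "type": [{"type": "array", "items": "string"}, "null"]}
    ]},
   "null"
 ]}
```

Whether Pig then shows the wanted names would still need to be checked with describe after loading, since the array inside the record is itself still an unnamed type.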
On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <[EMAIL PROTECTED]> wrote:

> It appears that the Avro-to-Pig schema translation in AvroStorage names the
> elements of every array ARRAY_ELEM.  The nullable wrapper is 'visible' and
> the field name is not moved onto the bag name.
>
> About a year and a half ago I started
> https://issues.apache.org/jira/browse/AVRO-592
>
> but before finishing it AvroStorage was written elsewhere.  I don't recall
> exactly what I did with the schema translation there, but I recall the
> mapping from an Avro schema to pig tried to hide the nullable wrappers more.
>
>
> In Avro, arrays are unnamed types, so I see two things you could probably
> do without any code changes:
>
> * Add a line in the pig script to project / rename the fields to what you
> want (unfortunate and clumsy, but I think it will work; I think you want
> "from::PIG_WRAPPER::ARRAY_ELEM as from"  or
> "FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that).
> * Add a record wrapper to your schema (which may inject more messiness in
> the pig schema view):
> {
>     "namespace": "agile.data.avro",
>     "name": "Email",
>     "type": "record",
>     "fields": [
>         {"name":"message_id", "type": ["string", "null"]},
>         {"name":"from","type": [{"type":"record", "name":"From", "fields":
> [[{"type":"array", "items":"string"},"null"]], "null"]},
>        …
>     ]
> }
>
> But that is very awkward — requiring a named record for each field that is
> an unnamed type.
>
>
> Ideally AvroStorage would treat any union of null and one other thing as a
> simple pig type with no wrapper, and project the name of a field or record
> into the name of the thing inside a bag.
>
>
> -Scott
>
> On 3/29/12 6:05 PM, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
>
> Is it possible to name string elements in the schema of an array?
>  Specifically, below I want to name the email addresses in the
> from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
> Pig's AvroStorage.  I know I can probably fix this in Java in the Pig
> AvroStorage UDF, but I'm hoping I can also fix it more easily in the
> schema.  Last time I read Avro's array docs in this context, my hit-points
> dropped by a third, so pardon me if I've not rtfm this time :)
>
> Complete description of what I'm doing follows:
>
> Avro schema for my emails:
>
> {
>     "namespace": "agile.data.avro",
>     "name": "Email",
>     "type": "record",
>     "fields": [
>         {"name":"message_id", "type": ["string", "null"]},
>         {"name":"from","type": [{"type":"array", "items":"string"},
> "null"]},
>         {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
>         {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
>         {"name":"bcc","type": [{"type":"array", "items":"string"},
> "null"]},
>         {"name":"reply_to", "type": [{"type":"array", "items":"string"},
> "null"]},
>         {"name":"in_reply_to", "type": [{"type":"array",
> "items":"string"}, "null"]},
>         {"name":"subject", "type": ["string", "null"]},
>         {"name":"body", "type": ["string", "null"]},
>         {"name":"date", "type": ["string", "null"]}
>     ]
> }
>
>
> Pig to publish my Avros:
>
> grunt> emails = load '/me/tmp/emails' using AvroStorage();
> grunt> describe emails
>
> emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM:
> chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER:
> (ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM:
> chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to:
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com