Re: AvroStorage/Avro Schema Question
It appears as though the Avro to PigStorage schema translation names all
array elements ARRAY_ELEM (in pig).  The nullable wrapper is 'visible' and
the field name is not moved onto the bag name.

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall
exactly what I did with the schema translation there, but I recall the
mapping from an Avro schema to pig tried to hide the nullable wrappers more.
In Avro, arrays are unnamed types, so I see two things you could probably do
without any code changes:

* Add a line in the pig script to project / rename the fields to what you
want (unfortunate and clumsy, but I think it will work). I think you want
"from::PIG_WRAPPER::ARRAY_ELEM as from" or
"FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in
the pig schema view):
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"record", "name":"From", "fields":
[[{"type":"array", "items":"string"},"null"]], "null"]},
       ...
    ]
}

But that is very awkward, requiring a named record for each field that is
an unnamed type.
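
To make the record-wrapper idea concrete, here is a minimal Python sketch that
rewrites a schema so every nullable-array field is wrapped in a named record.
It is only an illustration of the schema rewrite, not anything AvroStorage
does, and it says nothing about how the result shows up in pig. The helper
names and the inner field name "values" are assumptions of mine; Avro record
fields each need a name, but the spec does not prescribe this one.

```python
import copy
import json


def is_nullable_array(avro_type):
    """True for a two-branch union of one array type and "null"."""
    if not isinstance(avro_type, list) or len(avro_type) != 2:
        return False
    arrays = [t for t in avro_type
              if isinstance(t, dict) and t.get("type") == "array"]
    return len(arrays) == 1 and "null" in avro_type


def wrap_array_fields(schema, inner_name="values"):
    """Return a copy of `schema` with nullable-array fields wrapped in records.

    `inner_name` is a hypothetical name for the single field of each wrapper.
    """
    wrapped = copy.deepcopy(schema)
    for field in wrapped["fields"]:
        if is_nullable_array(field["type"]):
            record = {
                "type": "record",
                # Derive a record name from the field name,
                # e.g. "reply_to" becomes "ReplyTo".
                "name": "".join(p.capitalize()
                                for p in field["name"].split("_")),
                "fields": [{"name": inner_name, "type": field["type"]}],
            }
            field["type"] = [record, "null"]
    return wrapped


# A two-field subset of the Email schema from this thread.
email = {
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name": "message_id", "type": ["string", "null"]},
        {"name": "from",
         "type": [{"type": "array", "items": "string"}, "null"]},
    ],
}

print(json.dumps(wrap_array_fields(email)["fields"][1], indent=2))
```

Every array field ends up with its own wrapper record, which is exactly the
awkwardness described above.
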
Ideally PigStorage would treat any union of null and one other thing as a
simple pig type with no wrapper, and project the name of a field or record
into the name of the thing inside a bag.
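
As a rough picture of that ideal mapping, here is a small Python sketch. It is
not AvroStorage's code, just my assumption of what such a translation could
look like: a union of null and one other type collapses to the other type, and
the element inside a bag inherits the enclosing field's name. The function
names, the type table, and the output notation (loosely modeled on pig's
describe output) are all hypothetical.

```python
# Avro primitive -> pig type, limited to a few common cases for the sketch.
PRIMITIVES = {
    "string": "chararray",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "bytes": "bytearray",
}


def unwrap_nullable(avro_type):
    """Collapse a union of "null" and one other type to that other type."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return avro_type


def to_pig(avro_type, name):
    """Render an Avro type as a pig-style schema string under the given name."""
    avro_type = unwrap_nullable(avro_type)
    if isinstance(avro_type, str):
        if avro_type not in PRIMITIVES:
            raise ValueError("unsupported type in this sketch: " + avro_type)
        return f"{name}: {PRIMITIVES[avro_type]}"
    if isinstance(avro_type, dict) and avro_type.get("type") == "array":
        # The bag element takes the field's name instead of ARRAY_ELEM.
        inner = to_pig(avro_type["items"], name)
        return f"{name}: {{({inner})}}"
    if isinstance(avro_type, dict) and avro_type.get("type") == "record":
        fields = ",".join(to_pig(f["type"], f["name"])
                          for f in avro_type["fields"])
        return f"{name}: ({fields})"
    raise ValueError("unsupported type in this sketch")


# A two-field subset of the Email schema from this thread.
email = {
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name": "message_id", "type": ["string", "null"]},
        {"name": "from",
         "type": [{"type": "array", "items": "string"}, "null"]},
    ],
}

print(to_pig(email, "emails"))
# prints: emails: (message_id: chararray,from: {(from: chararray)})
```
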
-Scott

On 3/29/12 6:05 PM, "Russell Jurney" <[EMAIL PROTECTED]> wrote:

> Is it possible to name string elements in the schema of an array?
> Specifically, below I want to name the email addresses in the
> from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
> Pig's AvroStorage.  I know I can probably fix this in Java in the Pig
> AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema.
> Last time I read Avro's array docs in this context, my hit-points dropped by a
> third, so pardon me if I've not rtfm this time :)
>
> Complete description of what I'm doing follows:
>
> Avro schema for my emails:
>
>> {
>>     "namespace": "agile.data.avro",
>>     "name": "Email",
>>     "type": "record",
>>     "fields": [
>>         {"name":"message_id", "type": ["string", "null"]},
>>         {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
>>         {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
>>         {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
>>         {"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
>>         {"name":"reply_to", "type": [{"type":"array", "items":"string"},
>> "null"]},
>>         {"name":"in_reply_to", "type": [{"type":"array", "items":"string"},
>> "null"]},
>>         {"name":"subject", "type": ["string", "null"]},
>>         {"name":"body", "type": ["string", "null"]},
>>         {"name":"date", "type": ["string", "null"]}
>>     ]
>> }
>
> Pig to publish my Avros:
>
>> grunt> emails = load '/me/tmp/emails' using AvroStorage();
>> grunt> describe emails
>>
>> emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM:
>> chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER:
>> (ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM:
>> chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to:
>> {PIG_WRAPPER: (ARRAY_ELEM: chararray)},subject: chararray,body:
>> chararray,date: chararray}
>>
>> grunt> store emails into 'mongodb://localhost/agile_data.emails' using
>> MongoStorage();
>
> My emails in MongoDB:
>
>>> > db.emails.findOne()
>> {
>> "_id" : ObjectId("4f738a35414e113e75707b97"),
>> "message_id" : "<[EMAIL PROTECTED]>",
>> "from" : [
>> {
>> "ARRAY_ELEM" : "[EMAIL PROTECTED]"
>> }
>> ],
>> "to" : [
>> {
>> "ARRAY_ELEM" : "[EMAIL PROTECTED]"
>> }
>> ],