Avro, mail # user - AvroStorage/Avro Schema Question


Russell Jurney 2012-03-30, 01:05
Scott Carey 2012-04-02, 16:13
Russell Jurney 2012-04-10, 09:26
Re: AvroStorage/Avro Schema Question
Russell Jurney 2012-04-10, 09:36
Hmm, I'm unable to get this to work:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"froms","type": [{"type":"record", "name":"from", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"tos","type": [{"type":"record", "name":"to", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"ccs","type": [{"type":"record", "name":"cc", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"bccs","type": [{"type":"record", "name":"bcc", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"reply_tos","type": [{"type":"record", "name":"reply_to",
"fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"},
"null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}
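
One likely reason the schema above is rejected: in Avro, every entry in a record's "fields" list must be an object with its own "name" and "type", whereas the wrapper records here put a bare union directly in "fields". The following is a minimal sketch of a structural check for that one rule, not a full Avro validator. The helper invalid_record_fields and the inner field name "addresses" are hypothetical, chosen only for illustration.

```python
import json

def invalid_record_fields(schema, path="$"):
    """Return paths of record fields that are not objects with 'name' and 'type'."""
    problems = []
    if isinstance(schema, list):  # a union: check each branch
        for i, branch in enumerate(schema):
            problems += invalid_record_fields(branch, f"{path}[{i}]")
    elif isinstance(schema, dict):
        if schema.get("type") == "record":
            for i, field in enumerate(schema.get("fields", [])):
                where = f"{path}.fields[{i}]"
                if not (isinstance(field, dict) and "name" in field and "type" in field):
                    problems.append(where)
                else:
                    problems += invalid_record_fields(field["type"], where)
        elif schema.get("type") == "array":
            problems += invalid_record_fields(schema["items"], path + ".items")
    return problems

# Shape used in the message above: "fields" holds a bare union.
posted = json.loads("""
{"name": "Email", "type": "record", "fields": [
  {"name": "froms", "type": [
    {"type": "record", "name": "from",
     "fields": [{"type": "array", "items": "string"}, "null"]},
    "null"]}
]}
""")

# Same wrapper, but the array sits in a named field ("addresses" is hypothetical).
wrapped = json.loads("""
{"name": "Email", "type": "record", "fields": [
  {"name": "froms", "type": [
    {"type": "record", "name": "from",
     "fields": [{"name": "addresses",
                 "type": [{"type": "array", "items": "string"}, "null"]}]},
    "null"]}
]}
""")

posted_problems = invalid_record_fields(posted)
wrapped_problems = invalid_record_fields(wrapped)
print(posted_problems)   # paths of the malformed field entries
print(wrapped_problems)  # empty list when every field is named
```

Whether Pig's AvroStorage then shows the inner field name in place of ARRAY_ELEM is a separate question that this check does not answer.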

On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> In thinking about it more... it seems that unfortunately, the only thing I
> can really do is to change the schema for all email address fields:
>
> {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
> to:
> {"name":"froms","type": [{"type":"record", "name":"from", "fields":
> [{"type":"array", "items":"string"}, "null"]}, "null"]},
>
> That is, to pluralize everything and then individually name array
> elements. I will try running this through my stack.
>
>
> On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <[EMAIL PROTECTED]> wrote:
>
>> It appears that the Avro to PigStorage schema translation names the
>> elements of every array ARRAY_ELEM (in pig).  The nullable wrapper is
>> 'visible' and the field name is not moved onto the bag name.
>>
>> About a year and a half ago I started
>> https://issues.apache.org/jira/browse/AVRO-592
>>
>> but before finishing it AvroStorage was written elsewhere.  I don't
>> recall exactly what I did with the schema translation there, but I recall
>> the mapping from an Avro schema to pig tried to hide the nullable wrappers
>> more.
>>
>>
>> In Avro, arrays are unnamed types, so I see two things you could probably
>> do without any code changes:
>>
>> * Add a line in the pig script to project / rename the fields to what you
>> want (unfortunate and clumsy, but I think it will work). I think you want
>> "from::PIG_WRAPPER::ARRAY_ELEM as from" or
>> "FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", or something like that.
>> * Add a record wrapper to your schema (which may inject more messiness in
>> the pig schema view):
>> {
>>     "namespace": "agile.data.avro",
>>     "name": "Email",
>>     "type": "record",
>>     "fields": [
>>         {"name":"message_id", "type": ["string", "null"]},
>>         {"name":"from","type": [{"type":"record", "name":"From",
>> "fields": [[{"type":"array", "items":"string"},"null"]], "null"]},
>>        …
>>     ]
>> }
>>
>> But that is very awkward, since it requires a named record for each field
>> that is an unnamed type.
>>
>>
>> Ideally PigStorage would treat any union of null and one other thing as a
>> simple pig type with no wrapper, and project the name of a field or record
>> into the name of the thing inside a bag.
>>
>>
>> -Scott
>>
>> On 3/29/12 6:05 PM, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
>>
>> Is it possible to name string elements in the schema of an array?
>>  Specifically, below I want to name the email addresses in the
>> from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
>> Pig's AvroStorage.  I know I can probably fix this in Java in the Pig
>> AvroStorage UDF, but I'm hoping I can also fix it more easily in the
>> schema.  Last time I read Avro's array docs in this context, my hit-points
>> dropped by a third, so pardon me if I've not rtfm this time :)
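
The behavior Scott describes as ideal in the quoted message (treat a union of null and one other type as a plain type, and carry the field name onto the elements of the bag) can be sketched roughly as below. This is only an illustration of the idea on plain Python dicts, not code from Pig's AvroStorage; the function names unwrap_nullable and element_name are hypothetical.

```python
def unwrap_nullable(avro_type):
    """If avro_type is a union of "null" and exactly one other type, return that type."""
    if isinstance(avro_type, list) and len(avro_type) == 2 and "null" in avro_type:
        others = [t for t in avro_type if t != "null"]
        if len(others) == 1:
            return others[0]
    # Anything else (plain types, larger unions) is returned unchanged.
    return avro_type

def element_name(field):
    """Name for array elements: reuse the field name instead of a generic ARRAY_ELEM."""
    inner = unwrap_nullable(field["type"])
    if isinstance(inner, dict) and inner.get("type") == "array":
        return field["name"]
    return None  # not an array field, so there are no elements to name

# Field definitions taken from the original, unwrapped schema in this thread.
from_field = {"name": "from", "type": [{"type": "array", "items": "string"}, "null"]}
subject_field = {"name": "subject", "type": ["string", "null"]}

print(unwrap_nullable(subject_field["type"]))
print(element_name(from_field))
print(element_name(subject_field))
```

Under these assumptions the nullable wrapper disappears from view and the array elements of "from" would be called "from", which is the outcome the original question was after.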

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
Russell Jurney 2012-04-18, 02:30