AvroStorage/Avro Schema Question
Russell Jurney 2012-03-30, 01:05
Is it possible to name string elements in the schema of an array? Specifically, below I want to name the email addresses in the from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by Pig's AvroStorage. I know I can probably fix this in Java in the Pig AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema. Last time I read Avro's array docs in this context, my hit-points dropped by a third, so pardon me if I've not RTFM this time :)
Complete description of what I'm doing follows:
Avro schema for my emails:
{
  "namespace": "agile.data.avro",
  "name": "Email",
  "type": "record",
  "fields": [
    {"name":"message_id", "type": ["string", "null"]},
    {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"subject", "type": ["string", "null"]},
    {"name":"body", "type": ["string", "null"]},
    {"name":"date", "type": ["string", "null"]}
  ]
}

Pig to publish my Avros:

grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails

emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},subject: chararray,body: chararray,date: chararray}

grunt> store emails into 'mongodb://localhost/agile_data.emails' using MongoStorage();

My emails in MongoDB:

> db.emails.findOne()
{
  "_id" : ObjectId("4f738a35414e113e75707b97"),
  "message_id" : "<[EMAIL PROTECTED]>",
  "from" : [
    { "ARRAY_ELEM" : "[EMAIL PROTECTED]" }
  ],
  "to" : [
    { "ARRAY_ELEM" : "[EMAIL PROTECTED]" }
  ],
  "cc" : null,
  "bcc" : null,
  "reply_to" : null,
  "in_reply_to" : null,
  "subject" : "Daily Job Change Alerts from SalesLoft",
  "body" : "Daily Job Change Alerts from SalesLoft",
  "date" : "2012-03-27T08:00:29"
}

My email on screen:
[image: Inline image 1]
My face when I see ARRAY_ELEM, because it means more complex presentation code: *:(*

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
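Editor's sketch: until the schema or the loader changes, the single-key ARRAY_ELEM wrappers shown in the MongoDB document can be flattened on the presentation side. The Python below is only an illustration under stated assumptions: the document is assumed to be already fetched as a plain dict shaped like the findOne() output, the addresses are placeholders, and unwrap_array_elem is a hypothetical helper name, not part of Pig, Avro, or any MongoDB driver.

```python
def unwrap_array_elem(value, key="ARRAY_ELEM"):
    """Recursively replace single-key {key: x} wrappers with x."""
    if isinstance(value, dict):
        if set(value) == {key}:
            # A dict whose only key is ARRAY_ELEM is a wrapper: keep the payload.
            return unwrap_array_elem(value[key], key)
        return {k: unwrap_array_elem(v, key) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_array_elem(v, key) for v in value]
    return value  # strings, numbers, None pass through unchanged


# Placeholder document mirroring the shape of the stored email.
doc = {
    "message_id": "<id@example.com>",
    "from": [{"ARRAY_ELEM": "alice@example.com"}],
    "to": [{"ARRAY_ELEM": "bob@example.com"}],
    "cc": None,
    "subject": "Daily Job Change Alerts from SalesLoft",
}

flat = unwrap_array_elem(doc)
print(flat["from"])
```

This keeps the stored data as-is and only simplifies what the presentation code sees; it does not address the naming in the Pig schema itself.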
Re: AvroStorage/Avro Schema Question
Scott Carey 2012-04-02, 16:13
It appears as though the Avro to PigStorage schema translation names (in Pig) all arrays ARRAY_ELEM. The nullable wrapper is 'visible' and the field name is not moved onto the bag name.

About a year and a half ago I started https://issues.apache.org/jira/browse/AVRO-592 but before finishing it AvroStorage was written elsewhere. I don't recall exactly what I did with the schema translation there, but I recall the mapping from an Avro schema to Pig tried to hide the nullable wrappers more.

In Avro, arrays are unnamed types, so I see two things you could probably do without any code changes:

* Add a line in the Pig script to project / rename the fields to what you want (unfortunate and clumsy, but I think it will work). I think you want "from::PIG_WRAPPER::ARRAY_ELEM as from" or "FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in the Pig schema view):

{
  "namespace": "agile.data.avro",
  "name": "Email",
  "type": "record",
  "fields": [
    {"name":"message_id", "type": ["string", "null"]},
    {"name":"from","type": [{"type":"record", "name":"From", "fields": [[{"type":"array", "items":"string"},"null"]], "null"]},
    …
  ]
}

But that is very awkward, requiring a named record for each field that is an unnamed type.

Ideally PigStorage would treat any union of null and one other thing as a simple Pig type with no wrapper, and project the name of a field or record into the name of the thing inside a bag.

-Scott
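Editor's sketch of the "ideal" translation described above: treat a union of null and one other type as that other type, and carry the field name into the bag's element. This is an illustration only, written in Python against parsed schema JSON. It is not how AvroStorage behaves, the two function names are hypothetical, and the rendered string merely imitates the style of Pig's describe output.

```python
def unwrap_nullable(avro_type):
    """Return the non-null branch of a two-branch union containing "null".

    Any other type (including larger unions) is returned unchanged.
    """
    if isinstance(avro_type, list) and len(avro_type) == 2 and "null" in avro_type:
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return avro_type


def describe(avro_type, name):
    """Render a Pig-like description where a bag's element reuses the field name."""
    t = unwrap_nullable(avro_type)
    if isinstance(t, dict) and t.get("type") == "array":
        # The element inherits the field name instead of ARRAY_ELEM.
        return "%s: {(%s)}" % (name, describe(t["items"], name))
    if t == "string":
        return "%s: chararray" % name
    return "%s: %s" % (name, t)


from_type = [{"type": "array", "items": "string"}, "null"]
print(describe(from_type, "from"))
```

Only strings and arrays are handled here; other Avro types fall through to a generic rendering, which is enough to show the naming idea.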
Re: AvroStorage/Avro Schema Question
Russell Jurney 2012-04-10, 09:26
In thinking about it more... it seems that unfortunately, the only thing I can really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},

to:

{"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements. I will try running this through my stack.

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
Re: AvroStorage/Avro Schema Question
Russell Jurney 2012-04-10, 09:36
Hmmmm unable to get this to work:

{
  "namespace": "agile.data.avro",
  "name": "Email",
  "type": "record",
  "fields": [
    {"name":"message_id", "type": ["string", "null"]},
    {"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"tos","type": [{"type":"record", "name":"to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"ccs","type": [{"type":"record", "name":"cc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"bccs","type": [{"type":"record", "name":"bcc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"reply_tos","type": [{"type":"record", "name":"reply_to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"subject", "type": ["string", "null"]},
    {"name":"body", "type": ["string", "null"]},
    {"name":"date", "type": ["string", "null"]}
  ]
}

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
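Editor's note on one likely reason this schema is rejected: in Avro, every entry in a record's "fields" array must be a JSON object with at least a "name" and a "type", whereas the wrapper records above list a bare array type and the string "null" directly. The Python sketch below looks for such entries using only the standard library. find_bad_fields is a hypothetical helper, not part of the Avro library, and it checks this one rule only, not full schema validity.

```python
import json


def find_bad_fields(schema, path="$"):
    """Yield paths of record field entries that lack a name or a type."""
    if isinstance(schema, list):
        # A JSON list in type position is a union: check each branch.
        for i, branch in enumerate(schema):
            yield from find_bad_fields(branch, "%s[%d]" % (path, i))
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for i, field in enumerate(schema.get("fields", [])):
                here = "%s.fields[%d]" % (path, i)
                if isinstance(field, dict) and "name" in field and "type" in field:
                    yield from find_bad_fields(field["type"], here + ".type")
                else:
                    yield here
        elif kind == "array":
            yield from find_bad_fields(schema["items"], path + ".items")


# Two fields taken from the schema in the message above.
schema = json.loads("""
{
  "namespace": "agile.data.avro", "name": "Email", "type": "record",
  "fields": [
    {"name": "message_id", "type": ["string", "null"]},
    {"name": "froms", "type": [
      {"type": "record", "name": "from",
       "fields": [{"type": "array", "items": "string"}, "null"]},
      "null"]}
  ]
}
""")

bad = list(find_bad_fields(schema))
for p in bad:
    print("field entry without name/type at", p)
```

Both entries inside the "from" wrapper record are flagged, which matches the shape of the problem: the array and "null" were meant to be a field's type, not fields themselves.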
Re: AvroStorage/Avro Schema Question
Russell Jurney 2012-04-18, 02:30
The fix was this:
{
  "type": "record",
  "name": "Email",
  "fields": [
    {"name": "message_id", "type": ["null", "string"], "doc": ""},
    {"name": "in_reply_to", "type": ["string", "null"]},
    {"name": "subject", "type": ["string", "null"]},
    {"name": "body", "type": ["string", "null"]},
    {"name": "date", "type": ["string", "null"]},
    {
      "name": "froms",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "from",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "tos",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "to",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "ccs",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "cc",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "bccs",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "bcc",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "reply_tos",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "reply_to",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    }
  ]
}
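Editor's sketch: the five address fields in the fix share one pattern, a nullable array whose items are a nullable named record with real_name and address. Under the assumption that the schema is assembled programmatically before being written out, the repetition could be generated as below. address_list_field and PAIRS are hypothetical names introduced for illustration, and only the standard library is used.

```python
import json


def address_list_field(plural, singular):
    """Build one nullable array-of-records field, following the pattern in the fix."""
    person = {
        "type": "record",
        "name": singular,  # the named record gives Pig a name to use
        "fields": [
            {"name": "real_name", "type": ["null", "string"], "doc": ""},
            {"name": "address", "type": ["null", "string"], "doc": ""},
        ],
    }
    return {
        "name": plural,
        "type": ["null", {"type": "array", "items": ["null", person]}],
        "doc": "",
    }


# (field name, record name) pairs as they appear in the schema above.
PAIRS = [
    ("froms", "from"),
    ("tos", "to"),
    ("ccs", "cc"),
    ("bccs", "bcc"),
    ("reply_tos", "reply_to"),
]

fields = [address_list_field(plural, singular) for plural, singular in PAIRS]
print(json.dumps(fields[0], indent=2))
```

Each generated field has the same structure as its hand-written counterpart, so the list can be appended to the scalar fields (message_id, subject, and so on) when building the full record.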