Avro >> mail # user >> AvroStorage/Avro Schema Question


AvroStorage/Avro Schema Question
Is it possible to name the string elements of an array in the schema?
Specifically, below I want to name the email addresses in the
from/to/cc/bcc/reply_to fields so they don't get auto-named ARRAY_ELEM by
Pig's AvroStorage. I know I can probably fix this in Java in the Pig
AvroStorage UDF, but I'm hoping I can fix it more easily in the
schema. Last time I read Avro's array docs in this context, my hit points
dropped by a third, so pardon me if I haven't RTFM'd this time :)

Complete description of what I'm doing follows:

Avro schema for my emails:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"cc", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"bcc", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}
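
For reference, here is a minimal Python sketch (standard library only) that walks a trimmed copy of the schema above and prints what each array branch declares. The helper name array_fields is hypothetical and is not part of Avro or Pig. As far as I can tell from the Avro specification, an array schema carries only an "items" attribute, so the sketch simply shows that there is no per-element name in the schema for a reader to pick up.

```python
import json

# Trimmed copy of the Email schema above; two fields are enough to show the shape.
SCHEMA_JSON = """
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name": "message_id", "type": ["string", "null"]},
        {"name": "from", "type": [{"type": "array", "items": "string"}, "null"]}
    ]
}
"""


def array_fields(schema):
    """Return {field name: items schema} for each field whose type contains an array."""
    found = {}
    for field in schema["fields"]:
        # A field type is either a single schema or a union (a JSON list of schemas).
        branches = field["type"] if isinstance(field["type"], list) else [field["type"]]
        for branch in branches:
            if isinstance(branch, dict) and branch.get("type") == "array":
                found[field["name"]] = branch["items"]
    return found


schema = json.loads(SCHEMA_JSON)
arrays = array_fields(schema)
print(arrays)
```

The array branch exposes only its element type; the field name ("from") belongs to the enclosing record, not to the elements.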
Pig to publish my Avros:

grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails

emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER:
(ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to:
{PIG_WRAPPER: (ARRAY_ELEM: chararray)},subject: chararray,body:
chararray,date: chararray}

grunt> store emails into 'mongodb://localhost/agile_data.emails' using
MongoStorage();
My emails in MongoDB:

> db.emails.findOne()
{
    "_id" : ObjectId("4f738a35414e113e75707b97"),
    "message_id" : "<[EMAIL PROTECTED]>",
    "from" : [
        {
            "ARRAY_ELEM" : "[EMAIL PROTECTED]"
        }
    ],
    "to" : [
        {
            "ARRAY_ELEM" : "[EMAIL PROTECTED]"
        }
    ],
    "cc" : null,
    "bcc" : null,
    "reply_to" : null,
    "in_reply_to" : null,
    "subject" : "Daily Job Change Alerts from SalesLoft",
    "body" : "Daily Job Change Alerts from SalesLoft",
    "date" : "2012-03-27T08:00:29"
}
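
Until the naming is sorted out in the schema or the UDF, one possible workaround on the presentation side is to unwrap the ARRAY_ELEM wrappers after reading the document. This is only a sketch: unwrap_array_elems is a hypothetical helper, the sample document is a shortened stand-in for the one above with placeholder addresses, and it assumes every wrapped element is a single-key dict.

```python
def unwrap_array_elems(doc, key="ARRAY_ELEM"):
    """Return a copy of doc where lists of {key: value} dicts become flat lists of values."""
    flat = {}
    for name, value in doc.items():
        is_wrapped_list = isinstance(value, list) and all(
            isinstance(item, dict) and set(item) == {key} for item in value
        )
        # Only lists made up entirely of single-key wrappers are unwrapped;
        # everything else (strings, nulls, other lists) is passed through unchanged.
        flat[name] = [item[key] for item in value] if is_wrapped_list else value
    return flat


# Shortened stand-in for the MongoDB document above (addresses are placeholders).
email = {
    "message_id": "<id-placeholder>",
    "from": [{"ARRAY_ELEM": "sender@example.com"}],
    "to": [{"ARRAY_ELEM": "recipient@example.com"}],
    "cc": None,
    "subject": "Daily Job Change Alerts from SalesLoft",
}

flat_email = unwrap_array_elems(email)
print(flat_email["from"], flat_email["to"], flat_email["cc"])
```

This keeps the stored data untouched and moves the cleanup to a single function, at the cost of running it on every read.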
My email on screen:

[image: Inline image 1]

My face when I see ARRAY_ELEM, because it means more complex presentation
code: :(
--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com