Avro, mail # dev - AvroStorage pig adapter


RE: AvroStorage pig adapter
Thiruvalluvan M. G. 2010-05-14, 07:17
Hi Scott,

As part of a benchmarking program that compares the serialization and
de-serialization performance of Avro against Protocol Buffers and Thrift, I
used the following schema to represent Pig tuples. (I should publish the
code and results, but I haven't found time to clean them up.)

{
    "type" : "record",
    "name" : "AvroTuple",
    "namespace" : "org.apache.avro.bench.pig.tuple.avro",
    "fields" :
    [
        {
            "name" : "elements",
            "type" : {
                "type" : "array",
                "items" :
                {
                    "type" : "record",
                    "name" : "Element",
                    "fields" :
                    [
                        {
                            "name" : "value",
                            "type" :
                            [
                                { "type" : "array", "items" : "AvroTuple" },
                                { "type" : "record", "name": "AvroBigCharArray",
                                    "fields" : [ { "name": "data",
                                        "type": "string" } ] },
                                "boolean",
                                { "type" : "record", "name": "AvroByte",
                                    "fields" : [ { "name": "data",
                                        "type": "int" } ] },
                                "bytes",
                                "string",
                                "double",
                                { "type" : "record", "name": "AvroError",
                                    "fields" : [ { "name": "data",
                                        "type": "null" } ] },
                                "float",
                                "int",
                                { "type" : "record", "name": "AvroInternalMap",
                                    "fields" : [ { "name": "data",
                                        "type": "null" } ] },
                                "long",
                                { "type" : "map", "values" : "Element" },
                                "null",
                                "AvroTuple"
                            ]
                        }
                    ]
                }
            }
        }
    ]
}

What you are looking for is the inner record called Element within the
AvroTuple. I named the outer record AvroTuple because I wrote IDLs for
Protocol Buffers and Thrift and wanted the class names to be unambiguous.

The tuple should actually be an array rather than a record. But since arrays
cannot be named in Avro, I wrapped the array in a record. Note that wrapping
an object in a record costs nothing in Avro's binary format. I use the same
technique to map more than one Pig type onto a single Avro type. For
instance, Pig's string and Pig's BigCharArray are both represented by an Avro
string; I use a record to distinguish between them.
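
To illustrate why the wrapper record is free, here is a minimal sketch that
hand-encodes a few values following the binary encoding rules in the Avro
specification (zig-zag varint for int/long, a length-prefixed UTF-8 string,
a record as the plain concatenation of its fields, a union as a branch index
followed by the value). It is not the benchmark code and does not use the
Avro library; the helper names are made up for this example, and the branch
indices 5 and 1 are simply the zero-based positions of "string" and
AvroBigCharArray in the union above.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an int/long as zig-zag followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes stay small
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)   # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(field_encodings) -> bytes:
    """A record is its fields' encodings concatenated in schema order."""
    return b"".join(field_encodings)


def encode_union(branch_index: int, payload: bytes) -> bytes:
    """A union is the zero-based branch index (as a long), then the value."""
    return zigzag_varint(branch_index) + payload


plain = encode_string("pig")
wrapped = encode_record([encode_string("pig")])   # record with one string field

# Same text stored as a Pig string (branch 5) or as AvroBigCharArray (branch 1).
as_string = encode_union(5, plain)
as_big_char_array = encode_union(1, wrapped)

print(plain == wrapped)                            # the wrapper adds no bytes
print(len(as_string), len(as_big_char_array))      # only the branch index differs
```

The record contributes no bytes of its own, so the only thing that tells the
two Pig types apart on the wire is the union branch index.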

Does it solve your problem?

Thanks

Thiru
-----Original Message-----
From: Scott Carey [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 14, 2010 11:06 AM
To: [EMAIL PROTECTED]
Subject: AvroStorage pig adapter

I'm working on some prototypes for an org.apache.avro.pig package for java
that contains avro <> pig storage adapters.

This is going fairly well.  I already have org.apache.avro.mapreduce done
for InputFormat and OutputFormat (Pig requires use of the 0.20 API).  Once I
get testing working I'll submit a patch and JIRA for that.  We also need to
package these libraries in a different jar than the core Avro content.

However there are some difficulties that I could use a little help on.   All
Pig datatypes map to Avro easily except for the Pig MAP datatype.

A Pig map, like an Avro map, must have a string as a key.  However, its
value type is essentially Object and can be any pig type.  A single pig map
might have the contents:
"city" > "San Francisco"
"elevation" > 25

Avro map values must all have the same type.  A pig schema does not define
what type is inside the map.  Other pig serializations dynamically handle
it.

In Avro, this seems straightforward at first -- the value of the map must be
a Union of all possible pig types:
  [ null, boolean, int, long, float, double, chararray, bytearray, tuple,
bag, map ]

The problem comes in with the last three.   I'm fairly sure there is no
valid Avro schema to represent this situation.  The tuple in the union can
be a tuple of any possible composition -- avro requires defining its fields
in advance.  Likewise, the bag can contain tuples of any possible
composition.  The map has to self-reference, and there's a bit of a
chicken-egg problem there.  In order to create the map schema I have to have
the union containing it already created.
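
For reference, the Avro specification lets a named type (record, enum, fixed)
be referenced by name inside its own definition, which is how the schema in
the reply above handles the self-reference. The sketch below is only an
illustration under that assumption: PigValue is a hypothetical record name,
the schema collapses tuple and bag into a single array branch (a real adapter
would need to tell them apart, for example with wrapper records), and
unresolved_names is a made-up helper written with the standard library only,
not part of any Avro API.

```python
import json

# Hypothetical recursive schema: the map and array branches refer back to
# the enclosing record by name, so the union never has to exist "first".
SCHEMA = json.loads("""
{
  "type": "record", "name": "PigValue",
  "fields": [{
    "name": "value",
    "type": ["null", "boolean", "int", "long", "float", "double",
             "string", "bytes",
             {"type": "array", "items": "PigValue"},
             {"type": "map", "values": "PigValue"}]
  }]
}
""")

PRIMITIVES = {"null", "boolean", "int", "long",
              "float", "double", "bytes", "string"}


def unresolved_names(schema, known=None):
    """List type names that are neither primitive nor already defined."""
    known = set() if known is None else known
    if isinstance(schema, str):                    # primitive or name reference
        return [] if schema in PRIMITIVES or schema in known else [schema]
    if isinstance(schema, list):                   # union: check every branch
        return [n for branch in schema for n in unresolved_names(branch, known)]
    kind = schema["type"]
    if kind == "record":
        known.add(schema["name"])                  # visible inside its own fields
        return [n for field in schema["fields"]
                for n in unresolved_names(field["type"], known)]
    if kind in ("enum", "fixed"):
        known.add(schema["name"])
        return []
    if kind == "array":
        return unresolved_names(schema["items"], known)
    if kind == "map":
        return unresolved_names(schema["values"], known)
    return unresolved_names(kind, known)           # e.g. {"type": "string"}


print(unresolved_names(SCHEMA))                                   # every name resolves
print(unresolved_names({"type": "map", "values": "PigValue"}))    # bare map does not
```

The bare map fails because nothing has defined PigValue yet; wrapping the
union in a named record gives the map something to point at.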

If I support only maps that contain simple value types, then pig can only
detect the failure at runtime when the output is written and a complex map
value is encountered during serialization.
I can support these arbitrary types by serializing them to byte[] via their
Writable API and storing these as an avro bytes type.  That is a hack that
I'd rather avoid but looks to be the only way out.

Unless I'm missing something, we can't serialize Pig maps in pure Avro
unless either:

* Avro adds some sort of 'dynamic' typing for records that have unknown
fields at schema create time.  For example, {"name": "unknown", "type":
"dynamic-record"} can signify an untyped collection of fields, each field in
binary can be prefixed by a type byte and the field names auto-generated
(perhaps "$0", "$1", etc).
* Pig makes its map values strictly typed, like Avro and Hive.  I'd also
like to see Pig and Avro support maps that have integer and long keys like
Hive but that is a separate concern.

On the other side -- reading an avro container file into a Pig schema -- there
are a few limitations:

An Avro schema cannot be translated cleanly to Pig if:
  There is a union that is more than a union of NU