Avro, mail # user - using Avro unions with HIVE


Ran S 2013-05-23, 14:15
Re: using Avro unions with HIVE
Scott Carey 2013-05-23, 18:45
The Hive mailing list would have more info on the Avro SerDe usage.

In general, a system that does not have union types, like Hive (or Pig,
etc.), has to expand a union into multiple fields if there is more than one
non-null type, and at most one of those fields is non-null in any given
record.

For example a record with fields:

  {"name":"timestamp", "type":"long", "default":-1}
  {"name":"ipAddress", "type":["IPv4", "IPv6"]}

where IPv4 and IPv6 are previously defined types, would have to expand to
three fields: "timestamp", "ipAddress:IPv4", and "ipAddress:IPv6", where
only one of the last two is not null in any given record.
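
The expansion described above can be sketched in a few lines of Python. This
is only an illustration of the general idea, not what Hive's Avro SerDe
actually does. The helper names (branch_name, expand_fields, flatten) and the
convention of passing a union value as a (branch, value) pair are assumptions
made up for this example.

```python
def branch_name(branch):
    """Name a union branch: a string as-is, or an inline type's name/type."""
    if isinstance(branch, str):
        return branch
    return branch.get("name", branch["type"])


def non_null_branches(ftype):
    """Return the non-null branch names if ftype is a union, else an empty list."""
    if not isinstance(ftype, list):
        return []
    return [branch_name(b) for b in ftype if b != "null"]


def expand_fields(fields):
    """Flat column names: one column per branch for multi-branch unions."""
    columns = []
    for field in fields:
        branches = non_null_branches(field["type"])
        if len(branches) > 1:
            columns.extend(f'{field["name"]}:{b}' for b in branches)
        else:
            columns.append(field["name"])
    return columns


def flatten(record, fields):
    """Flatten one record; union values are hypothetical (branch, value) pairs."""
    row = {}
    for field in fields:
        name = field["name"]
        branches = non_null_branches(field["type"])
        if len(branches) > 1:
            # Every branch column starts out null; at most one gets a value.
            for b in branches:
                row[f"{name}:{b}"] = None
            tagged = record.get(name)
            if tagged is not None:
                branch, value = tagged
                row[f"{name}:{branch}"] = value
        else:
            row[name] = record.get(name)
    return row


fields = [
    {"name": "timestamp", "type": "long", "default": -1},
    {"name": "ipAddress", "type": ["IPv4", "IPv6"]},
]
columns = expand_fields(fields)
row = flatten({"timestamp": 5, "ipAddress": ("IPv4", "10.0.0.1")}, fields)
print(columns)
print(row)
```

A union with a single non-null branch, such as ["null", "string"], stays as
one nullable column in this sketch, since there is nothing to disambiguate.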

I do not know what Hive's Avro SerDe does with unions.

On 5/23/13 7:15 AM, "Ran S" <[EMAIL PROTECTED]> wrote:

>Hi,
>We started to work with Avro in CDH4 and to query the Avro files using
>Hive.
>This does work fine for us, except for unions.
>We do not understand how to query the data inside a union using Hive.
>
>For example, let's look at the following schema:
>
>{
> "type":"record",
> "name":"event",
> "namespace":"com.mysite",
> "fields":[
>    {
>        "name":"header",
>        "type":{
>            "type":"record", "name":"CommonHeader",
>            "fields":[{ "name":"eventTimeStamp", "type":"long", "default":-1
>},
>                      { "name":"globalUserId", "type":["null", "string"],
>"default":null } ]
>        },
>        "default":null
>    },
>    {
>        "name":"eventbody",
>        "type":{
>            "type":"record", "name":"eventbody",
>            "fields":[
>                {
>                    "name":"body",
>                    "type":[
>                       "null",
>                       {
>                        "type":"record",
>                        "name":"event1",
>                        "fields":[
>                            {
>                                "name":"event1Header",
>                                "type":["null", { "type":"array",
>"items":"string" }], "default":null
>                            },
>                            {
>                                "name":"event1Body",
>                                "type":["null", { "type":"array",
>"items":"string" }], "default":null
>                            }
>                        ]
>                    },
>                   {
>                        "type":"record",
>                        "name":"event2",
>                        "fields":[
>                            {
>                                "name":"page",
>                                "type":{
>                                    "type":"record", "name":"URL",
>"fields":[{ "name":"url", "type":"string" }]
>                                },
>                                "default":null
>                            },
>                            {
>                                "name":"referrer", "type":"string",
>"default":null
>                            }
>                        ]
>                    }
>                    ],
>                    "default":null
>                }
>            ]
>        },
>        "default":null
>    }
>]}
>
>Note that "body" is a union of three types:
>null, "event1" and "event2"
>
>So if I want to query fields inside event1, I first need to access it.
>I then write a HiveQL query like this:
>SELECT eventbody.body.??? from SRC
>
>My question is: what should I put in place of the ??? above to make this work?
>
>Thank you,
>Ran
>
>
>
>--
>View this message in context:
>http://apache-avro.679487.n3.nabble.com/using-Avro-unions-with-HIVE-tp4027473.html
>Sent from the Avro - Users mailing list archive at Nabble.com.
Mark Wagner 2013-05-23, 20:08
Ran S 2013-05-26, 05:04