Hive, mail # user - Bucketing external tables


Re: Bucketing external tables
Sadananda Hegde 2013-04-05, 22:02
Thanks, Mark.

I found the problem. For some reason, Hive is not able to write an Avro output
file when the schema has a complex field unioned with "null". It reads that
structure without any problem, but it cannot write it. For example, the insert
was failing on this array-of-struct field.

{ "name": "Passenger", "type":
                       [{"type":"array","items":
                           {"type":"record",
                             "name": "PAXStruct",
                             "fields": [
                                       { "name":"PAXCode",
"type":["string", "null"] },
                                       {
"name":"PAXQuantity","type":["int", "null"] }
                                       ]
                           }
                        }, "null"]
     }

I removed the last "null" clause and it's working okay now.
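
For reference, the working definition of that field looks roughly like
this, with the array type used directly instead of the outer
["array", "null"] union (the nested PAXCode and PAXQuantity fields keep
their own unions with "null"):

{ "name": "Passenger",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "PAXStruct",
      "fields": [
        { "name": "PAXCode",     "type": ["string", "null"] },
        { "name": "PAXQuantity", "type": ["int", "null"] }
      ]
    }
  }
}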

Regards,
Sadu
On Thu, Apr 4, 2013 at 12:36 AM, Mark Grover <[EMAIL PROTECTED]> wrote:

> Can you please check your JobTracker logs? This is a generic error related
> to grabbing the Task Attempt Log URL; the real error is in the JT logs.
>
>
> On Wed, Apr 3, 2013 at 7:17 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>
>> Hi Dean,
>>
>> I tried inserting into a bucketed Hive table from a non-bucketed table using
>> an INSERT OVERWRITE .... SELECT FROM clause, but I get the following error.
>>
>> ----------------------------------------------------------------------------------
>> Exception in thread "Thread-225" java.lang.NullPointerException
>>         at
>> org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
>>         at
>> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
>>         at
>> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
>>         at java.lang.Thread.run(Thread.java:662)
>> FAILED: Execution Error, return code 2 from
>> org.apache.hadoop.hive.ql.exec.MapRedTask
>>
>> --------------------------------------------------------------------------------------------------------------------------
>>
>> Both tables have the same structure, except that one has a CLUSTERED BY
>> clause and the other does not.
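>>
>> For illustration, a minimal HiveQL sketch of that setup (table names,
>> columns, and the bucket count are hypothetical placeholders):
>>
>>     -- non-bucketed source table
>>     CREATE TABLE flights_stage (
>>       flight_id  STRING,
>>       passenger  ARRAY<STRUCT<paxcode:STRING, paxquantity:INT>>
>>     );
>>
>>     -- same columns, plus a CLUSTERED BY clause
>>     CREATE TABLE flights_bucketed (
>>       flight_id  STRING,
>>       passenger  ARRAY<STRUCT<paxcode:STRING, paxquantity:INT>>
>>     )
>>     CLUSTERED BY (flight_id) INTO 32 BUCKETS;
>>
>>     -- have Hive produce the buckets during the insert
>>     SET hive.enforce.bucketing = true;
>>     INSERT OVERWRITE TABLE flights_bucketed
>>     SELECT flight_id, passenger FROM flights_stage;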
>>
>> Some columns are defined as Array of Structs. The Insert statement works
>> fine if I take out those complex columns. Are there any known issues
>> loading STRUCT or ARRAY OF STRUCT fields?
>>
>>
>> Thanks for your time and help.
>>
>> Sadu
>>
>>
>>
>>
>> On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
>> [EMAIL PROTECTED]> wrote:
>>
>>> The table can be external. You should be able to use this data with
>>> other tools, because all bucketing does is ensure that all occurrences of
>>> records with a given key are written into the same block. This is why
>>> clustered/blocked data can be joined on those keys using map-side joins;
>>> Hive knows it can cache an individual block in memory and the block will
>>> hold all records across the table for the keys in that block.
>>>
>>> So, Java MR apps and Pig can still read the records, but they won't
>>> necessarily understand how the data is organized. I.e., it might appear
>>> unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
>>> but I'm not sure.
>>>
>>> dean
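>>>
>>> For illustration, a rough sketch of what that enables, assuming two external
>>> tables bucketed on the same key (names, paths, and bucket counts are made up):
>>>
>>>     CREATE EXTERNAL TABLE orders (order_id STRING, customer_id STRING)
>>>     CLUSTERED BY (customer_id) INTO 32 BUCKETS
>>>     LOCATION '/data/orders';
>>>
>>>     CREATE EXTERNAL TABLE customers (customer_id STRING, name STRING)
>>>     CLUSTERED BY (customer_id) INTO 32 BUCKETS
>>>     LOCATION '/data/customers';
>>>
>>>     -- let Hive turn the join into a map-side (bucket map) join
>>>     SET hive.optimize.bucketmapjoin = true;
>>>     SELECT o.order_id, c.name
>>>     FROM orders o JOIN customers c ON (o.customer_id = c.customer_id);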
>>>
>>>
>>> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>>>
>>>> Thanks, Dean.
>>>>
>>>> Does that mean this bucketing is exclusively a Hive feature and not
>>>> available to others like Java, Pig, etc.?
>>>>
>>>> And also, my final tables have to be managed tables, not external
>>>> tables, right?
>>>>
>>>> Thanks again for your time and help.
>>>>
>>>> Sadu
>>>>
>>>>
>>>>
>>>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> I don't know of any way to avoid creating new tables and moving the
>>>>> data. In fact, that's the official way to do it, from a temp table to the
>>>>> final table, so Hive can ensure the bucketing is done correctly: