Hive, mail # user - Bucketing external tables


Re: Bucketing external tables
Mark Grover 2013-04-04, 05:36
Can you please check your JobTracker logs? This is a generic error related
to grabbing the Task Attempt Log URL; the real error is in the JT logs.

On Wed, Apr 3, 2013 at 7:17 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:

> Hi Dean,
>
> I tried inserting a bucketed hive table from a non-bucketed table using
> insert overwrite .... select from clause; but I get the following error.
>
> ----------------------------------------------------------------------------------
> Exception in thread "Thread-225" java.lang.NullPointerException
>         at
> org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
>         at
> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
>         at
> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
>         at java.lang.Thread.run(Thread.java:662)
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>
> --------------------------------------------------------------------------------------------------------------------------
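>
> For reference, a minimal sketch of the statement shape involved (table
> names here are made up, not from the actual job):
>
>     -- bucketed_target is declared with CLUSTERED BY (key) INTO 32 BUCKETS;
>     -- staging has the same schema without the CLUSTERED BY clause
>     SET hive.enforce.bucketing = true;
>     INSERT OVERWRITE TABLE bucketed_target
>     SELECT * FROM staging;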
>
> Both tables have the same structure, except that one has a CLUSTERED BY
> clause and the other does not.
>
> Some columns are defined as Array of Structs. The Insert statement works
> fine if I take out those complex columns. Are there any known issues
> loading STRUCT or ARRAY OF STRUCT fields?
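>
> For illustration, a column of that shape would be declared roughly like
> this in the DDL (table and field names here are made up):
>
>     CREATE TABLE events (
>       id BIGINT,
>       line_items ARRAY<STRUCT<sku:STRING, qty:INT>>
>     )
>     CLUSTERED BY (id) INTO 32 BUCKETS;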
>
>
> Thanks for your time and help.
>
> Sadu
>
>
>
>
> On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
> [EMAIL PROTECTED]> wrote:
>
>> The table can be external. You should be able to use this data with other
>> tools, because all bucketing does is ensure that all records with a
>> given key are written into the same block. This is why
>> clustered/blocked data can be joined on those keys using map-side joins;
>> Hive knows it can cache an individual block in memory and the block will
>> hold all records across the table for the keys in that block.
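>>
>> As a sketch of how a bucket map join is invoked (table names and
>> settings here are illustrative, not from your setup):
>>
>>     -- both tables bucketed on the join key, with compatible bucket counts
>>     SET hive.optimize.bucketmapjoin = true;
>>     SELECT /*+ MAPJOIN(b) */ a.key, b.val
>>     FROM big_a a JOIN big_b b ON (a.key = b.key);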
>>
>> So, Java MR apps and Pig can still read the records, but they won't
>> necessarily understand how the data is organized. I.e., it might appear
>> unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
>> but I'm not sure.
>>
>> dean
>>
>>
>> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks, Dean.
>>>
>>> Does that mean this bucketing is exclusively a Hive feature, not
>>> available to others like Java, Pig, etc.?
>>>
>>> And also, my final tables have to be managed tables, not external
>>> tables, right?
>>>
>>> Thanks again for your time and help.
>>>
>>> Sadu
>>>
>>>
>>>
>>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> I don't know of any way to avoid creating new tables and moving the
>>>> data. In fact, that's the official way to do it, from a temp table to the
>>>> final table, so Hive can ensure the bucketing is done correctly:
>>>>
>>>>  https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>>
>>>> In other words, you might have a big move now, but going forward,
>>>> you'll want to stage your data in a temp table, use this procedure to put
>>>> it in the final location, then delete the temp data.
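>>>>
>>>> A rough sketch of that workflow (table names are made up for
>>>> illustration):
>>>>
>>>>     -- 1. Stage new data in a plain (non-bucketed) temp table
>>>>     CREATE TABLE staging_daily (key BIGINT, val STRING);
>>>>     LOAD DATA INPATH '/incoming/day1' INTO TABLE staging_daily;
>>>>
>>>>     -- 2. Let Hive rewrite it into the bucketed final table
>>>>     --    (final_bucketed declared CLUSTERED BY (key) INTO N BUCKETS)
>>>>     SET hive.enforce.bucketing = true;
>>>>     INSERT OVERWRITE TABLE final_bucketed
>>>>     SELECT key, val FROM staging_daily;
>>>>
>>>>     -- 3. Drop the temp table and its data
>>>>     DROP TABLE staging_daily;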
>>>>
>>>> dean
>>>>
>>>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We run M/R jobs to parse and process large and highly complex XML
>>>>> files into Avro files. Then we build external Hive tables on top of
>>>>> the parsed Avro files. The Hive tables are partitioned by day, but the
>>>>> partitions are still huge and joins do not perform that well. So I would like to try
>>>>> out creating buckets on the join key. How do I create the buckets on the
>>>>> existing HDFS files? I would prefer to avoid creating another set of tables
>>>>> (bucketed) and load data from non-bucketed table to bucketed tables if at
>>>>> all possible. Is it possible to do the bucketing in Java as part of the M/R