Hive, mail # user - Bucketing external tables


Sadananda Hegde 2013-03-29, 21:58
Dean Wampler 2013-03-29, 22:57
Sadananda Hegde 2013-03-30, 22:44
Dean Wampler 2013-03-31, 00:00
Re: Bucketing external tables
Sadananda Hegde 2013-04-04, 02:17
Hi Dean,

I tried inserting into a bucketed Hive table from a non-bucketed table using
an INSERT OVERWRITE .... SELECT FROM statement, but I get the following error.
----------------------------------------------------------------------------------
Exception in thread "Thread-225" java.lang.NullPointerException
        at org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
        at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
        at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
        at java.lang.Thread.run(Thread.java:662)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
----------------------------------------------------------------------------------

Both tables have the same structure, except that one has a CLUSTERED BY
clause and the other does not.

Some columns are defined as arrays of structs. The INSERT statement works
fine if I take out those complex columns. Are there any known issues with
loading STRUCT or ARRAY of STRUCT fields?
Thanks for your time and help.

Sadu
On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
[EMAIL PROTECTED]> wrote:

> The table can be external. You should be able to use this data with other
> tools, because all bucketing does is ensure that all occurrences for
> records with a given key are written into the same block. This is why
> clustered/blocked data can be joined on those keys using map-side joins;
> Hive knows it can cache an individual block in memory and the block will
> hold all records across the table for the keys in that block.
>
> So, Java MR apps and Pig can still read the records, but they won't
> necessarily understand how the data is organized. I.e., it might appear
> unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
> but I'm not sure.
>
> dean
>
>
> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>
>> Thanks, Dean.
>>
>> Does that mean that bucketing is exclusively a Hive feature and is not
>> available to others like Java, Pig, etc.?
>>
>> Also, my final tables have to be managed tables, not external tables,
>> right?
>>
>> Thanks again for your time and help.
>>
>> Sadu
>>
>>
>>
>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>> [EMAIL PROTECTED]> wrote:
>>
>>> I don't know of any way to avoid creating new tables and moving the
>>> data. In fact, that's the official way to do it, from a temp table to the
>>> final table, so Hive can ensure the bucketing is done correctly:
>>>
>>>  https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>
>>> In other words, you might have a big move now, but going forward, you'll
>>> want to stage your data in a temp table, use this procedure to put it in
>>> the final location, then delete the temp data.
>>>
>>> dean
>>>
>>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello,
>>>>
>>>> We run M/R jobs to parse and process large and highly complex XML files
>>>> into Avro files. Then we build external Hive tables on top of the parsed
>>>> Avro files. The Hive tables are partitioned by day, but the partitions
>>>> are still huge and joins do not perform that well. So I would like to try
>>>> creating buckets on the join key. How do I create the buckets on the
>>>> existing HDFS files? I would prefer to avoid creating another set of
>>>> (bucketed) tables and loading data from the non-bucketed tables into the
>>>> bucketed tables, if at all possible. Is it possible to do the bucketing
>>>> in Java as part of the M/R jobs while creating the Avro files?
>>>>
>>>> Any help / insight would be greatly appreciated.
>>>>
>>>> Thank you very much for your time and help.
>>>>
>>>> Sadu
>>>>
>>>
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>>
>>>
>>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
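
For anyone who finds the bucketing description in Dean's quoted message hard to picture, here is a minimal Python sketch of the idea. It is not Hive's implementation. It assumes integer join keys that act as their own hash, a made-up bucket count of 4, and invented table contents; Hive's real hash function depends on the column type and the Hive version. The names bucket_for, write_buckets and bucket_join are hypothetical helpers, not Hive APIs.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a key to a bucket: non-negative hash modulo the bucket count.

    Simplified model: the integer key is used as its own hash. This is an
    assumption for the sketch, not a statement about Hive's exact hashing.
    """
    return (key & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) rows into buckets, one list per bucket "file"."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets


def bucket_join(left, right, num_buckets: int = NUM_BUCKETS):
    """Join two tables bucketed the same way, one bucket pair at a time.

    Only a single bucket of `right` is held in memory while the matching
    bucket of `left` is scanned, because equal keys share a bucket number.
    """
    joined = []
    for b in range(num_buckets):
        lookup = defaultdict(list)
        for key, value in right.get(b, []):
            lookup[key].append(value)
        for key, value in left.get(b, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


# Invented sample data: both tables are bucketed on the join key.
orders = write_buckets([(101, "order-a"), (206, "order-b"), (101, "order-c")])
users = write_buckets([(101, "alice"), (206, "bob"), (300, "carol")])
result = bucket_join(orders, users)
```

The join never compares rows from different buckets, which is why both tables need the same bucketing key and compatible bucket counts. A reader that ignores the bucketing (a plain MR job or Pig script, say) would still see every row, just without knowing this layout.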
Mark Grover 2013-04-04, 05:36
Sadananda Hegde 2013-04-05, 22:02
Mark Grover 2013-04-06, 15:07
Sadananda Hegde 2013-04-11, 17:46
Bejoy KS 2013-04-16, 15:13