Re: Bucketing external tables
The table can be external. You should be able to use this data with other
tools, because all bucketing does is ensure that all records with a given
key are written into the same bucket. This is why clustered/bucketed data
can be joined on those keys using map-side joins; Hive knows it can cache
an individual bucket in memory and that bucket will hold all records across
the table for the keys in it.
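
To make the "same key, same place" property concrete, here is a minimal
sketch in plain Java. It is an illustration, not Hive's code: the class
and method names are hypothetical, Java's hashCode() stands in for Hive's
own hash function, and the "mask the sign bit, then take the remainder"
rule is assumed to mirror what Hive does when it assigns rows to buckets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash bucketing; not taken from Hive's source.
class BucketSketch {

    // Assumption: clear the sign bit so the result is non-negative,
    // then take the remainder by the number of buckets.
    static int bucketFor(Object key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Group keys by bucket number. Every occurrence of a key goes to the
    // same bucket because the bucket depends only on the key's hash.
    static Map<Integer, List<String>> assign(List<String> keys, int numBuckets) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String key : keys) {
            int bucket = bucketFor(key, numBuckets);
            buckets.computeIfAbsent(bucket, b -> new ArrayList<>()).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("u1", "u2", "u1", "u3", "u2", "u1");
        // Prints a map from bucket number to the keys that landed in it.
        System.out.println(assign(keys, 4));
    }
}
```

If two tables are bucketed on the same key with the same number of
buckets, bucket i of one table only needs to be matched against bucket i
of the other, which is what lets a map-side join hold one bucket in
memory at a time.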

So, Java MR apps and Pig can still read the records, but they won't
necessarily understand how the data is organized. I.e., it might appear
unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
but I'm not sure.

dean

On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:

> Thanks, Dean.
>
> Does that mean bucketing is exclusively a Hive feature and not
> available to others like Java, Pig, etc.?
>
> Also, my final tables have to be managed tables, not external tables,
> right?
>
> Thanks again for your time and help.
>
> Sadu
>
>
>
> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <[EMAIL PROTECTED]> wrote:
>
>> I don't know of any way to avoid creating new tables and moving the data.
>> In fact, that's the official way to do it, from a temp table to the final
>> table, so Hive can ensure the bucketing is done correctly:
>>
>>  https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>
>> In other words, you might have a big move now, but going forward, you'll
>> want to stage your data in a temp table, use this procedure to put it in
>> the final location, then delete the temp data.
>>
>> dean
>>
>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> We run M/R jobs to parse and process large and highly complex XML files
>>> into Avro files. Then we build external Hive tables on top of the parsed
>>> Avro files. The Hive tables are partitioned by day, but the partitions are
>>> still huge and joins do not perform that well. So I would like to try
>>> creating buckets on the join key. How do I create the buckets on the
>>> existing HDFS files? I would prefer to avoid creating another set of
>>> (bucketed) tables and loading data from the non-bucketed tables into the
>>> bucketed tables, if at all possible. Is it possible to do the bucketing in
>>> Java as part of the M/R jobs while creating the Avro files?
>>>
>>> Any help / insight would be greatly appreciated.
>>>
>>> Thank you very much for your time and help.
>>>
>>> Sadu
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330