Hive, mail # user - Optimizing ORC Sorting - Replace two level Partitions with one?


Re: Optimizing ORC Sorting - Replace two level Partitions with one?
Edward Capriolo 2013-08-10, 17:19
Bucketing does deal with that: if you bucket on a column, you always get the
same number of files as buckets, because you're hashing the value into a
bucket.

A query that scans many partitions and files is needlessly slow because of
MapReduce overhead.
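
To make the hashing point concrete, here is a minimal sketch, not Hive's actual implementation. The bucket count, the source names, and the helpers bucket_for and files_written are all hypothetical, and zlib.crc32 is only a deterministic stand-in for whatever hash Hive applies to the column. It assumes one output file per bucket, as described above.

```python
import zlib

NUM_BUCKETS = 10  # hypothetical bucket count fixed in the table definition


def bucket_for(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a column value to a bucket index.

    zlib.crc32 is a stand-in chosen because it is deterministic;
    it is NOT the hash function Hive uses.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def files_written(sources: list[str], num_buckets: int = NUM_BUCKETS) -> dict[int, list[str]]:
    """Group sources by bucket, assuming one file per bucket (possibly empty)."""
    files: dict[int, list[str]] = {b: [] for b in range(num_buckets)}
    for source in sources:
        files[bucket_for(source, num_buckets)].append(source)
    return files


# Hypothetical days with a different number of distinct sources.
quiet_day = [f"source_{i}" for i in range(4)]
busy_day = [f"source_{i}" for i in range(14)]

print(len(files_written(quiet_day)))  # 10
print(len(files_written(busy_day)))   # 10
```

Both days produce the same number of files because the count comes from the bucket definition, not from how many distinct sources happen to show up that day.
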
On Sat, Aug 10, 2013 at 12:58 PM, John Omernik <[EMAIL PROTECTED]> wrote:

> One issue with the bucketing is that the number of sources on any given
> day is dynamic. On some days it's 4, on others it's 14, and it's
> constantly changing. I am hoping to use some of the features of ORC
> files to make what are almost virtual partitions, but apparently I am
> going to run into issues either way.
>
> On another note, is there a limit on partitions in Hive? I am hovering
> around 10k partitions on one table right now. It's still working, but some
> metadata operations can take a long time. The sub-partitions are going to
> hurt me here going forward, I am guessing, so it may be worth flattening out
> to only days, even at the expense of read queries... thoughts?
>
>
>
> On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>
>> Agree with Edward.
>>
>> For me, the whole purpose of bucketing is to prune the data in the WHERE
>> clause. Otherwise it totally defeats the purpose of splitting data into a
>> finite number of identifiable distributions to improve performance.
>>
>> But is my understanding correct that it does help limit the number of
>> sub-partitions we create at the bottom of the table, if we can establish
>> that the column does not exceed a finite number of values in those
>> partitions? (Even if it crosses this limit, bucketing does take care of it
>> up to some volume.)
>>
>>
>> On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>>
>>> So there is one thing to be really careful about with bucketing. Say you
>>> bucket a table into 10 buckets: a SELECT with a WHERE clause does not
>>> actually prune the input buckets, so many queries scan all the buckets.
>>>
>>>
>>> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>
>>>> Will bucketing help, if you know there is a finite number of partitions?
>>>>
>>>>
>>>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I have a table that currently uses RC files and has two levels of
>>>>> partitions: day and source. The table is first partitioned by day, then
>>>>> within each day there are 6-15 source partitions. This makes for a crazy
>>>>> number of partitions, and I was wondering if there'd be a way to optimize
>>>>> this with ORC files and some sorting.
>>>>>
>>>>> Specifically, would there be a way in a new table to make source a
>>>>> field (removing the partition) and somehow, as I am inserting into this
>>>>> new setup, sort by source so that the files/indexes are separated in a
>>>>> way that gives me almost the same performance as ORC with the two-level
>>>>> partitions? Just trying to optimize here and curious what people think.
>>>>>
>>>>> John
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>