Re: Optimizing ORC Sorting - Replace two level Partitions with one?
John Omernik 2013-08-10, 16:58
One issue with the bucketing is that the number of sources on any given day
is dynamic. On some days it's 4, others it's 14 and it's also constantly
changing. I am hoping to use some of the features of the ORC files to
almost make virtual partitions, but apparently I am going to run into
issues either way.
On another note, is there a limit on the number of partitions in Hive? I am
hovering around 10k partitions on one table right now. It's still working,
but some metadata operations can take a long time. I am guessing the
sub-partitions are going to hurt me here going forward, so it may be worth
flattening out to only days, even at the expense of read queries... thoughts?
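For a rough sense of scale, here is a back-of-the-envelope sketch in Python of what flattening would do to the partition count. The average of 10 sources per day is an assumption picked from the 4 to 14 range above, and `partition_count` is only an illustrative helper, not anything from Hive.

```python
def partition_count(days, sources_per_day=1):
    """Partitions in a table partitioned by day, optionally sub-partitioned by source."""
    return days * sources_per_day

current_partitions = 10_000   # "around 10k partitions" from the message above
avg_sources_per_day = 10      # ASSUMPTION: a midpoint of the 4 to 14 range

# Roughly how many days of data the current table holds under that assumption.
days_loaded = current_partitions // avg_sources_per_day

# Partition count if the same data were partitioned by day only.
flattened_partitions = partition_count(days_loaded)

# New partitions added per year under each layout.
yearly_two_level = partition_count(365, avg_sources_per_day)
yearly_day_only = partition_count(365)

print(days_loaded, flattened_partitions, yearly_two_level, yearly_day_only)
```

Under that assumption the metastore would track about a tenth as many partitions, at the cost of losing directory-level pruning on source.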
On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
> Agree with Edward.
> For me, the whole purpose of bucketing is to prune the data in the WHERE
> clause. Otherwise it defeats the purpose of splitting data into a finite
> number of identifiable distributions to improve performance.
> But is my understanding correct that bucketing does help limit the number
> of sub-partitions we create at the bottom of the table, provided we can
> establish that the number of distinct values in that partition does not
> exceed a finite bound? (Even if it crosses this limit, bucketing takes care
> of it up to some volume.)
> On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>> So there is one thing to be really careful about with bucketing. Say you
>> bucket a table into 10 buckets: a SELECT with a WHERE clause does not
>> actually prune the input buckets, so many queries scan all the buckets.
>> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>> Will bucketing help, if you know there is a finite number of partitions?
>>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>>>> I have a table that currently uses RC files and has two levels of
>>>> partitions: day and source. The table is first partitioned by day, then
>>>> within each day there are 6-15 source partitions. This makes for a lot of
>>>> partitions, and I was wondering if there'd be a way to optimize this
>>>> with ORC files and some sorting.
>>>> Specifically, would there be a way in a new table to make source a
>>>> field (removing the partition) and somehow, as I am inserting into this
>>>> new setup, sort by source in a way that separates the files/indexes
>>>> enough to give me almost the same performance as ORC with the two-level
>>>> partitions? Just trying to optimize here and curious what people think.
>>> Nitin Pawar
> Nitin Pawar
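The idea in the original question (make source a regular column and sort by it on insert) relies on ORC keeping min/max statistics per stripe and row group, which a reader can use to skip groups that cannot match a predicate. The sketch below is a plain-Python simulation of that effect, not ORC or Hive code; the group size of 1,000 rows, the ten source names, and both helper functions are hypothetical.

```python
import random

def row_groups(rows, group_size):
    """Split rows into consecutive fixed-size groups (stand-ins for ORC row groups)."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]

def groups_to_read(groups, wanted):
    """Count groups whose [min, max] range of source values could contain `wanted`."""
    return sum(1 for g in groups if min(g) <= wanted <= max(g))

random.seed(0)
sources = [f"src{i:02d}" for i in range(10)]             # hypothetical source names
rows = [random.choice(sources) for _ in range(100_000)]  # one day's source column

# Same data, written in arrival order versus sorted by source before writing.
unsorted_groups = row_groups(rows, 1_000)
sorted_groups = row_groups(sorted(rows), 1_000)

unsorted_hits = groups_to_read(unsorted_groups, "src03")
sorted_hits = groups_to_read(sorted_groups, "src03")

print(f"unsorted: {unsorted_hits} of {len(unsorted_groups)} groups must be read")
print(f"sorted:   {sorted_hits} of {len(sorted_groups)} groups must be read")
```

In the unsorted case nearly every group spans the full range of sources, so min/max statistics rule out almost nothing; after sorting, each source occupies a contiguous run of groups. Whether a real query benefits depends on the reader actually pushing the predicate down to the file, which this sketch does not model.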