Re: Optimizing ORC Sorting - Replace two level Partitions with one?
Nitin Pawar 2013-08-10, 16:46
I agree with Edward.
For me, the whole purpose of bucketing is to prune the data read for a
WHERE clause. Without that, it defeats the purpose of splitting data into a
finite number of identifiable distributions to improve performance.
But is my understanding correct that bucketing does help limit the number
of sub-partitions we create at the bottom of the table, provided we can
establish that the values in that partition column do not exceed a finite
number? (Even if the values cross this limit, bucketing still copes with it
up to some volume.)
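
To make the pruning point concrete, here is a minimal Python sketch (not Hive code) of the difference between an engine that prunes buckets for an equality filter and one that scans them all. The hash is a stand-in 31-based string hash, and the names bucket_for and buckets_to_scan and the source value "web" are hypothetical; Hive's actual hashing depends on the column type and version.

```python
def stand_in_hash(value: str) -> int:
    """Stand-in 31-based rolling hash over UTF-8 bytes (an assumption, not Hive's exact function)."""
    h = 0
    for b in value.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket: clear the sign bit, then take the modulo."""
    return (stand_in_hash(value) & 0x7FFFFFFF) % num_buckets


def buckets_to_scan(filter_value: str, num_buckets: int, engine_prunes: bool) -> list:
    """Buckets read for `WHERE source = filter_value`.

    With pruning, only the bucket the value hashes to is read.
    Without pruning (the behaviour described in the quoted reply below),
    every bucket is read even though only one can contain matching rows.
    """
    if engine_prunes:
        return [bucket_for(filter_value, num_buckets)]
    return list(range(num_buckets))


if __name__ == "__main__":
    print("with pruning:   ", buckets_to_scan("web", 10, engine_prunes=True))
    print("without pruning:", buckets_to_scan("web", 10, engine_prunes=False))
```

The same value always lands in the same bucket, which is what would make pruning possible in principle; whether a given engine version actually uses that at read time is a separate question.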
On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> So there is one thing to be really careful about with bucketing. Say you
> bucket a table into 10 buckets: a SELECT with a WHERE clause does not
> actually prune the input buckets, so many queries scan all the buckets.
> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>> Will bucketing help, if you know there is a finite number of partitions?
>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>>> I have a table that currently uses RC files and has two levels of
>>> partitions. day and source. The table is first partitioned by day, then
>>> within each day there are 6-15 source partitions. This makes for a lot of
>>> crazy partitions, and I was wondering if there'd be a way to optimize this
>>> with ORC files and some sorting.
>>> Specifically, would there be a way in a new table to make source a field
>>> (removing the partition) and somehow, as I am inserting into this new setup,
>>> sort by source in such a way that will help separate the files/indexes in a
>>> way that gives me almost the same performance as ORC with the two level
>>> partitions? Just trying to optimize here and curious what people think.
>> Nitin Pawar