Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Using Cluster by to improve Group by Performance


Copy link to this message
-
Re: Using Cluster by to improve Group by Performance
>From perspective of RS, two query is just different in hash code of RS key.
The cost of calculating hash of col3 and col4 might be negligible, I think.

2013/11/1 KayVajj <[EMAIL PROTECTED]>

> Any response or pointers to understand how Cluster By in sub queries can
> affect the performance/speed of outer queries is helpful.
>
> Thanks
> Kay
>
>
> On Mon, Oct 28, 2013 at 1:17 PM, KayVajj <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have a question if I could use the cluster by clause in a sub query to
>> improve the performance of a group by query in hive
>>
>> Lets I have a Table A with columns (all strings) col1..col5 and the table
>> is not "Clustered"
>>
>> now I 'm trying to run the below query
>>
>> select
>>> col1,
>>> col2,
>>> col3,
>>> col4,
>>> concat_ws(',', collect_set(col5))
>>> from A
>>> group by
>>> col1,
>>> col2,
>>> col3,
>>> col4
>>
>>
>>
>> Would the below query optimize the above query and if not what is the
>> best practice to optimize this query. Assuming only col1 & col2 are the
>> uniquely identifying columns
>>
>>
>>
>>
>> select
>>> ct.col1,
>>> ct.col2,
>>> ct.col3,
>>> ct.col4,
>>> concat_ws(',', collect_set(ct.col5))
>>> from
>>> (
>>> select
>>> col1,
>>> col2,
>>> col3,
>>> col4,
>>> col5
>>> from A
>>> cluster by col1, col2
>>> ) ct
>>> group by
>>> ct.col1,
>>> ct.col2,
>>> ct.col3,
>>> ct.col4.
>>
>>
>> Thanks for your responses.
>>
>>
>