Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Duplicate rows when using group by in subquery


Copy link to this message
-
Re: Duplicate rows when using group by in subquery
Yin Huai 2013-09-16, 17:52
Hello Mikael,

Seems your case is related to the bug reported in
https://issues.apache.org/jira/browse/HIVE-5149. Basically, when hive uses
a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are
used to partitioning data, and thus, rows with the same value of "c.Symbol"
are not correctly grouped. If your case, only "c.Symbol" should be used if
we want to use a single MapReduce job to evaluate this query. Can you check
the query plan (results of "EXPLAIN") and see if both "c.Symbol" and
"c.catid" appear in partitioning columns? Or, you can also attach your
query plan.

This bug have been fixed in trunk. Can you test your query in trunk and see
if the result is correct. If you are using hive 0.11, you can also apply
the 0.11 patch (
https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch
).

Thanks,

Yin
On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <[EMAIL PROTECTED]> wrote:

> Hello.
>
> This is basically the same question I posted on stackoverflow:
> http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115
>
> I know the query is a bit noisy. But this query also demonstrates the
> error:
>
> select a.symbol from (select symbol, ordertype from cat group by symbol,
> ordertype) a group by a.symbol;
>
> Now, this query may not make much sense but in my case, because I have 24
> symbols, I expect a result of 24 rows. But I get 48 rows back. A similar
> query:
>
> select a.Symbol,count(*) from (select c.Symbol,c.catid from cat as c group
> by c.Symbol,c.catid) a group by a.Symbol;
>
> returns 864 rows, where I still expect 24 rows... If there are
> alternatives as to how to write the original query in my SO post I would
> much appreciate hearing them. The examples given in this mail have just
> been provided to demonstrate the problem using easier to understand queries
> and I don't need advice on them.
>
> The .csv data and example is from a toy example. My real setup is 6 nodes,
> and the table definition is:
>
> create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE
> int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int,
> ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string,
> SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID
> int) row format delimited fields terminated by ',' stored as ORC;
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=1000;
>
> insert...
>
> Thank you so much for any input.
>
> /Sincerely Mikael
>