ReduceSinkDeduplication automatically kicked in because it is enabled by
default. The original plan tries to shuffle your data twice. Then,
ReduceSinkDeduplication finds that the original plan can be optimized to
shuffle your data once. But, when picking the partitioning columns, this
optimizer picked the wrong columns because of the bug.
Also, you can try your query with and without ReduceSinkDeduplication (use
"set hive.optimize.reducededuplication=false;" to turn this optimization
off), and see the performance. If the cardinality of "a.Symbol" limits the
degree of parallelism, two MapReduce jobs may be faster.
On Tue, Sep 17, 2013 at 2:24 AM, Mikael Öhman <[EMAIL PROTECTED]> wrote:
> Thank you for the information. Just to be clear, it is not that I have
> manually restricted the job to run using only a single mapreduce job, but
> it incorrectly assumes one job is enough?
> I will get back with results from your suggestions ASAP; unfortunately I
> don't have the machines available until Thursday.
> / Sincerely Mikael
> *Från:* Yin Huai <[EMAIL PROTECTED]>
> *Till:* [EMAIL PROTECTED]; Mikael Öhman <[EMAIL PROTECTED]>
> *Skickat:* måndag, 16 september 2013 19:52
> *Ämne:* Re: Duplicate rows when using group by in subquery
> Hello Mikael,
> Seems your case is related to the bug reported in
> https://issues.apache.org/jira/browse/HIVE-5149. Basically, when hive
> uses a single MapReduce job to evaluate your query, "c.Symbol" and
> "c.catid" are used to partitioning data, and thus, rows with the same value
> of "c.Symbol" are not correctly grouped. If your case, only "c.Symbol"
> should be used if we want to use a single MapReduce job to evaluate this
> query. Can you check the query plan (results of "EXPLAIN") and see if both
> "c.Symbol" and "c.catid" appear in partitioning columns? Or, you can also
> attach your query plan.
> This bug have been fixed in trunk. Can you test your query in trunk and
> see if the result is correct. If you are using hive 0.11, you can also
> apply the 0.11 patch (
> On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <[EMAIL PROTECTED]>wrote:
> This is basically the same question I posted on stackoverflow:
> I know the query is a bit noisy. But this query also demonstrates the
> select a.symbol from (select symbol, ordertype from cat group by symbol,
> ordertype) a group by a.symbol;
> Now, this query may not make much sense but in my case, because I have 24
> symbols, I expect a result of 24 rows. But I get 48 rows back. A similar
> select a.Symbol,count(*) from (select c.Symbol,c.catid from cat as c group
> by c.Symbol,c.catid) a group by a.Symbol;
> returns 864 rows, where I still expect 24 rows... If there are
> alternatives as to how to write the original query in my SO post I would
> much appreciate hearing them. The examples given in this mail have just
> been provided to demonstrate the problem using easier to understand queries
> and I don't need advice on them.
> The .csv data and example is from a toy example. My real setup is 6 nodes,
> and the table definition is:
> create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE
> int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int,
> ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string,
> SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID
> int) row format delimited fields terminated by ',' stored as ORC;
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=1000;
> Thank you so much for any input.
> /Sincerely Mikael