Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Duplicate rows when using group by in subquery


Copy link to this message
-
Re: Duplicate rows when using group by in subquery
Yin Huai 2013-09-19, 18:04
Maybe you were stilling using the cli which was pointing to hive 0.11 libs.
 After you build trunk (https://github.com/apache/hive.git), you need to
use <trunk-dir>/build/dist as your hive home and
use <trunk-dir>/build/dist/bin/hive to launch hive cli. You can find
hive 0.13 libs in <trunk-dir>/build/dist/lib

btw, seems trunk has an issue today. You can try hive 0.12 branch.
On Thu, Sep 19, 2013 at 4:26 AM, Mikael Öhman <[EMAIL PROTECTED]> wrote:

> Hello again.
>
> I have now checked out latest code from trunk and built as per
> instructions.
>
> However, this query:
>
> select a.Symbol, count(*)
> from (select Symbol, catid from cat group by Symbol, catid) a
> group by a.Symbol;
>
> still returns an incorrect number of rows for table:
>
>
> create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE
> int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int,
> ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string,
> SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID
> int) row format delimited fields terminated by ',' stored as ORC;
>
> Here is the result of EXPLAIN:
>
> hive> EXPLAIN select a.Symbol, count(*)
>     > from (select Symbol, catid from cat group by Symbol, catid) a
>     > group by a.Symbol;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF
> (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR
> (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol)
> (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
> TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol))
> (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL
> a) Symbol))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a:cat
>           TableScan
>             alias: cat
>             Select Operator
>               expressions:
>                     expr: symbol
>                     type: string
>                     expr: catid
>                     type: bigint
>               outputColumnNames: symbol, catid
>               Group By Operator
>                 bucketGroup: false
>                 keys:
>                       expr: symbol
>                       type: string
>                       expr: catid
>                       type: bigint
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: bigint
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: bigint
>                   tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>             outputColumnNames: _col0
>             Group By Operator
>               aggregations:
>                     expr: count()
>               bucketGroup: false
>               keys:
>                     expr: _col0
>                     type: string
>               mode: complete
>               outputColumnNames: _col0, _col1
>               Select Operator
>                 expressions:
>                       expr: _col0