Re: Duplicate rows when using group by in subquery
Maybe you were still using a CLI that was pointing to the Hive 0.11 libs.
After you build trunk (https://github.com/apache/hive.git), you need to
use <trunk-dir>/build/dist as your Hive home and
use <trunk-dir>/build/dist/bin/hive to launch the Hive CLI. You can find the
Hive 0.13 libs in <trunk-dir>/build/dist/lib.
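
For reference, a rough sketch of those steps (assuming the ant-based build that
trunk used at the time; replace <trunk-dir> with wherever you cloned the repo):

    git clone https://github.com/apache/hive.git <trunk-dir>
    cd <trunk-dir>
    ant clean package                        # builds the distribution into build/dist
    export HIVE_HOME=<trunk-dir>/build/dist
    <trunk-dir>/build/dist/bin/hive          # CLI now runs against the freshly built 0.13 libs

If the session still picks up the 0.11 jars, check that an older hive script is
not earlier on your PATH.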

BTW, trunk seems to have an issue today, so you can try the Hive 0.12 branch instead.
On Thu, Sep 19, 2013 at 4:26 AM, Mikael Öhman <[EMAIL PROTECTED]> wrote:

> Hello again.
>
> I have now checked out the latest code from trunk and built it as per the
> instructions.
>
> However, this query:
>
> select a.Symbol, count(*)
> from (select Symbol, catid from cat group by Symbol, catid) a
> group by a.Symbol;
>
> still returns an incorrect number of rows for the table:
>
>
> create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE
> int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int,
> ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string,
> SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID
> int) row format delimited fields terminated by ',' stored as ORC;
>
> Here is the result of EXPLAIN:
>
> hive> EXPLAIN select a.Symbol, count(*)
>     > from (select Symbol, catid from cat group by Symbol, catid) a
>     > group by a.Symbol;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF
> (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR
> (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol)
> (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
> TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol))
> (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL
> a) Symbol))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a:cat
>           TableScan
>             alias: cat
>             Select Operator
>               expressions:
>                     expr: symbol
>                     type: string
>                     expr: catid
>                     type: bigint
>               outputColumnNames: symbol, catid
>               Group By Operator
>                 bucketGroup: false
>                 keys:
>                       expr: symbol
>                       type: string
>                       expr: catid
>                       type: bigint
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: bigint
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: bigint
>                   tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>             outputColumnNames: _col0
>             Group By Operator
>               aggregations:
>                     expr: count()
>               bucketGroup: false
>               keys:
>                     expr: _col0
>                     type: string
>               mode: complete
>               outputColumnNames: _col0, _col1
>               Select Operator
>                 expressions:
>                       expr: _col0
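
Since the inner query already collapses the data to one row per (Symbol, catid)
pair, the outer count(*) per Symbol should simply equal the number of distinct
catid values for that Symbol. A quick cross-check against the freshly built CLI
(a sketch only, assuming catid is never NULL; adjust the path to your checkout):

    <trunk-dir>/build/dist/bin/hive -e "select Symbol, count(distinct catid) from cat group by Symbol;"

If the subquery version returns more rows per Symbol, or different counts, than
this does, that points at the outer aggregation rather than at the data.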