Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> max() returns NULL on partition column

Copy link to this message
Re: max() returns NULL on partition column
Thanks for confirming that I wasn't imagining this. Turns out this was a
known bug (HIVE-2955 https://issues.apache.org/jira/browse/HIVE-2955 ) and
is already fixed in version 0.10.0


On 4/22/13 11:25 AM, "Mike Liddell" <[EMAIL PROTECTED]> wrote:

>I was interested so tried a quick repro.  Yes I see the behavior too.
>Environment: hive-0.9.0 on windows.
>Hive> CREATE TABLE ptable(val int) PARTITIONED BY(year int, month int,
>day int);
>Hive> LOAD DATA LOCAL INPATH d:/data/2013-01-01.data.txt INTO TABLE
>ptable PARTITION(2013,1,1);
>// the txt file has separate lines with values 1,2,3,4.
>Hive> select max(day) from ptable;
>Hive> select avg(day) from ptable;
>Other normal queries work fine. I didn't try other aggregations.
>-----Original Message-----
>From: Robin Morris [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, April 18, 2013 1:15 PM
>Subject: max() returns NULL on partition column
>partitioned_table is partitioned on year, month, day.
>A query of the form
>> select max(day) from partitioned_table where year=2013 and month=4;
>spins up zero mappers, one reducer, and returns NULL
>Hadoop job information for Stage-1: number of mappers: 0; number of
>reducers: 1
>2013-04-18 12:35:57,842 Stage-1 map = 0%,  reduce = 0%
>2013-04-18 12:36:00,855 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
>1.36 sec
>2013-04-18 12:36:01,860 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
>1.36 sec
>2013-04-18 12:36:02,865 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
>1.36 sec
>MapReduce Total cumulative CPU time: 1 seconds 360 msec Ended Job >job_201302011159_205031 MapReduce Jobs Launched:
>Job 0: Reduce: 1   Accumulative CPU: 1.36 sec   HDFS Read: 0 HDFS Write: 3
>Total MapReduce CPU Time Spent: 1 seconds 360 msec OK NULL Time taken:
>8.351 seconds
>Trying to work round it by doing something like select max(day) from
>(select day from partitioned_table where year=2013 and month=4)foo;
>doesn't work either, I'm guessing because the optimizer pushes down the
>max() into the subquery, resulting in the same query as above.
>Using an explicit intermediate table does work
>> create table foo_max as select day from partitioned_table where
>>year=2013 and month=4;  select max(day) from foo_max; drop table
>Several map-reduce jobs later, the correct answer (of 18) is given.
>distinct() has similar behavior, except that rather than returning NULL,
>it returns nothing.
>min() returns NULL
>avg() works!
>If someone else can reproduce this, I'll submit a bug.