Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - max() returns NULL on partition column


Copy link to this message
-
max() returns NULL on partition column
Robin Morris 2013-04-18, 20:14
partitioned_table is partitioned on year, month, day.

A query of the form
> select max(day) from partitioned_table where year=2013 and month=4;
spins up zero mappers, one reducer, and returns NULL

Hadoop job information for Stage-1: number of mappers: 0; number of
reducers: 1
2013-04-18 12:35:57,842 Stage-1 map = 0%,  reduce = 0%
2013-04-18 12:36:00,855 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
1.36 sec
2013-04-18 12:36:01,860 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
1.36 sec
2013-04-18 12:36:02,865 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_201302011159_205031
MapReduce Jobs Launched:
Job 0: Reduce: 1   Accumulative CPU: 1.36 sec   HDFS Read: 0 HDFS Write: 3
SUCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
NULL
Time taken: 8.351 seconds

Trying to work round it by doing something like
select max(day) from (select day from partitioned_table where year=2013
and month=4)foo;
doesn't work either, I'm guessing because the optimizer pushes down the
max() into the subquery, resulting in the same query as above.
Using an explicit intermediate table does work
> create table foo_max as select day from partitioned_table where
>year=2013 and month=4;  select max(day) from foo_max; drop table foo_max;
Several map-reduce jobs later, the correct answer (of 18) is given.
distinct() has similar behavior, except that rather than returning NULL,
it returns nothing.
min() returns NULL
avg() works!
If someone else can reproduce this, I'll submit a bug.

Robin
+
Mike Liddell 2013-04-22, 18:25
+
Robin Morris 2013-04-22, 22:54