partitioned_table is partitioned on year, month, day.
A query of the form
> select max(day) from partitioned_table where year=2013 and month=4;
spins up zero mappers, one reducer, and returns NULL
Hadoop job information for Stage-1: number of mappers: 0; number of
2013-04-18 12:35:57,842 Stage-1 map = 0%, reduce = 0%
2013-04-18 12:36:00,855 Stage-1 map = 0%, reduce = 100%, Cumulative CPU
2013-04-18 12:36:01,860 Stage-1 map = 0%, reduce = 100%, Cumulative CPU
2013-04-18 12:36:02,865 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_201302011159_205031
MapReduce Jobs Launched:
Job 0: Reduce: 1 Accumulative CPU: 1.36 sec HDFS Read: 0 HDFS Write: 3
Total MapReduce CPU Time Spent: 1 seconds 360 msec
Time taken: 8.351 seconds
Trying to work round it by doing something like
select max(day) from (select day from partitioned_table where year=2013
doesn't work either, I'm guessing because the optimizer pushes down the
max() into the subquery, resulting in the same query as above.
Using an explicit intermediate table does work
> create table foo_max as select day from partitioned_table where
>year=2013 and month=4; select max(day) from foo_max; drop table foo_max;
Several map-reduce jobs later, the correct answer (of 18) is given.
distinct() has similar behavior, except that rather than returning NULL,
it returns nothing.
min() returns NULL
If someone else can reproduce this, I'll submit a bug.