Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - hive query doesn't seem to limit itself to partitions based on the WHERE clause


Copy link to this message
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
Marc Limotte 2010-10-05, 19:36
Hi Namit,

Hourly_fact is partitioned on dt and hr.

Marc
On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote:
> What is your table hourly_fact partitioned on ?
>
> ________________________________________
> From: Marc Limotte [[EMAIL PROTECTED]]
> Sent: Friday, October 01, 2010 2:10 PM
> To: [EMAIL PROTECTED]
> Subject: hive query doesn't seem to limit itself to partitions based on
the WHERE clause
>
> Hi,
>
> From looking at the hive log output, it seems that my job is accessing
many more partitions than it needs to? For example, my query is something
like:
>
> INSERT OVERWRITE TABLE daily_fact
> PARTITION (dt='2010-09-29')
> SELECT
> 20100929 as stamp,
> tagtype,
> country,
> sum(num_requests) AS num_requests
> FROM
> hourly_fact HF
> WHERE
> (HF.dt = '2010-09-29' AND HF.hr > '07' )
> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
> GROUP BY
> 20100929, tagtype, country
>
> Based on the WHERE clause, I would expect it to look only at partitions in
the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log
contains entries like:
>
> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>
> And many other hours outside my WHERE constraint. I assume this means that
it's processing those directories. The answer still comes out right, but I'm
concerned about the performance.
>
> Would appreciate some help understanding what this means and how to fix
it.
>
> Thanks,
> Marc
>
>