Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> hive query doesn't seem to limit itself to partitions based on the WHERE clause


Copy link to this message
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
Thanks for the response, Edward.

The source table (hourly_fact) is partitioned on dt (date) and hr (hour),
and I've confirmed that they are both String fields (CREATE stmt is below).

The hourly_fact table contains 'number of requests' for each hour by a few
dimensions.  The query is just trying to get a daily aggregation across
those same dimensions.  The only trick is that the hourly_fact table has dt
and hour in UTC time.  And the daily aggregation is being done for a PST
(pacific std) day, hence the 7 hour offset.

CREATE TABLE IF NOT EXISTS hourly_fact (
  tagtype               STRING,
  country               STRING,
  company               INT,
  request_keyword       STRING,
  receiver_code         STRING,
  referrer_domain       STRING,
  num_requests          INT,
  num_new_user_requests INT
)
PARTITIONED BY (dt STRING, hr STRING)
ROW FORMAT DELIMITED
STORED AS SEQUENCEFILE
LOCATION "...";

Marc

On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:

> On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote:
> > Hi Namit,
> >
> > Hourly_fact is partitioned on dt and hr.
> >
> > Marc
> >
> > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote:
> >> What is your table hourly_fact partitioned on ?
> >>
> >> ________________________________________
> >> From: Marc Limotte [[EMAIL PROTECTED]]
> >> Sent: Friday, October 01, 2010 2:10 PM
> >> To: [EMAIL PROTECTED]
> >> Subject: hive query doesn't seem to limit itself to partitions based on
> >> the WHERE clause
> >>
> >> Hi,
> >>
> >> From looking at the hive log output, it seems that my job is accessing
> >> many more partitions than it needs to? For example, my query is
> something
> >> like:
> >>
> >> INSERT OVERWRITE TABLE daily_fact
> >> PARTITION (dt='2010-09-29')
> >> SELECT
> >> 20100929 as stamp,
> >> tagtype,
> >> country,
> >> sum(num_requests) AS num_requests
> >> FROM
> >> hourly_fact HF
> >> WHERE
> >> (HF.dt = '2010-09-29' AND HF.hr > '07' )
> >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
> >> GROUP BY
> >> 20100929, tagtype, country
> >>
> >> Based on the WHERE clause, I would expect it to look only at partitions
> in
> >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the
> log
> >> contains entries like:
> >>
> >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
> >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
> >>
> >> And many other hours outside my WHERE constraint. I assume this means
> that
> >> it's processing those directories. The answer still comes out right, but
> I'm
> >> concerned about the performance.
> >>
> >> Would appreciate some help understanding what this means and how to fix
> >> it.
> >>
> >> Thanks,
> >> Marc
> >>
> >>
> >
> Possibly you defined HF.hr <= '07'  as an int column and comparing it
> as a string is resulting in a full table scan. Can you explain the
> query?
>