Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> hive query doesn't seem to limit itself to partitions based on the WHERE clause


+
Marc Limotte 2010-10-01, 21:10
+
Namit Jain 2010-10-04, 05:00
+
Marc Limotte 2010-10-05, 19:36
+
Edward Capriolo 2010-10-05, 23:30
Copy link to this message
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
Thanks for the response, Edward.

The source table (hourly_fact) is partitioned on dt (date) and hr (hour),
and I've confirmed that they are both String fields (CREATE stmt is below).

The hourly_fact table contains 'number of requests' for each hour by a few
dimensions.  The query is just trying to get a daily aggregation across
those same dimensions.  The only trick is that the hourly_fact table has dt
and hour in UTC time.  And the daily aggregation is being done for a PST
(pacific std) day, hence the 7 hour offset.

CREATE TABLE IF NOT EXISTS hourly_fact (
  tagtype               STRING,
  country               STRING,
  company               INT,
  request_keyword       STRING,
  receiver_code         STRING,
  referrer_domain       STRING,
  num_requests          INT,
  num_new_user_requests INT
)
PARTITIONED BY (dt STRING, hr STRING)
ROW FORMAT DELIMITED
STORED AS SEQUENCEFILE
LOCATION "...";

Marc

On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:

> On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote:
> > Hi Namit,
> >
> > Hourly_fact is partitioned on dt and hr.
> >
> > Marc
> >
> > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote:
> >> What is your table hourly_fact partitioned on ?
> >>
> >> ________________________________________
> >> From: Marc Limotte [[EMAIL PROTECTED]]
> >> Sent: Friday, October 01, 2010 2:10 PM
> >> To: [EMAIL PROTECTED]
> >> Subject: hive query doesn't seem to limit itself to partitions based on
> >> the WHERE clause
> >>
> >> Hi,
> >>
> >> From looking at the hive log output, it seems that my job is accessing
> >> many more partitions than it needs to? For example, my query is
> something
> >> like:
> >>
> >> INSERT OVERWRITE TABLE daily_fact
> >> PARTITION (dt='2010-09-29')
> >> SELECT
> >> 20100929 as stamp,
> >> tagtype,
> >> country,
> >> sum(num_requests) AS num_requests
> >> FROM
> >> hourly_fact HF
> >> WHERE
> >> (HF.dt = '2010-09-29' AND HF.hr > '07' )
> >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
> >> GROUP BY
> >> 20100929, tagtype, country
> >>
> >> Based on the WHERE clause, I would expect it to look only at partitions
> in
> >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the
> log
> >> contains entries like:
> >>
> >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
> >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
> >>
> >> And many other hours outside my WHERE constraint. I assume this means
> that
> >> it's processing those directories. The answer still comes out right, but
> I'm
> >> concerned about the performance.
> >>
> >> Would appreciate some help understanding what this means and how to fix
> >> it.
> >>
> >> Thanks,
> >> Marc
> >>
> >>
> >
> Possibly you defined HF.hr <= '07'  as an int column and comparing it
> as a string is resulting in a full table scan. Can you explain the
> query?
>
+
Steven Wong 2010-10-07, 00:05
+
Edward Capriolo 2010-10-07, 02:56
+
Marc Limotte 2010-10-07, 23:23
+
Steven Wong 2010-10-08, 20:39
+
Marc Limotte 2010-10-07, 23:12
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB