Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> hive query doesn't seem to limit itself to partitions based on the WHERE clause


Copy link to this message
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
Thanks, I'm going to try explain extended and see what I get.

I'm using HF.hr > '07' and HF.hr <= '07', because I actually want P*S*T (in
order to avoid the spring forward / fallback problem you mentioned).

Marc

On Wed, Oct 6, 2010 at 5:05 PM, Steven Wong <[EMAIL PROTECTED]> wrote:

>  What Hive version are you running? Try an “explain extended” on your
> insert query and see if unneeded partitions are included.
>
>
>
> Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT)
> is UTC-07:00. To convert UTC to PDT, the condition should be:
>
> (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND
> HF.hr < '07' )
>
> instead of:
>
> (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr
> <= '07' )
>
>
>
> Good luck on the days we spring forward or fall back. J/L
>
>
>
>
>
> *From:* Marc Limotte [mailto:[EMAIL PROTECTED]]
> *Sent:* Wednesday, October 06, 2010 11:12 AM
>
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: RE: hive query doesn't seem to limit itself to partitions
> based on the WHERE clause
>
>
>
> Thanks for the response, Edward.
>
> The source table (hourly_fact) is partitioned on dt (date) and hr (hour),
> and I've confirmed that they are both String fields (CREATE stmt is below).
>
> The hourly_fact table contains 'number of requests' for each hour by a few
> dimensions.  The query is just trying to get a daily aggregation across
> those same dimensions.  The only trick is that the hourly_fact table has dt
> and hour in UTC time.  And the daily aggregation is being done for a PST
> (pacific std) day, hence the 7 hour offset.
>
> CREATE TABLE IF NOT EXISTS hourly_fact (
>   tagtype               STRING,
>   country               STRING,
>   company               INT,
>   request_keyword       STRING,
>   receiver_code         STRING,
>   referrer_domain       STRING,
>   num_requests          INT,
>   num_new_user_requests INT
> )
> PARTITIONED BY (dt STRING, hr STRING)
> ROW FORMAT DELIMITED
> STORED AS SEQUENCEFILE
> LOCATION "...";
>
>
> Marc
>
> On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]>
> wrote:
>
> On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote:
> > Hi Namit,
> >
> > Hourly_fact is partitioned on dt and hr.
> >
> > Marc
> >
> > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote:
> >> What is your table hourly_fact partitioned on ?
> >>
> >> ________________________________________
> >> From: Marc Limotte [[EMAIL PROTECTED]]
> >> Sent: Friday, October 01, 2010 2:10 PM
> >> To: [EMAIL PROTECTED]
> >> Subject: hive query doesn't seem to limit itself to partitions based on
> >> the WHERE clause
> >>
> >> Hi,
> >>
> >> From looking at the hive log output, it seems that my job is accessing
> >> many more partitions than it needs to? For example, my query is
> something
> >> like:
> >>
> >> INSERT OVERWRITE TABLE daily_fact
> >> PARTITION (dt='2010-09-29')
> >> SELECT
> >> 20100929 as stamp,
> >> tagtype,
> >> country,
> >> sum(num_requests) AS num_requests
> >> FROM
> >> hourly_fact HF
> >> WHERE
> >> (HF.dt = '2010-09-29' AND HF.hr > '07' )
> >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
> >> GROUP BY
> >> 20100929, tagtype, country
> >>
> >> Based on the WHERE clause, I would expect it to look only at partitions
> in
> >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the
> log
> >> contains entries like:
> >>
> >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
> >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
> >>
> >> And many other hours outside my WHERE constraint. I assume this means
> that
> >> it's processing those directories. The answer still comes out right, but
> I'm
> >> concerned about the performance.
> >>
> >> Would appreciate some help understanding what this means and how to fix
> >> it.
> >>
> >> Thanks,
> >> Marc
> >>
> >>
> >
>
> Possibly you defined HF.hr <= '07'  as an int column and comparing it