|
Marc Limotte
2010-10-01, 21:10
Namit Jain
2010-10-04, 05:00
Marc Limotte
2010-10-05, 19:36
Edward Capriolo
2010-10-05, 23:30
Marc Limotte
2010-10-06, 18:11
Steven Wong
2010-10-07, 00:05
Edward Capriolo
2010-10-07, 02:56
Marc Limotte
2010-10-07, 23:12
Marc Limotte
2010-10-07, 23:23
Steven Wong
2010-10-08, 20:39
|
-
hive query doesn't seem to limit itself to partitions based on the WHERE clauseMarc Limotte 2010-10-01, 21:10
Hi,
>From looking at the hive log output, it seems that my job is accessing many more partitions than it needs to? For example, my query is something like: INSERT OVERWRITE TABLE daily_fact PARTITION (dt='2010-09-29') SELECT 20100929 as stamp, tagtype, country, sum(num_requests) AS num_requests FROM hourly_fact HF WHERE (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) GROUP BY 20100929, tagtype, country Based on the WHERE clause, I would expect it to look only at partitions in the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log contains entries like: 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > And many other hours outside my WHERE constraint. I assume this means that it's processing those directories. The answer still comes out right, but I'm concerned about the performance. Would appreciate some help understanding what this means and how to fix it. Thanks, Marc
-
RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseNamit Jain 2010-10-04, 05:00
What is your table hourly_fact partitioned on ?
________________________________________ From: Marc Limotte [[EMAIL PROTECTED]] Sent: Friday, October 01, 2010 2:10 PM To: [EMAIL PROTECTED] Subject: hive query doesn't seem to limit itself to partitions based on the WHERE clause Hi, >From looking at the hive log output, it seems that my job is accessing many more partitions than it needs to? For example, my query is something like: INSERT OVERWRITE TABLE daily_fact PARTITION (dt='2010-09-29') SELECT 20100929 as stamp, tagtype, country, sum(num_requests) AS num_requests FROM hourly_fact HF WHERE (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) GROUP BY 20100929, tagtype, country Based on the WHERE clause, I would expect it to look only at partitions in the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log contains entries like: 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 And many other hours outside my WHERE constraint. I assume this means that it's processing those directories. The answer still comes out right, but I'm concerned about the performance. Would appreciate some help understanding what this means and how to fix it. Thanks, Marc
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseMarc Limotte 2010-10-05, 19:36
Hi Namit,
Hourly_fact is partitioned on dt and hr. Marc On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote: > What is your table hourly_fact partitioned on ? > > ________________________________________ > From: Marc Limotte [[EMAIL PROTECTED]] > Sent: Friday, October 01, 2010 2:10 PM > To: [EMAIL PROTECTED] > Subject: hive query doesn't seem to limit itself to partitions based on the WHERE clause > > Hi, > > From looking at the hive log output, it seems that my job is accessing many more partitions than it needs to? For example, my query is something like: > > INSERT OVERWRITE TABLE daily_fact > PARTITION (dt='2010-09-29') > SELECT > 20100929 as stamp, > tagtype, > country, > sum(num_requests) AS num_requests > FROM > hourly_fact HF > WHERE > (HF.dt = '2010-09-29' AND HF.hr > '07' ) > OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > GROUP BY > 20100929, tagtype, country > > Based on the WHERE clause, I would expect it to look only at partitions in the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log contains entries like: > > 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > > And many other hours outside my WHERE constraint. I assume this means that it's processing those directories. The answer still comes out right, but I'm concerned about the performance. > > Would appreciate some help understanding what this means and how to fix it. > > Thanks, > Marc > >
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseEdward Capriolo 2010-10-05, 23:30
On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote:
> Hi Namit, > > Hourly_fact is partitioned on dt and hr. > > Marc > > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote: >> What is your table hourly_fact partitioned on ? >> >> ________________________________________ >> From: Marc Limotte [[EMAIL PROTECTED]] >> Sent: Friday, October 01, 2010 2:10 PM >> To: [EMAIL PROTECTED] >> Subject: hive query doesn't seem to limit itself to partitions based on >> the WHERE clause >> >> Hi, >> >> From looking at the hive log output, it seems that my job is accessing >> many more partitions than it needs to? For example, my query is something >> like: >> >> INSERT OVERWRITE TABLE daily_fact >> PARTITION (dt='2010-09-29') >> SELECT >> 20100929 as stamp, >> tagtype, >> country, >> sum(num_requests) AS num_requests >> FROM >> hourly_fact HF >> WHERE >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) >> GROUP BY >> 20100929, tagtype, country >> >> Based on the WHERE clause, I would expect it to look only at partitions in >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log >> contains entries like: >> >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 >> >> And many other hours outside my WHERE constraint. I assume this means that >> it's processing those directories. The answer still comes out right, but I'm >> concerned about the performance. >> >> Would appreciate some help understanding what this means and how to fix >> it. >> >> Thanks, >> Marc >> >> > Possibly you defined HF.hr <= '07' as an int column and comparing it as a string is resulting in a full table scan. Can you explain the query?
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseMarc Limotte 2010-10-06, 18:11
Thanks for the response, Edward.
The source table (hourly_fact) is partitioned on dt (date) and hr (hour), and I've confirmed that they are both String fields (CREATE stmt is below). The hourly_fact table contains 'number of requests' for each hour by a few dimensions. The query is just trying to get a daily aggregation across those same dimensions. The only trick is that the hourly_fact table has dt and hour in UTC time. And the daily aggregation is being done for a PST (pacific std) day, hence the 7 hour offset. CREATE TABLE IF NOT EXISTS hourly_fact ( tagtype STRING, country STRING, company INT, request_keyword STRING, receiver_code STRING, referrer_domain STRING, num_requests INT, num_new_user_requests INT ) PARTITIONED BY (dt STRING, hr STRING) ROW FORMAT DELIMITED STORED AS SEQUENCEFILE LOCATION "..."; Marc On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote: > > Hi Namit, > > > > Hourly_fact is partitioned on dt and hr. > > > > Marc > > > > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote: > >> What is your table hourly_fact partitioned on ? > >> > >> ________________________________________ > >> From: Marc Limotte [[EMAIL PROTECTED]] > >> Sent: Friday, October 01, 2010 2:10 PM > >> To: [EMAIL PROTECTED] > >> Subject: hive query doesn't seem to limit itself to partitions based on > >> the WHERE clause > >> > >> Hi, > >> > >> From looking at the hive log output, it seems that my job is accessing > >> many more partitions than it needs to? For example, my query is > something > >> like: > >> > >> INSERT OVERWRITE TABLE daily_fact > >> PARTITION (dt='2010-09-29') > >> SELECT > >> 20100929 as stamp, > >> tagtype, > >> country, > >> sum(num_requests) AS num_requests > >> FROM > >> hourly_fact HF > >> WHERE > >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) > >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > >> GROUP BY > >> 20100929, tagtype, country > >> > >> Based on the WHERE clause, I would expect it to look only at partitions > in > >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the > log > >> contains entries like: > >> > >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > >> > >> And many other hours outside my WHERE constraint. I assume this means > that > >> it's processing those directories. The answer still comes out right, but > I'm > >> concerned about the performance. > >> > >> Would appreciate some help understanding what this means and how to fix > >> it. > >> > >> Thanks, > >> Marc > >> > >> > > > Possibly you defined HF.hr <= '07' as an int column and comparing it > as a string is resulting in a full table scan. Can you explain the > query? >
-
RE: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseSteven Wong 2010-10-07, 00:05
What Hive version are you running? Try an "explain extended" on your insert query and see if unneeded partitions are included.
Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) is UTC-07:00. To convert UTC to PDT, the condition should be: (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND HF.hr < '07' ) instead of: (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) Good luck on the days we spring forward or fall back. :)/:( From: Marc Limotte [mailto:[EMAIL PROTECTED]] Sent: Wednesday, October 06, 2010 11:12 AM To: [EMAIL PROTECTED] Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause Thanks for the response, Edward. The source table (hourly_fact) is partitioned on dt (date) and hr (hour), and I've confirmed that they are both String fields (CREATE stmt is below). The hourly_fact table contains 'number of requests' for each hour by a few dimensions. The query is just trying to get a daily aggregation across those same dimensions. The only trick is that the hourly_fact table has dt and hour in UTC time. And the daily aggregation is being done for a PST (pacific std) day, hence the 7 hour offset. CREATE TABLE IF NOT EXISTS hourly_fact ( tagtype STRING, country STRING, company INT, request_keyword STRING, receiver_code STRING, referrer_domain STRING, num_requests INT, num_new_user_requests INT ) PARTITIONED BY (dt STRING, hr STRING) ROW FORMAT DELIMITED STORED AS SEQUENCEFILE LOCATION "..."; Marc On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: > Hi Namit, > > Hourly_fact is partitioned on dt and hr. > > Marc > > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: >> What is your table hourly_fact partitioned on ? >> >> ________________________________________ >> From: Marc Limotte [[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>] >> Sent: Friday, October 01, 2010 2:10 PM >> To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> >> Subject: hive query doesn't seem to limit itself to partitions based on >> the WHERE clause >> >> Hi, >> >> From looking at the hive log output, it seems that my job is accessing >> many more partitions than it needs to? For example, my query is something >> like: >> >> INSERT OVERWRITE TABLE daily_fact >> PARTITION (dt='2010-09-29') >> SELECT >> 20100929 as stamp, >> tagtype, >> country, >> sum(num_requests) AS num_requests >> FROM >> hourly_fact HF >> WHERE >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) >> GROUP BY >> 20100929, tagtype, country >> >> Based on the WHERE clause, I would expect it to look only at partitions in >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the log >> contains entries like: >> >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 >> >> And many other hours outside my WHERE constraint. I assume this means that >> it's processing those directories. The answer still comes out right, but I'm >> concerned about the performance. >> >> Would appreciate some help understanding what this means and how to fix >> it. >> >> Thanks, >> Marc >> >> > Possibly you defined HF.hr <= '07' as an int column and comparing it as a string is resulting in a full table scan. Can you explain the query?
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseEdward Capriolo 2010-10-07, 02:56
On Wed, Oct 6, 2010 at 8:05 PM, Steven Wong <[EMAIL PROTECTED]> wrote:
> What Hive version are you running? Try an “explain extended” on your insert > query and see if unneeded partitions are included. > > > > Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) > is UTC-07:00. To convert UTC to PDT, the condition should be: > > (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND HF.hr > < '07' ) > > instead of: > > (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr > <= '07' ) > > > > Good luck on the days we spring forward or fall back. J/L > > > > > > From: Marc Limotte [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, October 06, 2010 11:12 AM > To: [EMAIL PROTECTED] > Subject: Re: RE: hive query doesn't seem to limit itself to partitions based > on the WHERE clause > > > > Thanks for the response, Edward. > > The source table (hourly_fact) is partitioned on dt (date) and hr (hour), > and I've confirmed that they are both String fields (CREATE stmt is below). > > The hourly_fact table contains 'number of requests' for each hour by a few > dimensions. The query is just trying to get a daily aggregation across > those same dimensions. The only trick is that the hourly_fact table has dt > and hour in UTC time. And the daily aggregation is being done for a PST > (pacific std) day, hence the 7 hour offset. > > CREATE TABLE IF NOT EXISTS hourly_fact ( > tagtype STRING, > country STRING, > company INT, > request_keyword STRING, > receiver_code STRING, > referrer_domain STRING, > num_requests INT, > num_new_user_requests INT > ) > PARTITIONED BY (dt STRING, hr STRING) > ROW FORMAT DELIMITED > STORED AS SEQUENCEFILE > LOCATION "..."; > > Marc > > On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]> > wrote: > > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote: >> Hi Namit, >> >> Hourly_fact is partitioned on dt and hr. >> >> Marc >> >> On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote: >>> What is your table hourly_fact partitioned on ? >>> >>> ________________________________________ >>> From: Marc Limotte [[EMAIL PROTECTED]] >>> Sent: Friday, October 01, 2010 2:10 PM >>> To: [EMAIL PROTECTED] >>> Subject: hive query doesn't seem to limit itself to partitions based on >>> the WHERE clause >>> >>> Hi, >>> >>> From looking at the hive log output, it seems that my job is accessing >>> many more partitions than it needs to? For example, my query is something >>> like: >>> >>> INSERT OVERWRITE TABLE daily_fact >>> PARTITION (dt='2010-09-29') >>> SELECT >>> 20100929 as stamp, >>> tagtype, >>> country, >>> sum(num_requests) AS num_requests >>> FROM >>> hourly_fact HF >>> WHERE >>> (HF.dt = '2010-09-29' AND HF.hr > '07' ) >>> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) >>> GROUP BY >>> 20100929, tagtype, country >>> >>> Based on the WHERE clause, I would expect it to look only at partitions >>> in >>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the >>> log >>> contains entries like: >>> >>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file >>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 >>> >>> And many other hours outside my WHERE constraint. I assume this means >>> that >>> it's processing those directories. The answer still comes out right, but >>> I'm >>> concerned about the performance. >>> >>> Would appreciate some help understanding what this means and how to fix >>> it. >>> >>> Thanks, >>> Marc >>> >>> >> > > Possibly you defined HF.hr <= '07' as an int column and comparing it > as a string is resulting in a full table scan. Can you explain the > query? > > Since you defined '07' as a string you are getting a lexicographic comparison rather then a numeric one. That is why you are seeing more columns then you expect. = will work the same but < > will not. You can try to cast the query, or drop and add the partition using a numeric type.
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseMarc Limotte 2010-10-07, 23:12
Thanks, I'm going to try explain extended and see what I get.
I'm using HF.hr > '07' and HF.hr <= '07', because I actually want P*S*T (in order to avoid the spring forward / fallback problem you mentioned). Marc On Wed, Oct 6, 2010 at 5:05 PM, Steven Wong <[EMAIL PROTECTED]> wrote: > What Hive version are you running? Try an “explain extended” on your > insert query and see if unneeded partitions are included. > > > > Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) > is UTC-07:00. To convert UTC to PDT, the condition should be: > > (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND > HF.hr < '07' ) > > instead of: > > (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr > <= '07' ) > > > > Good luck on the days we spring forward or fall back. J/L > > > > > > *From:* Marc Limotte [mailto:[EMAIL PROTECTED]] > *Sent:* Wednesday, October 06, 2010 11:12 AM > > *To:* [EMAIL PROTECTED] > *Subject:* Re: RE: hive query doesn't seem to limit itself to partitions > based on the WHERE clause > > > > Thanks for the response, Edward. > > The source table (hourly_fact) is partitioned on dt (date) and hr (hour), > and I've confirmed that they are both String fields (CREATE stmt is below). > > The hourly_fact table contains 'number of requests' for each hour by a few > dimensions. The query is just trying to get a daily aggregation across > those same dimensions. The only trick is that the hourly_fact table has dt > and hour in UTC time. And the daily aggregation is being done for a PST > (pacific std) day, hence the 7 hour offset. > > CREATE TABLE IF NOT EXISTS hourly_fact ( > tagtype STRING, > country STRING, > company INT, > request_keyword STRING, > receiver_code STRING, > referrer_domain STRING, > num_requests INT, > num_new_user_requests INT > ) > PARTITIONED BY (dt STRING, hr STRING) > ROW FORMAT DELIMITED > STORED AS SEQUENCEFILE > LOCATION "..."; > > > Marc > > On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]> > wrote: > > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> wrote: > > Hi Namit, > > > > Hourly_fact is partitioned on dt and hr. > > > > Marc > > > > On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote: > >> What is your table hourly_fact partitioned on ? > >> > >> ________________________________________ > >> From: Marc Limotte [[EMAIL PROTECTED]] > >> Sent: Friday, October 01, 2010 2:10 PM > >> To: [EMAIL PROTECTED] > >> Subject: hive query doesn't seem to limit itself to partitions based on > >> the WHERE clause > >> > >> Hi, > >> > >> From looking at the hive log output, it seems that my job is accessing > >> many more partitions than it needs to? For example, my query is > something > >> like: > >> > >> INSERT OVERWRITE TABLE daily_fact > >> PARTITION (dt='2010-09-29') > >> SELECT > >> 20100929 as stamp, > >> tagtype, > >> country, > >> sum(num_requests) AS num_requests > >> FROM > >> hourly_fact HF > >> WHERE > >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) > >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > >> GROUP BY > >> 20100929, tagtype, country > >> > >> Based on the WHERE clause, I would expect it to look only at partitions > in > >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the > log > >> contains entries like: > >> > >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > >> > >> And many other hours outside my WHERE constraint. I assume this means > that > >> it's processing those directories. The answer still comes out right, but > I'm > >> concerned about the performance. > >> > >> Would appreciate some help understanding what this means and how to fix > >> it. > >> > >> Thanks, > >> Marc > >> > >> > > > > Possibly you defined HF.hr <= '07' as an int column and comparing it
-
Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseMarc Limotte 2010-10-07, 23:23
I'm using Hive 0.5.0.
Edward, I'm expecting a lexicographic comparison. I use two digit strings for hr (e.g. '07'), so I would think that the comparison would work. '08' is greater than '07'. I'll try your suggestions, maybe the cast or changing the data type will work. Thanks, Marc On Wed, Oct 6, 2010 at 7:56 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > On Wed, Oct 6, 2010 at 8:05 PM, Steven Wong <[EMAIL PROTECTED]> wrote: > > What Hive version are you running? Try an “explain extended” on your > insert > > query and see if unneeded partitions are included. > > > > > > > > Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time > (PDT) > > is UTC-07:00. To convert UTC to PDT, the condition should be: > > > > (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND > HF.hr > > < '07' ) > > > > instead of: > > > > (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND > HF.hr > > <= '07' ) > > > > > > > > Good luck on the days we spring forward or fall back. J/L > > > > > > > > > > > > From: Marc Limotte [mailto:[EMAIL PROTECTED]] > > Sent: Wednesday, October 06, 2010 11:12 AM > > To: [EMAIL PROTECTED] > > Subject: Re: RE: hive query doesn't seem to limit itself to partitions > based > > on the WHERE clause > > > > > > > > Thanks for the response, Edward. > > > > The source table (hourly_fact) is partitioned on dt (date) and hr (hour), > > and I've confirmed that they are both String fields (CREATE stmt is > below). > > > > The hourly_fact table contains 'number of requests' for each hour by a > few > > dimensions. The query is just trying to get a daily aggregation across > > those same dimensions. The only trick is that the hourly_fact table has > dt > > and hour in UTC time. And the daily aggregation is being done for a PST > > (pacific std) day, hence the 7 hour offset. > > > > CREATE TABLE IF NOT EXISTS hourly_fact ( > > tagtype STRING, > > country STRING, > > company INT, > > request_keyword STRING, > > receiver_code STRING, > > referrer_domain STRING, > > num_requests INT, > > num_new_user_requests INT > > ) > > PARTITIONED BY (dt STRING, hr STRING) > > ROW FORMAT DELIMITED > > STORED AS SEQUENCEFILE > > LOCATION "..."; > > > > Marc > > > > On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]> > > wrote: > > > > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]> > wrote: > >> Hi Namit, > >> > >> Hourly_fact is partitioned on dt and hr. > >> > >> Marc > >> > >> On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote: > >>> What is your table hourly_fact partitioned on ? > >>> > >>> ________________________________________ > >>> From: Marc Limotte [[EMAIL PROTECTED]] > >>> Sent: Friday, October 01, 2010 2:10 PM > >>> To: [EMAIL PROTECTED] > >>> Subject: hive query doesn't seem to limit itself to partitions based on > >>> the WHERE clause > >>> > >>> Hi, > >>> > >>> From looking at the hive log output, it seems that my job is accessing > >>> many more partitions than it needs to? For example, my query is > something > >>> like: > >>> > >>> INSERT OVERWRITE TABLE daily_fact > >>> PARTITION (dt='2010-09-29') > >>> SELECT > >>> 20100929 as stamp, > >>> tagtype, > >>> country, > >>> sum(num_requests) AS num_requests > >>> FROM > >>> hourly_fact HF > >>> WHERE > >>> (HF.dt = '2010-09-29' AND HF.hr > '07' ) > >>> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > >>> GROUP BY > >>> 20100929, tagtype, country > >>> > >>> Based on the WHERE clause, I would expect it to look only at partitions > >>> in > >>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, > the > >>> log > >>> contains entries like: > >>> > >>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > >>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > >>> > >>> And many other hours outside my WHERE constraint. I assume this means
-
RE: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clauseSteven Wong 2010-10-08, 20:39
The comparison should work. At least it does when I do this with my 0.5:
$ hive -e "select '01'<'07', '07'<'07', '20'<'07' from dual" true false false From: Marc Limotte [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 07, 2010 4:24 PM To: [EMAIL PROTECTED] Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause I'm using Hive 0.5.0. Edward, I'm expecting a lexicographic comparison. I use two digit strings for hr (e.g. '07'), so I would think that the comparison would work. '08' is greater than '07'. I'll try your suggestions, maybe the cast or changing the data type will work. Thanks, Marc On Wed, Oct 6, 2010 at 7:56 PM, Edward Capriolo <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: On Wed, Oct 6, 2010 at 8:05 PM, Steven Wong <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: > What Hive version are you running? Try an "explain extended" on your insert > query and see if unneeded partitions are included. > > > > Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) > is UTC-07:00. To convert UTC to PDT, the condition should be: > > (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND HF.hr > < '07' ) > > instead of: > > (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr > <= '07' ) > > > > Good luck on the days we spring forward or fall back. J/L > > > > > > From: Marc Limotte [mailto:[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>] > Sent: Wednesday, October 06, 2010 11:12 AM > To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> > Subject: Re: RE: hive query doesn't seem to limit itself to partitions based > on the WHERE clause > > > > Thanks for the response, Edward. > > The source table (hourly_fact) is partitioned on dt (date) and hr (hour), > and I've confirmed that they are both String fields (CREATE stmt is below). > > The hourly_fact table contains 'number of requests' for each hour by a few > dimensions. The query is just trying to get a daily aggregation across > those same dimensions. The only trick is that the hourly_fact table has dt > and hour in UTC time. And the daily aggregation is being done for a PST > (pacific std) day, hence the 7 hour offset. > > CREATE TABLE IF NOT EXISTS hourly_fact ( > tagtype STRING, > country STRING, > company INT, > request_keyword STRING, > receiver_code STRING, > referrer_domain STRING, > num_requests INT, > num_new_user_requests INT > ) > PARTITIONED BY (dt STRING, hr STRING) > ROW FORMAT DELIMITED > STORED AS SEQUENCEFILE > LOCATION "..."; > > Marc > > On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> > wrote: > > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: >> Hi Namit, >> >> Hourly_fact is partitioned on dt and hr. >> >> Marc >> >> On Oct 3, 2010 10:00 PM, "Namit Jain" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: >>> What is your table hourly_fact partitioned on ? >>> >>> ________________________________________ >>> From: Marc Limotte [[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>] >>> Sent: Friday, October 01, 2010 2:10 PM >>> To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> >>> Subject: hive query doesn't seem to limit itself to partitions based on >>> the WHERE clause >>> >>> Hi, >>> >>> From looking at the hive log output, it seems that my job is accessing >>> many more partitions than it needs to? For example, my query is something >>> like: >>> >>> INSERT OVERWRITE TABLE daily_fact >>> PARTITION (dt='2010-09-29') >>> SELECT >>> 20100929 as stamp, >>> tagtype, >>> country, >>> sum(num_requests) AS num_requests >>> FROM >>> hourly_fact HF >>> WHERE >>> (HF.dt = '2010-09-29' AND HF.hr > '07' ) >>> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) >>> GROUP BY >>> 20100929, tagtype, country Since you defined '07' as a string you are getting a lexicographic comparison rather then a numeric one. That is why you are seeing more columns then you expect. = will work the same but < > will not. You can try to cast the query, or drop and add the partition using a numeric type. |