Hello,
I was wondering if there is a way to quick-verify a Hive query before it is run against a big dataset? The tables I am querying against have millions of records, and I'd like to verify my Hive query before I run it against all records.
Is there a way to test the query against a small subset of the data, without going into full MapReduce? As silly as this sounds, is there a way to MapReduce without the overhead of MapReduce? That way I can check my query is doing what I want before I run it against all records.
Thanks,
-Kyle
Joey D'Antoni 2013-03-05, 18:48
Just add a limit 1 to the end of your query.
Connell, Chuck 2013-03-05, 18:51
Using the Hive sampling feature would also help. This is exactly what that feature is designed for.
Chuck
Dean Wampler 2013-03-05, 18:57
Unfortunately, it will still go through the whole thing, then just limit the output. However, there's a flag that I think only works in more recent Hive releases:
set hive.limit.optimize.enable=true
This is supposed to apply the limit earlier in the data stream, so it will give different results than limiting just the output.
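As a rough sketch of how that flag might be used in a session (the table name is the placeholder used elsewhere in this thread, and the two tuning properties are taken from the Hive configuration reference as I remember it, so check them against your release):

```sql
-- Ask Hive to try a smaller subset of the input for simple LIMIT queries.
set hive.limit.optimize.enable=true;

-- Related knobs (assumed defaults; verify against your Hive version):
-- minimum size guaranteed per row when estimating how much data to read
set hive.limit.row.max.size=100000;
-- maximum number of files Hive may sample
set hive.limit.optimize.limit.file=10;

-- really_big_table is a placeholder name.
select * from really_big_table limit 100;
```

Because the rows come from a subset of the input, the result is a sample and not necessarily the same rows an unoptimized LIMIT would return.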
Like Chuck said, you might consider sampling, but unless your table is organized into buckets, you'll still scan the whole table, though possibly without doing all of the computation over it (I'm not sure about that part).
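For reference, the sampling syntax looks roughly like this. It is a sketch following the forms in the Hive language manual; really_big_table is a placeholder, and whether block sampling is available depends on your Hive version and input format:

```sql
-- Bucket sampling on rand(): works on any table, but Hive still
-- scans the whole table and keeps roughly 1/32 of the rows.
select *
from really_big_table tablesample(bucket 1 out of 32 on rand()) t;

-- Block sampling: picks input at block/split granularity, so it can
-- avoid reading most of the table. The sample may be larger than the
-- requested percentage because whole blocks are taken.
select *
from really_big_table tablesample(0.1 percent) t;
```

If the table is bucketed on a column, sampling on that column instead of rand() should let Hive read only the matching bucket files.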
Also, if you have a small sample data set:
set hive.exec.mode.local.auto=true
will cause Hive to bypass the Job and Task Trackers, calling APIs directly, when it can do the whole thing in a single process. Not "lightning fast", but faster.
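A minimal sketch of a session using local mode. The size threshold property and its default are as I recall them from the Hive configuration reference, so treat them as assumptions; small_sample_table is a hypothetical table holding your sample data:

```sql
-- Let Hive decide per query whether it can run in a single local process.
set hive.exec.mode.local.auto=true;

-- Local mode is only chosen when the total input is below this size
-- (assumed default: 128 MB).
set hive.exec.mode.local.auto.inputbytes.max=134217728;

-- small_sample_table is a hypothetical name for a small copy of the data.
select count(*) from small_sample_table;
```

If the input is larger than the threshold, Hive falls back to submitting a normal cluster job, so this only helps when the sample really is small.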
dean
-- Dean Wampler, Ph.D. thinkbiganalytics.com +1-312-339-1330
Mark Grover 2013-03-05, 19:26
I typically change my query to query from a limited version of the whole table.
Change

select really_expensive_select_clause
from really_big_table
where something=something
group by something=something

to

select really_expensive_select_clause
from (
  select * from really_big_table limit 100
) t
where something=something
group by something=something
Dean Wampler 2013-03-05, 19:44
Nice, yeah, that would do it.
Ramki Palle 2013-03-08, 11:30
If none of the 100 rows that the sub-query returns satisfies the where clause, there would be no rows in the overall result. Do we still consider the Hive query verified in this case?
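One possible way around this, offered only as a sketch and not something proposed in the thread, is to move the filter inside the sub-query so the limit is applied to rows that already match. The names are the same placeholders used in the earlier message, so this is a template and not a runnable query:

```sql
-- Filter first, then limit, so up to 100 matching rows reach the
-- expensive select clause and the group by.
select really_expensive_select_clause
from (
  select *
  from really_big_table
  where something=something
  limit 100
) t
group by something=something;
```

The trade-off is that Hive may have to read more of the table before it finds 100 matching rows, particularly when the filter is very selective.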
Regards,
Ramki.