Hive >> mail # user >> Hive sample test


Messages in thread (collapsed):
Kyle B 2013-03-05, 18:45
Connell, Chuck 2013-03-05, 18:51
Joey D'Antoni 2013-03-05, 18:48
Dean Wampler 2013-03-05, 18:57

Re: Hive sample test
I typically rewrite the query to select from a limited version of the whole table.

Change

select really_expensive_select_clause
from
really_big_table
where
something=something
group by something

to

select really_expensive_select_clause
from
(
select
*
from
really_big_table
limit 100
) t
where
something=something
group by something
On Tue, Mar 5, 2013 at 10:57 AM, Dean Wampler
<[EMAIL PROTECTED]> wrote:
> Unfortunately, it will still go through the whole thing, then just limit the
> output. However, there's a flag that I think only works in more recent Hive
> releases:
>
> set hive.limit.optimize.enable=true
>
> This is supposed to apply limiting earlier in the data stream, so it may
> give different results than limiting just the output.
>
> Like Chuck said, you might consider sampling, but unless your table is
> organized into buckets, you'll still scan the whole table, though perhaps
> without doing all of the computation over it.
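Bucket sampling, as alluded to above, looks roughly like the following. This is a sketch, assuming `really_big_table` was bucketed on `something` at creation time; the bucket count of 32 is illustrative.

```sql
-- Assumes the table was bucketed when created, e.g.:
--   CREATE TABLE really_big_table (...) CLUSTERED BY (something) INTO 32 BUCKETS;
-- Read roughly 1/32 of the data (one bucket) instead of the full table:
SELECT really_expensive_select_clause
FROM really_big_table TABLESAMPLE(BUCKET 1 OUT OF 32 ON something)
WHERE something = something
GROUP BY something;
```

If the table is not bucketed on the sampled column, Hive still scans the full table and filters rows, so the bucketing matters for actually skipping I/O.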
>
> Also, if you have a small sample data set:
>
> set hive.exec.mode.local.auto=true
>
> will cause Hive to bypass the Job and Task Trackers, calling APIs directly,
> when it can do the whole thing in a single process. Not "lightning fast",
> but faster.
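The auto local mode mentioned above is gated by input-size thresholds that can be tuned alongside the flag; the values shown below are the usual defaults, given here for illustration.

```sql
-- Let Hive run small jobs in a single local JVM, bypassing the cluster:
set hive.exec.mode.local.auto=true;
-- Local mode applies only when inputs fall under these thresholds:
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
set hive.exec.mode.local.auto.input.files.max=4;
```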
>
> dean
>
> On Tue, Mar 5, 2013 at 12:48 PM, Joey D'Antoni <[EMAIL PROTECTED]> wrote:
>>
>> Just add a limit 1 to the end of your query.
>>
>>
>>
>>
>> On Mar 5, 2013, at 1:45 PM, Kyle B <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> I was wondering if there is a way to quick-verify a Hive query before it
>> is run against a big dataset? The tables I am querying against have millions
>> of records, and I'd like to verify my Hive query before I run it against all
>> records.
>>
>> Is there a way to test the query against a small subset of the data,
>> without going into full MapReduce? As silly as this sounds, is there a way
>> to MapReduce without the overhead of MapReduce? That way I can check my
>> query is doing what I want before I run it against all records.
>>
>> Thanks,
>>
>> -Kyle
>
>
>
>
> --
> Dean Wampler, Ph.D.
> thinkbiganalytics.com
> +1-312-339-1330
>
Later replies (collapsed):
Dean Wampler 2013-03-05, 19:44
Ramki Palle 2013-03-08, 11:30