Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Using TABLESAMPLE on inner queries


+
Robert Li 2013-03-20, 16:56
+
Mark Grover 2013-03-20, 19:17
+
Dean Wampler 2013-03-20, 19:20
+
Mark Grover 2013-03-20, 19:27
Copy link to this message
-
Re: Using TABLESAMPLE on inner queries
You may use percent based (block sampling) sampling for non-bucketed
tables, though there are some restrictions.

https://cwiki.apache.org/Hive/languagemanual-sampling.html

Regards,
Ramki.
On Wed, Mar 20, 2013 at 12:27 PM, Mark Grover
<[EMAIL PROTECTED]>wrote:

> Hey Dean,
> I am not a power user of the sampling feature but my understanding was
> that sampling in Hive only works on bucketed tables. I am happy to be
> corrected though.
>
> Mark
>
>
> On Wed, Mar 20, 2013 at 12:20 PM, Dean Wampler <
> [EMAIL PROTECTED]> wrote:
>
>> Mark,
>>
>> Aside from what might be wrong here, isn't it true that sampling with the
>> bucket clause still works on non-bucketed tables; it's just inefficient
>> because it still scans the whole table? Or am I an idiot? ;)
>>
>> dean
>>
>> On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover <[EMAIL PROTECTED]
>> > wrote:
>>
>>> Hi Robert,
>>> Sampling in Hive is based on buckets. Therefore, you table needs to be
>>> appropriately bucketed.
>>>
>>> I would recommend storing the results of your inner query in a bucketed
>>> table. See how to populate a bucketed table at
>>> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>
>>> Then you will be able to be sample through it.
>>>
>>> If the predicate is on partition column, you may be able to get around
>>> the intermediate table requirement but in general, as far as I know,
>>> intermediate bucketed table might be the only choice.
>>>
>>> Mark
>>>
>>> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <[EMAIL PROTECTED]>wrote:
>>>
>>>> Hi Everyone
>>>>
>>>> I'm trying to use the TABLESAMPLE function to sample data, however it's
>>>> a little more complicated and I am having trouble getting it to run.
>>>>
>>>> I know that this works fine and it will give me about 25% of the whole
>>>> dataset
>>>>
>>>> select distinct s
>>>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
>>>> where month <= 201211
>>>>
>>>> However, in my situation I need to do a TABLESAMPLE on an outer query,
>>>> a simple example is
>>>>
>>>> *select mytest.s *
>>>> *from *
>>>> * (select distinct s from testtable where month <= 201211)mytest*
>>>>
>>>> or something like
>>>>
>>>> *select table1.s*
>>>> *from *
>>>> * (select distinct s from testtable)table1*
>>>> * join*
>>>> * (select distinct s from test2table)table2*
>>>> * on table1.s=table2.s*
>>>>
>>>>
>>>> How do I use TABLESAMPLE in this case to sample the results of the
>>>> outer query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
>>>> places of my query but it always returns some sort of syntax error and thus
>>>> not allowing the query to run.
>>>>
>>>> Any help is appreciated.
>>>>
>>>> Robert
>>>> **
>>>>
>>>
>>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB