Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Using TABLESAMPLE on inner queries


+
Robert Li 2013-03-20, 16:56
+
Mark Grover 2013-03-20, 19:17
+
Dean Wampler 2013-03-20, 19:20
+
Mark Grover 2013-03-20, 19:27
Copy link to this message
-
Re: Using TABLESAMPLE on inner queries
Ramki Palle 2013-03-20, 21:25
You may use percent based (block sampling) sampling for non-bucketed
tables, though there are some restrictions.

https://cwiki.apache.org/Hive/languagemanual-sampling.html

Regards,
Ramki.
On Wed, Mar 20, 2013 at 12:27 PM, Mark Grover
<[EMAIL PROTECTED]>wrote:

> Hey Dean,
> I am not a power user of the sampling feature but my understanding was
> that sampling in Hive only works on bucketed tables. I am happy to be
> corrected though.
>
> Mark
>
>
> On Wed, Mar 20, 2013 at 12:20 PM, Dean Wampler <
> [EMAIL PROTECTED]> wrote:
>
>> Mark,
>>
>> Aside from what might be wrong here, isn't it true that sampling with the
>> bucket clause still works on non-bucketed tables; it's just inefficient
>> because it still scans the whole table? Or am I an idiot? ;)
>>
>> dean
>>
>> On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover <[EMAIL PROTECTED]
>> > wrote:
>>
>>> Hi Robert,
>>> Sampling in Hive is based on buckets. Therefore, you table needs to be
>>> appropriately bucketed.
>>>
>>> I would recommend storing the results of your inner query in a bucketed
>>> table. See how to populate a bucketed table at
>>> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>
>>> Then you will be able to be sample through it.
>>>
>>> If the predicate is on partition column, you may be able to get around
>>> the intermediate table requirement but in general, as far as I know,
>>> intermediate bucketed table might be the only choice.
>>>
>>> Mark
>>>
>>> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <[EMAIL PROTECTED]>wrote:
>>>
>>>> Hi Everyone
>>>>
>>>> I'm trying to use the TABLESAMPLE function to sample data, however it's
>>>> a little more complicated and I am having trouble getting it to run.
>>>>
>>>> I know that this works fine and it will give me about 25% of the whole
>>>> dataset
>>>>
>>>> select distinct s
>>>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
>>>> where month <= 201211
>>>>
>>>> However, in my situation I need to do a TABLESAMPLE on an outer query,
>>>> a simple example is
>>>>
>>>> *select mytest.s *
>>>> *from *
>>>> * (select distinct s from testtable where month <= 201211)mytest*
>>>>
>>>> or something like
>>>>
>>>> *select table1.s*
>>>> *from *
>>>> * (select distinct s from testtable)table1*
>>>> * join*
>>>> * (select distinct s from test2table)table2*
>>>> * on table1.s=table2.s*
>>>>
>>>>
>>>> How do I use TABLESAMPLE in this case to sample the results of the
>>>> outer query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
>>>> places of my query but it always returns some sort of syntax error and thus
>>>> not allowing the query to run.
>>>>
>>>> Any help is appreciated.
>>>>
>>>> Robert
>>>> **
>>>>
>>>
>>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>
>