Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Using TABLESAMPLE on inner queries


+
Robert Li 2013-03-20, 16:56
+
Mark Grover 2013-03-20, 19:17
Copy link to this message
-
Re: Using TABLESAMPLE on inner queries
Dean Wampler 2013-03-20, 19:20
Mark,

Aside from what might be wrong here, isn't it true that sampling with the
bucket clause still works on non-bucketed tables; it's just inefficient
because it still scans the whole table? Or am I an idiot? ;)

dean

On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover <[EMAIL PROTECTED]>wrote:

> Hi Robert,
> Sampling in Hive is based on buckets. Therefore, you table needs to be
> appropriately bucketed.
>
> I would recommend storing the results of your inner query in a bucketed
> table. See how to populate a bucketed table at
> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>
> Then you will be able to be sample through it.
>
> If the predicate is on partition column, you may be able to get around the
> intermediate table requirement but in general, as far as I know,
> intermediate bucketed table might be the only choice.
>
> Mark
>
> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <[EMAIL PROTECTED]>wrote:
>
>> Hi Everyone
>>
>> I'm trying to use the TABLESAMPLE function to sample data, however it's a
>> little more complicated and I am having trouble getting it to run.
>>
>> I know that this works fine and it will give me about 25% of the whole
>> dataset
>>
>> select distinct s
>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
>> where month <= 201211
>>
>> However, in my situation I need to do a TABLESAMPLE on an outer query, a
>> simple example is
>>
>> *select mytest.s *
>> *from *
>> * (select distinct s from testtable where month <= 201211)mytest*
>>
>> or something like
>>
>> *select table1.s*
>> *from *
>> * (select distinct s from testtable)table1*
>> * join*
>> * (select distinct s from test2table)table2*
>> * on table1.s=table2.s*
>>
>>
>> How do I use TABLESAMPLE in this case to sample the results of the outer
>> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
>> places of my query but it always returns some sort of syntax error and thus
>> not allowing the query to run.
>>
>> Any help is appreciated.
>>
>> Robert
>> **
>>
>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
+
Mark Grover 2013-03-20, 19:27
+
Ramki Palle 2013-03-20, 21:25