Aside from what might be wrong here, isn't it true that sampling with the
bucket clause still works on non-bucketed tables; it's just inefficient
because it still scans the whole table? Or am I an idiot? ;)
On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover <[EMAIL PROTECTED]>wrote:
> Hi Robert,
> Sampling in Hive is based on buckets. Therefore, you table needs to be
> appropriately bucketed.
> I would recommend storing the results of your inner query in a bucketed
> table. See how to populate a bucketed table at
> Then you will be able to be sample through it.
> If the predicate is on partition column, you may be able to get around the
> intermediate table requirement but in general, as far as I know,
> intermediate bucketed table might be the only choice.
> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <[EMAIL PROTECTED]>wrote:
>> Hi Everyone
>> I'm trying to use the TABLESAMPLE function to sample data, however it's a
>> little more complicated and I am having trouble getting it to run.
>> I know that this works fine and it will give me about 25% of the whole
>> select distinct s
>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
>> where month <= 201211
>> However, in my situation I need to do a TABLESAMPLE on an outer query, a
>> simple example is
>> *select mytest.s *
>> *from *
>> * (select distinct s from testtable where month <= 201211)mytest*
>> or something like
>> *select table1.s*
>> *from *
>> * (select distinct s from testtable)table1*
>> * join*
>> * (select distinct s from test2table)table2*
>> * on table1.s=table2.s*
>> How do I use TABLESAMPLE in this case to sample the results of the outer
>> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
>> places of my query but it always returns some sort of syntax error and thus
>> not allowing the query to run.
>> Any help is appreciated.
*Dean Wampler, Ph.D.*