Sampling in Hive is based on buckets. Therefore, you table needs to be
I would recommend storing the results of your inner query in a bucketed
table. See how to populate a bucketed table at
Then you will be able to be sample through it.
If the predicate is on partition column, you may be able to get around the
intermediate table requirement but in general, as far as I know,
intermediate bucketed table might be the only choice.
On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <[EMAIL PROTECTED]> wrote:
> Hi Everyone
> I'm trying to use the TABLESAMPLE function to sample data, however it's a
> little more complicated and I am having trouble getting it to run.
> I know that this works fine and it will give me about 25% of the whole
> select distinct s
> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
> where month <= 201211
> However, in my situation I need to do a TABLESAMPLE on an outer query, a
> simple example is
> *select mytest.s *
> *from *
> * (select distinct s from testtable where month <= 201211)mytest*
> or something like
> *select table1.s*
> *from *
> * (select distinct s from testtable)table1*
> * join*
> * (select distinct s from test2table)table2*
> * on table1.s=table2.s*
> How do I use TABLESAMPLE in this case to sample the results of the outer
> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
> places of my query but it always returns some sort of syntax error and thus
> not allowing the query to run.
> Any help is appreciated.