Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Using TABLESAMPLE on inner queries


Copy link to this message
-
Re: Using TABLESAMPLE on inner queries
Hi Robert,
Sampling in Hive is based on buckets. Therefore, you table needs to be
appropriately bucketed.

I would recommend storing the results of your inner query in a bucketed
table. See how to populate a bucketed table at
https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html

Then you will be able to be sample through it.

If the predicate is on partition column, you may be able to get around the
intermediate table requirement but in general, as far as I know,
intermediate bucketed table might be the only choice.

Mark

On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <[EMAIL PROTECTED]> wrote:

> Hi Everyone
>
> I'm trying to use the TABLESAMPLE function to sample data, however it's a
> little more complicated and I am having trouble getting it to run.
>
> I know that this works fine and it will give me about 25% of the whole
> dataset
>
> select distinct s
> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
> where month <= 201211
>
> However, in my situation I need to do a TABLESAMPLE on an outer query, a
> simple example is
>
> *select mytest.s *
> *from *
> * (select distinct s from testtable where month <= 201211)mytest*
>
> or something like
>
> *select table1.s*
> *from *
> * (select distinct s from testtable)table1*
> * join*
> * (select distinct s from test2table)table2*
> * on table1.s=table2.s*
>
>
> How do I use TABLESAMPLE in this case to sample the results of the outer
> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
> places of my query but it always returns some sort of syntax error and thus
> not allowing the query to run.
>
> Any help is appreciated.
>
> Robert
> **
>