Re: Block Sampling
Hi Anand,

This feature was implemented in HIVE-2121 and appeared in Hive 0.8.0.

Ref: https://issues.apache.org/jira/browse/HIVE-2121



On Fri, Jun 15, 2012 at 11:59 AM, Ladda, Anand <[EMAIL PROTECTED]> wrote:

> Has the block sampling feature been added to one of the latest releases
> (Hive 0.8 or Hive 0.9)? The wiki has the blurb below on block sampling:
>
> *Block Sampling*
> It is a feature that is still on trunk and is not yet in any release
> version.
> block_sample: TABLESAMPLE (n PERCENT)
> This allows Hive to pick up at least n% of the data size (note that this
> does not necessarily mean n% of the rows) as input. Only
> CombineHiveInputFormat is supported, and some special compression formats
> are not handled. If sampling fails, the input of the MapReduce job is the
> whole table/partition. Sampling is done at the HDFS block level, so the
> sampling granularity is the block size. For example, if the block size is
> 256MB, even if n% of the input size is only 100MB, you get 256MB of data.
> In the following example, 0.1% or more of the input size will be used for
> the query.
> SELECT *
> FROM source TABLESAMPLE(0.1 PERCENT) s;
> To sample the same data with different blocks, change the seed number:
> set hive.sample.seednumber=<INTEGER>;
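
As an aside on the granularity point in the quoted wiki text: the "at least n%" behaviour can be illustrated with a small Python sketch of the arithmetic. This is only a simplified model under stated assumptions (the target size is rounded up to whole blocks, and all blocks have the same size); it is not Hive's actual implementation, and the function name `min_sampled_bytes` is made up for illustration.

```python
import math

MB = 1024 * 1024


def min_sampled_bytes(input_bytes: int, percent: float, block_bytes: int) -> int:
    """Estimate the least data a block-level sample reads.

    Simplified model (an assumption, not Hive's code): the requested
    n% of the input is rounded up to a whole number of equally sized
    blocks, and the result can never exceed the input itself.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if input_bytes <= 0 or block_bytes <= 0:
        raise ValueError("sizes must be positive")

    # Size the user asked for, e.g. 1% of 10,000MB is 100MB.
    target = input_bytes * percent / 100.0

    # Sampling granularity is one block, so round up to whole blocks.
    blocks = max(1, math.ceil(target / block_bytes))

    # Cap at the full input (small table: everything is read).
    return min(blocks * block_bytes, input_bytes)


if __name__ == "__main__":
    sampled = min_sampled_bytes(10_000 * MB, 1, 256 * MB)
    print(sampled // MB, "MB")
```

With a 10,000MB input, 1 PERCENT and a 256MB block size, the requested 100MB rounds up to one full block, which matches the 256MB figure in the wiki example.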