Hive >> mail # dev >> Bootstrap in Hive


Sameer Agarwal 2013-09-05, 22:07
Re: Bootstrap in Hive
I don't know anything about statistics, but in your case, duplicating
splits (x100?) with a custom InputFormat might be much simpler.
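
One way to picture the split-duplication idea is the rough Python sketch below. It is only an illustration under stated assumptions, not Hive or Hadoop code: the names `duplicate_splits`, `resample_task` and `bootstrap_averages` are hypothetical, a split is modeled as a `(split_id, rows)` tuple, and the query is assumed to be a simple AVG. Each copy of a split is resampled independently, which approximates (but is not identical to) resampling the whole sample at once.

```python
import random
from collections import defaultdict


def duplicate_splits(splits, copies=100):
    """Emit each input split `copies` times, tagged with a replica id."""
    return [(replica, split_id, rows)
            for replica in range(copies)
            for split_id, rows in splits]


def resample_task(replica, split_id, rows, seed=0):
    """One map task: resample its split with replacement, return (sum, count)."""
    # Distinct, reproducible seed for each (replica, split) pair.
    rng = random.Random(seed * 1_000_003 + replica * 10_007 + split_id)
    drawn = rng.choices(rows, k=len(rows))
    return sum(drawn), len(drawn)


def bootstrap_averages(splits, copies=100):
    """Combine the partial results of each replica into one AVG per replica."""
    totals = defaultdict(lambda: [0.0, 0])
    for replica, split_id, rows in duplicate_splits(splits, copies):
        partial_sum, partial_count = resample_task(replica, split_id, rows)
        totals[replica][0] += partial_sum
        totals[replica][1] += partial_count
    return [total / count for total, count in totals.values()]


if __name__ == "__main__":
    # Hypothetical toy input: two splits of four rows each.
    splits = [(0, [1.0, 2.0, 3.0, 4.0]), (1, [5.0, 6.0, 7.0, 8.0])]
    averages = bootstrap_averages(splits)
    print(len(averages), min(averages), max(averages))
```

In this picture a single job sees 100 times as many splits, so the plan is generated once and the Metastore is not contacted by the workers.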
2013/9/6 Sameer Agarwal <[EMAIL PROTECTED]>

> Hi All,
>
> In order to support approximate queries in Hive and BlinkDB
> (http://blinkdb.org/), I am working towards implementing the bootstrap
> primitive (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in
> Hive that can help us quantify the "error" incurred by a query Q when it
> operates on a small sample S of data. This method essentially requires
> launching the query Q simultaneously on a large number of samples of the
> original data (typically >= 100).
>
> The downside to this is, of course, that we have to launch the same query
> 100 times, but the upside is that each of these queries would be so small
> that it can be executed on a single machine. So, in order to do this
> efficiently, we would ideally like to execute 100 instances of the query
> simultaneously on the master and all available worker nodes. Furthermore,
> in order to avoid generating the query plan 100 times on the master, we
> can do one of two things:
>
>    1. Generate the query plan once on the master, serialize it and ship it
>    to the worker nodes.
>    2. Enable the worker nodes to access the Metastore so that they can
>    generate the query plan on their own in parallel.
>
> Given that making the query plan serializable (1) would require a lot of
> refactoring of the current code, is (2) a viable option? Moreover, since
> (2) will increase the load on the existing Metastore by 100x, is there any
> other option?
>
> Thanks,
> Sameer
>
> --
> Sameer Agarwal
> Computer Science | AMP Lab | UC Berkeley
> http://cs.berkeley.edu/~sameerag
>
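
For readers unfamiliar with the bootstrap primitive discussed in this thread, here is a minimal single-machine sketch in Python. It assumes the query Q is a plain function over a list of rows (an AVG here) and uses 100 resamples, the figure mentioned in the original message. The name `bootstrap_error` and the synthetic data are hypothetical, and nothing here reflects how BlinkDB or Hive implement it.

```python
import random
import statistics


def bootstrap_error(sample, query, num_resamples=100, seed=0):
    """Estimate the standard error of query(sample) by bootstrap resampling."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        # Draw n rows with replacement from the sample S, then rerun Q on them.
        resample = rng.choices(sample, k=n)
        estimates.append(query(resample))
    # The spread of the per-resample answers approximates the error of Q on S.
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Hypothetical sample S: 1000 synthetic rows.
    data_rng = random.Random(42)
    sample = [data_rng.gauss(50.0, 10.0) for _ in range(1000)]
    answer = statistics.mean(sample)
    error = bootstrap_error(sample, statistics.mean)
    print(f"AVG = {answer:.2f} +/- {error:.2f}")
```

Each loop iteration is independent of the others, which is why the original message proposes running the 100 query instances simultaneously across the cluster.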