I don't know anything about statistics, but in your case, duplicating
splits (x100?) by using a custom InputFormat might be much simpler.
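To illustrate the suggestion, here is a toy sketch of the split-duplication idea in plain Python. `DuplicatingInputFormat`, `get_splits`, and the replicate count are illustrative stand-ins for Hadoop's InputFormat API, not real Hadoop classes:

```python
class DuplicatingInputFormat:
    """Toy stand-in for a custom Hadoop InputFormat whose getSplits()
    returns each underlying split k times, so the same data is read
    k times -- one pass per bootstrap replicate."""

    def __init__(self, base_splits, k=100):
        self.base_splits = base_splits
        self.k = k

    def get_splits(self):
        # Tag each copy with a replicate id so downstream tasks can
        # seed their resampling independently of one another.
        return [(split, replicate)
                for split in self.base_splits
                for replicate in range(self.k)]

# Two underlying splits duplicated 3 times each -> 6 task inputs.
splits = DuplicatingInputFormat(["part-00000", "part-00001"], k=3).get_splits()
```

Each map task would then see one (split, replicate) pair and resample its split independently, which avoids launching 100 separate jobs.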
2013/9/6 Sameer Agarwal <[EMAIL PROTECTED]>
> Hi All,
> In order to support approximate queries in Hive and BlinkDB (
> http://blinkdb.org/), I am working towards implementing the bootstrap
> primitive (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in
> that can help us quantify the "error" incurred by a query Q when it
> operates on a small sample S of data. This method essentially requires
> launching the query Q simultaneously on a large number of samples of the
> original data (typically >=100).
> The downside to this is of course that we have to launch the same query 100
> times, but the upside is that each of these queries would be so small that it
> can be executed on a single machine. So, in order to do this efficiently,
> we would ideally like to execute 100 instances of the query simultaneously
> on the master and all available worker nodes. Furthermore, in order to
> avoid generating the query plan 100 times on the master, we can do one
> of two things:
> 1. Generate the query plan once on the master, serialize it and ship it
> to the worker nodes.
> 2. Enable the worker nodes to access the Metastore so that they can
> generate the query plan on their own in parallel.
> Given that making the query plan serializable (1) would require a lot of
> refactoring of the current code, is (2) a viable option? Moreover, since
> (2) will increase the load on the existing Metastore by 100x, is there any
> other option?
> Sameer Agarwal
> Computer Science | AMP Lab | UC Berkeley
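
For reference, the bootstrap procedure Sameer describes can be sketched in a few lines of toy Python. The aggregate `q` stands in for the query Q, the sample data is illustrative, and the error metric (standard deviation of the replicate estimates) is one common choice among several:

```python
import random
import statistics

def bootstrap_error(sample, q, replicates=100, seed=0):
    """Run the aggregate q on `replicates` resamples drawn with
    replacement from `sample`, and return the standard deviation
    of the replicate results as an error estimate for q(sample)."""
    rng = random.Random(seed)
    estimates = [
        q([rng.choice(sample) for _ in range(len(sample))])
        for _ in range(replicates)
    ]
    return statistics.stdev(estimates)

# Illustrative data: a small sample S of some larger dataset.
sample = [12, 15, 9, 20, 14, 11, 18, 13]
err = bootstrap_error(sample, q=statistics.mean)
```

The 100 calls to `q` here are exactly the 100 query instances discussed above; each is independent, which is why they parallelize trivially across the master and worker nodes.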