Accumulo, mail # dev - Re: Accumulo iterator to return a random sample of a percentile of a table - 2014-02-05, 11:43
Re: Accumulo iterator to return a random sample of a percentile of a table
If it's for the input to some algorithm (machine learning, etc.) I'm
assuming it *is* important to have that 25% be representative of the entire
population.

HBase implements a simple strategy with a RandomRowFilter [1] that could
trivially be adapted to an Accumulo filter (iterator). The caveat is that
it's essentially a full table scan each time: you set a percentage, and
then randomly choose whether each key is accepted or not. Note that if each
of your "values" (i.e. the granularity at which you want to accept or
reject groups) spans more than one key/value pair, you will want to use
something like the WholeRowIterator to aggregate them, then test for
accept/reject. You probably don't want to use the WholeRowIterator as is,
since you would want to test/reject on the full key and only aggregate if
it passes, but you can use it as a pattern.
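
To make the accept/reject idea concrete, here is a minimal, self-contained
sketch in plain Java. It is not the HBase filter and not an Accumulo
iterator; the class name RandomRowSampler and its methods are hypothetical.
It assumes entries arrive sorted by row (as they do in a scan), draws one
random decision per row, and reuses that decision for every key/value in
the row so that a group is kept or dropped as a whole. In a real Accumulo
Filter subclass, similar logic would presumably sit in accept(Key, Value).

```java
import java.util.Random;

/** Hypothetical helper: one Bernoulli accept/reject decision per row. */
final class RandomRowSampler {
    private final double chance;          // fraction of rows to keep, e.g. 0.25
    private final Random random;
    private String currentRow = null;     // row the last decision was made for
    private boolean currentRowAccepted = false;

    RandomRowSampler(double chance, long seed) {
        if (chance < 0.0 || chance > 1.0) {
            throw new IllegalArgumentException("chance must be in [0, 1]");
        }
        this.chance = chance;
        this.random = new Random(seed);
    }

    /**
     * Returns whether an entry in the given row should be kept.
     * Assumes all entries of a row are presented consecutively.
     */
    boolean accept(String row) {
        if (!row.equals(currentRow)) {
            // New row: draw once, then stick with the result for the whole row.
            currentRow = row;
            currentRowAccepted = random.nextDouble() < chance;
        }
        return currentRowAccepted;
    }

    public static void main(String[] args) {
        RandomRowSampler sampler = new RandomRowSampler(0.25, 42L);
        // Simulated sorted scan: three entries (columns) per row.
        for (int r = 0; r < 8; r++) {
            String row = "row" + r;
            for (String column : new String[] {"a", "b", "c"}) {
                if (sampler.accept(row)) {
                    System.out.println(row + ":" + column);
                }
            }
        }
    }
}
```

Every row is still read, so this remains a full scan; only the output is
thinned. The fixed seed is there to make the sketch reproducible, and a
real sample would likely want a different seed per scan.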

If you want something faster, then I think you are going to have to
generate and keep some population statistics / summaries on ingest, and
query those. This will add more sampling error based on the granularity of
your summaries, but you should be able to quantify that with standard error
propagation.
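
As one possible reading of "summaries on ingest", the sketch below keeps a
running count, mean and variance per bucket (Welford's online update) and
merges buckets at query time. The class RunningSummary and the choice of
mean/variance as the summary are assumptions made for illustration, not
something prescribed above; the standard error of the mean is shown as one
quantity you could carry through error propagation.

```java
/** Hypothetical mergeable summary, updated once per ingested value. */
final class RunningSummary {
    private long count;
    private double mean;
    private double m2;   // sum of squared deviations from the current mean

    /** Welford's online update for a single new value. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Folds another summary (e.g. another bucket) into this one. */
    void merge(RunningSummary other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Unbiased sample variance; 0 when fewer than two values were seen. */
    double sampleVariance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    /** Standard error of the mean; NaN when fewer than two values were seen. */
    double standardErrorOfMean() {
        return count > 1 ? Math.sqrt(sampleVariance() / count) : Double.NaN;
    }

    public static void main(String[] args) {
        RunningSummary bucketA = new RunningSummary();
        RunningSummary bucketB = new RunningSummary();
        for (double x : new double[] {1, 2, 3, 4}) bucketA.add(x);
        for (double x : new double[] {5, 6, 7, 8}) bucketB.add(x);
        bucketA.merge(bucketB);
        System.out.println("n=" + bucketA.count()
                + " mean=" + bucketA.mean()
                + " se=" + bucketA.standardErrorOfMean());
    }
}
```

Coarser buckets mean fewer summaries to store and query, at the cost of
losing detail inside each bucket, which is the granularity trade-off
mentioned above.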

[1] https://github.com/apache/hbase/blob/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/RandomRowFilter.java

On Tue, Feb 4, 2014 at 10:39 PM, cprigano <[EMAIL PROTECTED]> wrote:
 