Accumulo, mail # dev - Accumulo iterator to return a random sample of a percentile of a table


Re: Accumulo iterator to return a random sample of a percentile of a table
Chris Bennight 2014-02-05, 11:43
If it's for the input to some algorithm (machine learning, etc.) I'm
assuming it *is* important to have that 25% be representative of the entire
population.

HBase implements a simple strategy with its RandomRowFilter [1] that could
trivially be adapted to an Accumulo filter (iterator).  The caveat is that
it's going to be essentially a full table scan each time: you set a
percentage, and the filter randomly chooses whether each key is accepted.
Note that if each of your "values" (i.e. the granularity at which you want
to accept or reject) spans more than one key/value pair, you will want to
use something like the WholeRowIterator to aggregate them before testing
for accept/reject.  You probably don't want to use the WholeRowIterator as
is, since you would want to test/reject on the key first and aggregate only
if it passes, but you can use it as a pattern.
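
To make the accept/reject idea concrete, here is a minimal sketch in plain Java. It does not use the Accumulo or HBase APIs: the RowSampler and Entry names are made up for illustration, and entries are assumed to arrive sorted by row, as they would in a scan. The random draw happens once per row, so every key/value of a row shares the same decision and no partial rows leak into the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for a sampling filter; not part of Accumulo or HBase.
class RowSampler {

    /** One key/value entry. Only the row ID drives the sampling decision. */
    static final class Entry {
        final String row;
        final String column;
        final String value;

        Entry(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    /**
     * Keeps each row with probability {@code fraction}.
     * Assumes {@code sortedEntries} is ordered so that all entries of a row are adjacent.
     */
    static List<Entry> sampleRows(List<Entry> sortedEntries, double fraction, Random rng) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        List<Entry> sample = new ArrayList<>();
        String currentRow = null;
        boolean acceptRow = false;
        for (Entry e : sortedEntries) {
            if (!e.row.equals(currentRow)) {
                // New row: draw once and reuse the decision for the whole row.
                currentRow = e.row;
                acceptRow = rng.nextDouble() < fraction;
            }
            if (acceptRow) {
                sample.add(e);
            }
        }
        return sample;
    }
}
```

Calling RowSampler.sampleRows(entries, 0.25, new Random()) keeps roughly a quarter of the rows; the exact count varies from run to run because each row is an independent coin flip. In a real iterator the same decision would presumably live in something like the accept(Key, Value) method of an Accumulo Filter subclass, and it still touches every key, which is the full-scan caveat above.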

If you want something faster, then I think you are going to have to
generate and keep some population statistics / summaries on ingest, and
query those.  This will add more sampling error, depending on the
granularity of your summaries, but you should be able to quantify that
with standard error propagation.
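
As one possible shape for such a summary (an assumption on my part, nothing Accumulo-specific), the sketch below keeps a running count, mean and variance per bucket using Welford's update, and merges two buckets with the usual pairwise formula. The RunningSummary class name and the choice of statistics are hypothetical; the point is only that the merged summary also gives you a standard error to carry forward.

```java
// Hypothetical per-bucket summary maintained at ingest time; not an Accumulo class.
class RunningSummary {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Folds one observed value into the summary (Welford's online update). */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Combines another bucket's summary into this one without revisiting raw data. */
    void merge(RunningSummary other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * count * other.count / total;
        mean += delta * other.count / total;
        count = total;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance; 0 when fewer than two values have been seen. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    /** Standard error of the mean; NaN for an empty summary. */
    double standardError() {
        return count > 0 ? Math.sqrt(variance() / count) : Double.NaN;
    }
}
```

A query would then merge the summaries of the buckets it covers instead of scanning the underlying rows. Coarser buckets mean fewer summaries to read but less resolution, which is the granularity trade-off mentioned above.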
[1]
https://github.com/apache/hbase/blob/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/RandomRowFilter.java
On Tue, Feb 4, 2014 at 10:39 PM, cprigano <[EMAIL PROTECTED]> wrote:

> Good questions all! To start, I am just trying to take a percentile of rows
> in a table, in order to construct training, cross-validation
> and testing sets. I am a machine learning person and want to be able to take,
> say, a 25% random sample of rows in a table (I may not know the size, and
> the percentile should be settable). Starting with the easiest assumption,
> that all rows are the same "type", will get things started. I can then move
> to more exotic scenarios. Accumulo is a new nut for me to crack and I would
> very much like your thoughts. Thanks mate!
>
>
> On Tue, Feb 4, 2014 at 7:27 PM, Chris Bennight [via Apache Accumulo] <
> ml-node+[EMAIL PROTECTED]> wrote:
>
> > I'm assuming you want a random selection of entries in Accumulo - so say a
> > random selection of keys/values?
> >
> > How are your keys formatted (conceptually is fine); is there some sort of
> > regularity to them?  (I.e. can you calculate ahead of time a random
> > distribution of keys without validating which keys are present)?
> >
> > If you can't calculate the key distribution ahead of time, are you keeping
> > any statistics (or could you) on ingest (cardinality, distribution, etc.)?
> > And finally, how rigorous and performant do you need this random sampling
> > to be?  Do you just want representative data, or are you trying to do
> > something like BlinkDB [1] (allow people to specify confidence intervals on
> > queries, and only sample enough data to meet the requisite uncertainty
> > requirements)?
> >
> > [1] http://blinkdb.org/
> >
> > Chris
> >
> >
> >
> >
> > On Sat, Feb 1, 2014 at 3:58 PM, cprigano <[hidden email]> wrote:
> >
> > > I am looking at writing an Accumulo iterator to return a random sample of
> > > a percentile of a table.
> > >
> > > I would appreciate any suggestions.
> > >
> > > Thanks,
> > >
> > > Chris