MapReduce >> mail # user >> How to select random n records using mapreduce ?


Re: How to select random n records using mapreduce ?
Building on this, you could do something like the following to make it
more random:

if (numRecordsWritten < NUM_RECORDS_DESIRED) {
    int n = generateARandomNumberBetween1and100();
    if (n == 100) {
        context.write(key, value);
    }
}

On average, the above would output 1 record out of every 100, up to the
specified maximum desired, and discard all the rest.

HTH,

DR
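DR's per-record coin flip can be sketched as a small standalone simulation (names like `sample` are illustrative, not from the thread; in a real job the emit would be `context.write(key, value)` inside a Mapper):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simulates the suggested mapper logic outside Hadoop: each record has a
// 1-in-100 chance of being kept, and emission stops once the desired
// number of records has been written.
public class RandomSample {
    static List<String> sample(List<String> records, int desired, long seed) {
        Random rng = new Random(seed);
        List<String> out = new ArrayList<>();
        for (String record : records) {
            if (out.size() < desired) {
                // stands in for generateARandomNumberBetween1and100()
                int n = rng.nextInt(100) + 1;
                if (n == 100) {
                    out.add(record); // stands in for context.write(key, value)
                }
            }
            // remaining records are discarded
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 100000; i++) input.add("record-" + i);
        List<String> picked = sample(input, 50, 42L);
        System.out.println("kept " + picked.size() + " records");
    }
}
```

Note that records near the start of the input have a better chance of being selected than records after the cap is reached, so the sample is biased toward early records rather than uniformly random.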

On 06/27/2011 03:28 PM, Niels Basjes wrote:
> The only solution I can think of is creating a counter in Hadoop
> that is incremented each time a mapper lets a record through.
> As soon as the counter reaches a preselected value, the mappers simply
> discard the additional input they receive.
>
> Note that this will not at all be random.... yet it's the best I can
> come up with right now.
>
> HTH
>
> On Mon, Jun 27, 2011 at 09:11, Jeff Zhang<[EMAIL PROTECTED]>  wrote:
>>
>> Hi all,
>> I'd like to select N random records from a large amount of data using
>> Hadoop; I wonder how I can achieve this. My current idea is to let
>> each mapper task select N / mapper_number records. Does anyone have such
>> experience?
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
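Niels' counter-based cutoff could be sketched as follows; this is a standalone simulation (class and method names are illustrative), with an `AtomicLong` standing in for the shared Hadoop counter. It keeps the first N records seen, which, as Niels notes, is not random at all:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Simulates the counter-based approach: a shared counter is bumped for
// every record a mapper sees; once it exceeds the preselected limit,
// further records are simply discarded.
public class FirstNRecords {
    static List<String> takeFirstN(List<String> records, long limit, AtomicLong counter) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            if (counter.incrementAndGet() <= limit) {
                out.add(record); // the mapper lets the record through
            }
            // otherwise: discard the additional input
        }
        return out;
    }

    public static void main(String[] args) {
        AtomicLong shared = new AtomicLong();
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 10; i++) input.add("record-" + i);
        List<String> kept = takeFirstN(input, 4, shared);
        System.out.println("kept " + kept.size() + " records");
    }
}
```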