

Re: Restricting number of records from map output
Hi,

> I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> results to contain only the top n records. Is there any way of restricting
> the number of records that are emitted by the Mappers?
>
> Basically I am looking to see if there is an equivalent of achieving
> the behavior similar to LIMIT in SQL queries.

I think I understand your goal. However, the question is aimed at what
I think is the wrong solution.

Each map() call gets one record as input and only knows about that one
record, and the map tasks run independently of each other. There is no
way to enforce a job-wide limit there.

If you implement a simple reducer, you can very easily let it stop
reading the input iterator after N records and limit the output
that way.
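
A minimal sketch of that idea, in plain Java so it runs without a
cluster. The class name LimitingReducerSketch and the output list are
made up for illustration; in a real job the same loop would sit in the
reduce() method of your Reducer subclass and write to the Context. It
also assumes the job runs with a single reduce task, because with
several reducers each one would emit up to N records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for a reducer; no Hadoop classes are used here.
class LimitingReducerSketch {
    private final int limit;   // N: maximum number of records to emit in total
    private int emitted = 0;   // records emitted so far, across all reduce() calls

    LimitingReducerSketch(int limit) {
        this.limit = limit;
    }

    // Mirrors the shape of a reduce call: one key plus an iterable of its values.
    // The output list is a hypothetical replacement for the Hadoop Context.
    void reduce(String key, Iterable<String> values, List<String> output) {
        for (String value : values) {
            if (emitted >= limit) {
                return; // stop reading the iterator once N records are out
            }
            output.add(key + "\t" + value);
            emitted++;
        }
    }

    public static void main(String[] args) {
        LimitingReducerSketch reducer = new LimitingReducerSketch(3);
        List<String> output = new ArrayList<>();
        // Keys arrive in sorted order, as they would after the shuffle.
        reducer.reduce("a", Arrays.asList("1", "2"), output);
        reducer.reduce("b", Arrays.asList("3", "4"), output);
        reducer.reduce("c", Arrays.asList("5"), output);
        System.out.println(output);
    }
}
```

The counter is a field rather than a local variable because reduce() is
called once per key, and the limit has to hold across all of those calls.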

Doing it in the reducer also allows you to easily add a concept of
"Top N" by using the "Secondary Sort" trick to sort the input before
it arrives at the reducer.
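
To make the "Top N" part concrete, here is a rough in-memory
simulation, not Hadoop code. The names TopNSketch, Rec and
topNPerGroup are hypothetical. The sort call stands in for what the
framework does during the shuffle when you use a composite key
(group first, then score descending); in a real job you would also
need a partitioner and a grouping comparator that look only at the
group part of the key. Once the input is ordered like this, the
reducer just keeps the first N records of each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TopNSketch {
    // Hypothetical record type: a grouping key plus the score to rank by.
    static final class Rec {
        final String group;
        final int score;

        Rec(String group, int score) {
            this.group = group;
            this.score = score;
        }

        @Override
        public String toString() {
            return group + ":" + score;
        }
    }

    // Composite ordering: group ascending, then score descending.
    static final Comparator<Rec> COMPOSITE_ORDER = (a, b) -> {
        int byGroup = a.group.compareTo(b.group);
        return byGroup != 0 ? byGroup : Integer.compare(b.score, a.score);
    };

    static List<String> topNPerGroup(List<Rec> records, int n) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(COMPOSITE_ORDER); // stands in for the sort done during the shuffle

        List<String> output = new ArrayList<>();
        String currentGroup = null;
        int emittedInGroup = 0;
        for (Rec r : sorted) {
            if (!r.group.equals(currentGroup)) {
                currentGroup = r.group; // a new group starts: reset the counter
                emittedInGroup = 0;
            }
            if (emittedInGroup < n) {
                output.add(r.toString()); // the first n of each group are its top n
                emittedInGroup++;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            new Rec("a", 5), new Rec("b", 7), new Rec("a", 9),
            new Rec("a", 1), new Rec("b", 2));
        System.out.println(topNPerGroup(records, 2));
    }
}
```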

HTH

Niels Basjes