Hadoop, mail # general - Restricting number of records from map output


Re: Restricting number of records from map output
Niels Basjes 2011-01-14, 16:46
Hi,

> I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> results to contain only the top n records. Is there any way of restricting
> the number of records that are emitted by the Mappers?
>
> Basically I am looking to see if there is an equivalent of achieving
> the behavior similar to LIMIT in SQL queries.

I think I understand your goal. However, the question aims at what I
think is the wrong solution.

A mapper gets one record at a time as input and only knows about that
one record, so there is no way to enforce an overall limit there.

If you implement a simple reducer you can very easily let it stop
reading the input iterator after N records and limit the output in
that way.
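
A minimal sketch of that idea, written as plain Java so it runs without
a cluster. It does not use any Hadoop classes: the class name LimitSketch
and the method firstN are made up for illustration, and the Iterator
stands in for the values a reducer would be handed. In a real job the
count would presumably have to be kept across reduce() calls, and a
global limit would need a single reducer; both are assumptions beyond
what is described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the body of a reducer; no Hadoop API involved.
class LimitSketch {
    // Copies at most n records from the input iterator, then stops reading.
    static <T> List<T> firstN(Iterator<T> input, int n) {
        List<T> out = new ArrayList<>();
        // Check the limit first so no extra record is pulled from the input.
        while (out.size() < n && input.hasNext()) {
            out.add(input.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList("a", "b", "c", "d").iterator();
        System.out.println(firstN(values, 2)); // prints [a, b]
    }
}
```

Records beyond the limit are never read from the iterator, so the cost
of the loop is bounded by N rather than by the input size.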

Doing it in the reducer also allows you to easily add a concept of
"Top N" by using the "Secondary Sort" trick to sort the input before
it arrives at the reducer.
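
The "Top N" variant can be illustrated the same way, again in plain Java
rather than with the Hadoop API. In Hadoop a secondary sort is usually
set up with a composite key plus separate sort and grouping comparators;
the sketch below only imitates that ordering in memory. The names
TopNSketch, Rec, group and score are hypothetical and chosen for this
example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory imitation of "secondary sort, then take the first N".
class TopNSketch {
    // Illustrative record: a grouping key plus the value to rank by.
    static final class Rec {
        final String group;
        final int score;
        Rec(String group, int score) { this.group = group; this.score = score; }
    }

    // Returns the n highest scores for each group.
    static Map<String, List<Integer>> topN(List<Rec> records, int n) {
        List<Rec> sorted = new ArrayList<>(records);
        // Stand-in for the sort comparator: by group, then by score descending.
        sorted.sort((a, b) -> {
            int byGroup = a.group.compareTo(b.group);
            return byGroup != 0 ? byGroup : Integer.compare(b.score, a.score);
        });
        // Stand-in for grouping on the group key only: the values of each
        // group now arrive best-first, so keeping the first n is enough.
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (Rec r : sorted) {
            List<Integer> top = out.computeIfAbsent(r.group, k -> new ArrayList<>());
            if (top.size() < n) {
                top.add(r.score);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
                new Rec("x", 3), new Rec("y", 7), new Rec("x", 9),
                new Rec("x", 5), new Rec("y", 1));
        System.out.println(topN(records, 2)); // prints {x=[9, 5], y=[7, 1]}
    }
}
```

Because the ordering is done before the values are consumed, the
consuming loop needs no sorting logic of its own and can simply stop
after N values per group.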

HTH

Niels Basjes