Re: Merge sorting reduce output files
Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans <[EMAIL PROTECTED]> wrote:

>  I am not sure I can help with that unless I know better what “a special
> distribution” means.
>

The thing is that this application is an "Auto Complete" feature whose
key is "the letters that have been typed so far".
Now for several reasons we need this to be sorted by length of the input.
So the '1 letter suggestions' first, then the '2 letter suggestions' etc.
I've been trying to come up with an automatic partitioning that would split
the dataset into something like 30 parts that when concatenated do what you
suggest.

> Unless you are doing a massive amount of processing in your reducer having
> a partition that is only close to balancing the distribution is a big win
> over all of the other options that put the data on a single machine and
> sort it there.  Even if you are doing a lot of processing in the reducer,
> or you need a special grouping to make the reduce work properly having a
> second map/reduce job to sort the data that is just close to balancing I
> would suspect would beat out all of the other options.
>

Thanks, this is a useful suggestion. I'll see if there is a pattern in the
data and from there simply manually define the partitions based on the
pattern we find.
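
A minimal sketch of what such manually defined partitions could look like,
assuming the pattern turns out to be a set of key-length ranges. The class
name, the boundary values and the plain String key are all hypothetical; in
a real job this logic would sit inside an override of Hadoop's
Partitioner.getPartition(key, value, numPartitions), which is left out here
so the example runs without Hadoop on the classpath.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch: maps a key to a partition by its length,
// so that every key in partition i is no longer than any key in partition i+1.
class LengthRangePartitioner {
    // Assumed, hand-picked inclusive upper bounds on key length, one per
    // partition; keys longer than the last bound go to one extra partition.
    private final int[] upperBounds;

    LengthRangePartitioner(int[] upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
    }

    // Mirrors the shape of Partitioner.getPartition, minus the value argument.
    int getPartition(String key, int numPartitions) {
        int len = key.length(); // counts UTF-16 code units, not code points
        int idx = 0;
        while (idx < upperBounds.length && len > upperBounds[idx]) {
            idx++;
        }
        // Clamp in case the job runs with fewer reducers than ranges.
        return Math.min(idx, numPartitions - 1);
    }

    public static void main(String[] args) {
        LengthRangePartitioner p =
                new LengthRangePartitioner(new int[] {1, 2, 3, 5, 8});
        for (String key : new String[] {"a", "ab", "abcd", "abcdefghij"}) {
            System.out.println(key + " -> partition " + p.getPartition(key, 6));
        }
    }
}
```

The partitioner only decides which reducer a key goes to. For the
concatenated output files to come out ordered by length, the job's key sort
order would presumably also have to compare by length first, which this
sketch does not cover.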

--
Best regards / Met vriendelijke groeten,

Niels Basjes