Hadoop, mail # user - Sorting ...

Re: Sorting ...
Luca Pireddu 2011-05-26, 07:45
On May 25, 2011 22:15:50 Mark question wrote:
> I'm using SequenceFileInputFormat, but then what to write in my mappers?
>
>   each mapper is taking a split from the SequenceInputFile then sort its
> split ?! I don't want that..
>
> Thanks,
> Mark
>
> On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu <[EMAIL PROTECTED]> wrote:
> > On May 25, 2011 01:43:22 Mark question wrote:
> > > Thanks Luca, but what other way to sort a directory of sequence files?
> > >
> > > I don't plan to write a sorting algorithm in mappers/reducers, but
> > > hoping to use the sequenceFile.sorter instead.
> > >
> > > Any ideas?
> > >
> > > Mark
> >
If you want to achieve a global sort, then look at how TeraSort does it:

http://sortbenchmark.org/YahooHadoop.pdf

The idea is to partition the data so that all keys in part[i] are < all keys
in part[i+1].  Each partition is individually sorted, so to read the data in
globally sorted order you simply traverse the partitions in order, starting
from the first one and working your way to the last.
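
To make the partitioning idea concrete, here is a rough sketch in plain Python
(not Hadoop code). It samples the keys to pick split points, sends each key to
the partition whose range contains it, and sorts each partition on its own.
The function names and the sample_size and seed parameters are made up for
illustration; in a real job the partitioner would do the routing and the
framework's shuffle would do the per-partition sort.

```python
import bisect
import random


def choose_split_points(sample_keys, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the keys."""
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the partition index for key.

    bisect_right sends a key equal to a boundary to the higher partition,
    so every key in part[i] is strictly less than every key in part[i+1].
    """
    return bisect.bisect_right(split_points, key)


def global_sort(keys, num_partitions, sample_size=100, seed=0):
    """Range-partition keys, then sort each partition independently."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    sample = rng.sample(keys, min(sample_size, len(keys)))
    splits = choose_split_points(sample, num_partitions)

    parts = [[] for _ in range(num_partitions)]
    for key in keys:
        parts[partition_for(key, splits)].append(key)

    # Each partition is sorted on its own; no merge step is needed.
    return [sorted(part) for part in parts]


if __name__ == "__main__":
    source = random.Random(42)
    data = [source.randrange(1000) for _ in range(20)]
    for i, part in enumerate(global_sort(data, num_partitions=4)):
        print("part[%d]: %s" % (i, part))
```

Reading the returned partitions one after another gives the keys in globally
sorted order. How evenly the partitions fill up depends on how well the sample
represents the data, which is why the split points come from a sample rather
than from fixed ranges.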

If your keys are already what you want to sort by, then you don't even need a
mapper (just use the default identity map).

--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452