On May 25, 2011 22:15:50 Mark question wrote:
> I'm using SequenceFileInputFormat, but then what do I write in my mappers?
> Each mapper takes a split from the SequenceInputFile and then sorts its own
> split?! I don't want that.
> On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu <[EMAIL PROTECTED]> wrote:
> > On May 25, 2011 01:43:22 Mark question wrote:
> > > Thanks Luca, but what other way is there to sort a directory of sequence files?
> > >
> > > I don't plan to write a sorting algorithm in mappers/reducers; I'm
> > > hoping to use SequenceFile.Sorter instead.
> > >
> > > Any ideas?
> > >
> > > Mark
If you want to achieve a global sort, then look at how TeraSort does it:
The idea is to partition the data so that all keys in part[i] are < all keys
in part[i+1]. Each partition is individually sorted, so to read the data in
globally sorted order you simply have to traverse it starting from the first
partition and working your way to the last one.
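
To make the partitioning idea concrete, here is a minimal plain-Java sketch. It
is not TeraSort's code and it does not use the Hadoop API: the class
RangePartitionSort and all of its method names are made up for illustration,
the keys are plain integers, and the split points are chosen from the full key
list rather than from a sample, which a real job would not do. It assumes
numPartitions is at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of range partitioning for a global sort.
public class RangePartitionSort {

    // Pick up to (numPartitions - 1) split points from a sorted copy of the sample.
    static List<Integer> chooseSplitPoints(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits; // no split points: everything lands in partition 0
        }
        for (int i = 1; i < numPartitions; i++) {
            splits.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return splits;
    }

    // A key goes to partition p, where p is the number of split points <= key.
    // Every key in partition p is therefore < every key in partition p + 1.
    static int partitionFor(int key, List<Integer> splits) {
        int p = 0;
        while (p < splits.size() && key >= splits.get(p)) {
            p++;
        }
        return p;
    }

    // Route each key to its partition, then sort each partition on its own.
    static List<List<Integer>> partitionAndSort(List<Integer> keys, int numPartitions) {
        // Simplification: the whole key list is used as the "sample".
        List<Integer> splits = chooseSplitPoints(keys, numPartitions);
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(partitionFor(key, splits)).add(key);
        }
        for (List<Integer> part : parts) {
            Collections.sort(part);
        }
        return parts;
    }

    // Reading the partitions first to last yields the globally sorted order.
    static List<Integer> readInOrder(List<List<Integer>> parts) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(42, 7, 19, 3, 88, 51, 23, 64, 11);
        List<List<Integer>> parts = partitionAndSort(keys, 3);
        System.out.println("partitions: " + parts);
        System.out.println("in order:   " + readInOrder(parts));
    }
}
```

In a MapReduce job, partitionFor corresponds to the partitioner and each inner
list corresponds to one reducer's output file; the per-partition sort is what
the framework's shuffle already does for you.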
If your keys are already what you want to sort by, then you don't even need a
mapper (just use the default identity map).
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel: +39 0709250452