-Re: Merge sorting reduce output files
Niels Basjes 2012-02-28, 21:46
On Tue, Feb 28, 2012 at 21:41, Robert Evans <[EMAIL PROTECTED]> wrote:
> I would recommend that you do what terrasort does and use a different
> partitioner, to ensure that all keys within a given range will go to a
> single reducer. If your partitioner is set up correctly then all you have
> to do is to concatenate the files together, if you even need to do that.
> Look at TotalOrderPartitioner. It should do what you want.
I know about that partitioner.
The trouble I have is comming up with a partitioning that "evenly" balances
the data for this specific problem.
Taking a sample and base the partitioning on that (like the one used in
terrasort) wouldn't help.
The data has a special distribution...
> --Bobby Evans
> On 2/28/12 2:10 PM, "Niels Basjes" <[EMAIL PROTECTED]> wrote:
> We have a job that outputs a set of files that are several hundred MB of
> text each.
> Using the comparators and such we can produce output files that are each
> sorted by themselves.
> What we want is to have one giant outputfile (outside of the cluster) that
> is sorted.
> Now we see the following options:
> 1) Run the last job with 1 reducer. This is not really an option because
> that would put a significant part of the processing time through 1 cpu
> (this would take too long).
> 2) Create an additional job that sorts the existing files and has 1
> 3) Download all of the files and run the standard commandline tool "sort
> 4) Install HDFS fuse and run the standard commandline tool "sort -m"
> 5) Create an hadoop specific tool that can do "hadoop fs -text" and "sort
> -m" in one go.
> During our discussion we were wondering: What is the best way of doing
> What do you recommend?
Best regards / Met vriendelijke groeten,