MapReduce >> mail # user >> Merge sorting reduce output files


Re: Merge sorting reduce output files
Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <[EMAIL PROTECTED]> wrote:

>  I would recommend that you do what terasort does and use a different
> partitioner, to ensure that all keys within a given range will go to a
> single reducer.  If your partitioner is set up correctly then all you have
> to do is concatenate the files together, if you even need to do that.
>
> Look at TotalOrderPartitioner.  It should do what you want.
>
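[The concatenation point above can be illustrated with local stand-in files; no cluster is involved, and the file contents and range split here are made up for the demonstration. If each reducer receives a disjoint, ordered key range, its sorted output files concatenate into one globally sorted file:]

```shell
# Stand-in per-reducer outputs, each internally sorted, with disjoint
# ordered key ranges (reducer 0 gets keys before "c", reducer 1 the rest).
printf 'apple\nbanana\n' > part-00000
printf 'cherry\ndate\n'  > part-00001
# Plain concatenation in reducer order yields a globally sorted file.
cat part-00000 part-00001 > merged.txt
# sort -c exits 0 only if the file is already sorted.
sort -c merged.txt && echo "merged.txt is globally sorted"
```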

I know about that partitioner.
The trouble I have is coming up with a partitioning that "evenly" balances
the data for this specific problem.
Taking a sample and basing the partitioning on that (as terasort does)
wouldn't help: the data has a special distribution...
Niels Basjes

>
> --Bobby Evans
>
>
> On 2/28/12 2:10 PM, "Niels Basjes" <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> We have a job that outputs a set of files that are several hundred MB of
> text each.
> Using the comparators and such we can produce output files that are each
> internally sorted.
>
> What we want is to have one giant output file (outside of the cluster)
> that is sorted.
>
> Now we see the following options:
> 1) Run the last job with 1 reducer. This is not really an option because
> that would push a significant part of the processing time through 1 CPU
> (it would take too long).
> 2) Create an additional job that sorts the existing files and has 1
> reducer.
> 3) Download all of the files and run the standard command-line tool "sort
> -m".
> 4) Install HDFS FUSE and run the standard command-line tool "sort -m".
> 5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort
> -m" in one go.
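[A sketch of option 3 with local stand-in files: "sort -m" performs a pure merge of inputs that are each already sorted, without re-sorting them. The file contents are invented for the example; the option-5 variant in the trailing comment assumes bash process substitution and a working hadoop client, and is untested here.]

```shell
# Stand-in sorted reducer outputs (interleaved key ranges, as produced
# by the default hash partitioner).
printf 'a\nc\ne\n' > part-00000
printf 'b\nd\nf\n' > part-00001
# Merge the already-sorted files into one globally sorted file.
sort -m part-00000 part-00001 > giant-sorted.txt
# Option 5 in one go, without downloading first (hypothetical paths;
# requires bash process substitution and a hadoop client):
#   sort -m <(hadoop fs -text out/part-00000) \
#           <(hadoop fs -text out/part-00001) > giant-sorted.txt
```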
>
> During our discussion we were wondering: What is the best way of doing
> this?
> What do you recommend?
>
>
--
Best regards / Met vriendelijke groeten,

Niels Basjes