MapReduce >> mail # user >> Newbie - question - how do I use Hadoop to sort a very large file


Newbie - question - how do I use Hadoop to sort a very large file
Assume I have a large file called *BigData.unsorted* (say 500 GB)
consisting of lines of text in random order. I understand how to assign a
key to each line, and that Hadoop will pass the lines to my reducers in
order of that key.

Now assume I want a single file called *BigData.sorted*  with the lines in
the order of the keys.

I think I understand how to get files part-00000, part-00001, ..., but not:
1) How I get just the lines from the reducer, not the keys
2) How I make the reducer generate a file with the name I want, *BigData.sorted*
3) How, without using a single reducer instance, I get a single output file,
or whether a single reducer is the right choice for this task.
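For (1), one common pattern (a hedged sketch on my part, not something confirmed in this thread; `LineOnlyReducer` is a made-up name, and the Hadoop MapReduce client libraries must be on the classpath) is to emit the line as the output key with a `NullWritable` value, since `TextOutputFormat` then writes the line alone, with no key-value separator:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (sortKey, line) pairs, delivered in key order.
// Output: each line by itself; TextOutputFormat omits the tab separator
// and the NullWritable half of the pair, so only the line text is written.
public class LineOnlyReducer extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> lines, Context context)
      throws IOException, InterruptedException {
    for (Text line : lines) {
      context.write(line, NullWritable.get());
    }
  }
}
```

For (3), the usual route (again an assumption, not from this thread) is `TotalOrderPartitioner` with sampled split points, which makes the part files globally ordered so that `hadoop fs -getmerge <outdir> BigData.sorted` concatenates them into one correctly sorted local file; a single reducer also works but pushes the entire sort through one machine. For (2), MapReduce fixes the part-file names, so the result is normally renamed after the job finishes.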
Also, it would be very nice if the output of the reducer were compressed,
say *BigData.sorted.gz*.
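A hedged sketch of one way to get gzipped output without code changes (the property names are the Hadoop 2.x ones; older releases used `mapred.output.compress` and `mapred.output.compression.codec`; `sortjob.jar` and `SortDriver` are placeholder names):

```shell
# Enable gzip-compressed reducer output at submit time.
# Assumes the driver uses ToolRunner/GenericOptionsParser so -D options apply.
hadoop jar sortjob.jar SortDriver \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  /input/BigData.unsorted /output
```

Since concatenated gzip streams form a valid gzip file, running `hadoop fs -getmerge` over the compressed part files should still yield a single usable *BigData.sorted.gz*.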
Any suggestions?
--
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA