Newbie question - how do I use Hadoop to sort a very large file
Steve Lewis 2010-06-23, 17:15
Assume I have a large file called *BigData.unsorted* (say 500 GB) consisting
of lines of text, and assume that these lines are in random order. I
understand how to assign a key to each line and that Hadoop will pass the
lines to my reducers in order of that key.
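
For context, here is roughly what I have in mind on the map side. This is
just a sketch assuming the new org.apache.hadoop.mapreduce API, and
extractSortKey() is a placeholder for however I actually derive the key from
a line:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // emit (sortKey, originalLine); Hadoop sorts by key before the reduce
        context.write(new Text(extractSortKey(line)), line);
    }

    // placeholder helper - in the simplest case the key is the line itself
    private String extractSortKey(Text line) {
        return line.toString();
    }
}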

Now assume I want a single file called *BigData.sorted*  with the lines in
the order of the keys.

I think I understand how to get output files like part-00000, part-00001,
..., but not:
1) how I get just the lines from the reducer, not the keys (see the reducer
sketch below),
2) how I make the reducer generate a file with the name that I want,
"BigData.sorted", and
3) how I get a single output file without using a single reducer instance,
or whether a single reducer is the right choice for this task.
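
For (1), this is the kind of reducer I am picturing, again just a sketch
under the same new-API assumption. My understanding is that TextOutputFormat
writes only the value when the key is NullWritable, but I may be wrong about
that:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        // write the original lines only; the NullWritable key should keep
        // TextOutputFormat from prepending "key<TAB>" to each output line
        for (Text line : lines) {
            context.write(NullWritable.get(), line);
        }
    }
}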
Also it would be very nice if the output of the reducer were compressed -
say BigData.sorted.gz.
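
I am guessing that is something along these lines in the driver (a sketch
only, not tested; SortDriver and configureCompression are names I made up):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {
    public static void configureCompression(Job job) {
        // gzip-compress the reducer output, so the part files come out as .gz
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}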
Any suggestions?
--
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA