MapReduce, mail # user - Best practices for jobs with large Map output


Re: Best practices for jobs with large Map output
Chris Douglas 2011-04-19, 18:47
On Mon, Apr 18, 2011 at 3:42 AM, Shai Erera <[EMAIL PROTECTED]> wrote:
> I ended up doing the following -- my HDFS Mapper creates an index in-memory
> and then serializes the in-memory index into a single file that is stored on
> HDFS (each Mapper serializes to a different file). I use the FileSystem API to
> achieve that, so hopefully that's the right way to do it. The Mapper outputs a Text
> value which is the location on HDFS. The Reducer then interprets that value
> and reads the file using the FileSystem API, and deserializes it into an
> in-memory Lucene index.
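[Editor's note: a minimal sketch of the pattern Shai describes, using the
org.apache.hadoop.mapreduce API. The serializeIndex() helper and the
/tmp/indexes directory are hypothetical stand-ins; only the Mapper and
FileSystem calls are real Hadoop API.]

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexingMapper
    extends Mapper<Object, Text, NullWritable, Text> {

  // map(...) would feed documents into the in-memory index (elided).

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());

    // One file per Mapper; the task attempt ID keeps the names unique.
    Path indexFile = new Path("/tmp/indexes/"  // hypothetical directory
        + context.getTaskAttemptID().toString());

    FSDataOutputStream out = fs.create(indexFile);
    try {
      serializeIndex(out);  // hypothetical: dump the in-memory index
    } finally {
      out.close();
    }

    // Emit only the HDFS location; the Reducer re-opens the file itself.
    context.write(NullWritable.get(), new Text(indexFile.toString()));
  }

  private void serializeIndex(FSDataOutputStream out) throws IOException {
    // Placeholder for serializing the in-memory Lucene index to the stream.
  }
}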

Without knowing the format of a Lucene index, I can't say whether this
approach makes sense. Instead of handling the cleanup yourself, you
might consider running the index generation and the concat as separate
parts of your workflow (as Harsh suggested). -C