MapReduce, mail # user - Single JVM, many tasks: How do I know when I'm on the last map task

Single JVM, many tasks: How do I know when I'm on the last map task
Saptarshi Guha 2013-02-22, 20:17

In my Java Hadoop job, I have set the JVM reuse variable to -1,
so a single JVM will process multiple tasks.

I have also arranged that, instead of being written to the job context,
the keys and values are accumulated in a hashtable.
When the bytes written to this table reach BUFSIZE (e.g. 150MB),
I call my reducer (what some call a combiner) inside the map task.
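
To make the buffering concrete, here is a minimal sketch in plain Java with no
Hadoop dependency. The class name CombiningBuffer, the summing combiner, and
the size estimate (key length plus 8 bytes per value) are all assumptions made
for illustration; they are not taken from the actual job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: accumulate key/value pairs in a table and run a
// combiner over the table once a byte threshold (BUFSIZE) is reached.
public class CombiningBuffer {
    private final Map<String, List<Long>> table = new HashMap<String, List<Long>>();
    private final long bufSize;      // threshold in (estimated) bytes
    private long bytesBuffered = 0;  // rough estimate, not an exact heap size
    private int combineCalls = 0;

    public CombiningBuffer(long bufSize) {
        this.bufSize = bufSize;
    }

    // Called in place of writing to the job context.
    public void add(String key, long value) {
        List<Long> values = table.get(key);
        if (values == null) {
            values = new ArrayList<Long>();
            table.put(key, values);
        }
        values.add(value);
        bytesBuffered += key.length() + 8; // assumed cost of one entry
        if (bytesBuffered >= bufSize) {
            combine();
        }
    }

    // The "reducer inside the map task": here it sums the values of each key
    // and writes the single result back into the same table.
    public void combine() {
        long bytes = 0;
        for (Map.Entry<String, List<Long>> e : table.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) {
                sum += v;
            }
            List<Long> combined = new ArrayList<Long>();
            combined.add(sum);
            e.setValue(combined);
            bytes += e.getKey().length() + 8;
        }
        bytesBuffered = bytes;
        combineCalls++;
    }

    public List<Long> valuesFor(String key) {
        return table.get(key);
    }

    public int combineCalls() {
        return combineCalls;
    }
}
```

In this sketch, if the threshold is never reached, combine() is never called
automatically and has to be invoked explicitly.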

However, if BUFSIZE is never reached, my reducer is never called,
so I have to flush the table. I could do this in the map class's
'cleanup' method; in that case, the data would be rewritten to the
same hashtable.

But at some point this hashtable must be written to the job context
and passed on to the Hadoop reduce stage. As I see it, if I intend to share
this hashtable across map tasks (within the same JVM), I need to know
when the JVM has reached its final map task. When that task completes,
I know I *must* flush the table to the job context.
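
The sharing problem can be sketched in plain Java as follows. Everything here
is hypothetical: SharedTableSketch, SimulatedMapTask, and the EMITTED list
standing in for the job context are illustrative names, and the isLastTask
flag is exactly the piece of information I do not know how to obtain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a static table shared by all map tasks that run in
// the same reused JVM, flushed only when the last task finishes.
public class SharedTableSketch {
    // Static state survives from one task to the next inside a reused JVM.
    static final Map<String, Long> SHARED = new TreeMap<String, Long>();
    // Stand-in for the job context: what was handed to the reduce stage.
    static final List<String> EMITTED = new ArrayList<String>();

    static class SimulatedMapTask {
        // Accumulate into the shared table instead of emitting directly.
        void map(String key, long value) {
            Long old = SHARED.get(key);
            SHARED.put(key, old == null ? value : old + value);
        }

        // isLastTask is an assumption: the sketch simply takes it as a
        // parameter because no source for it is known here.
        void cleanup(boolean isLastTask) {
            if (!isLastTask) {
                return; // keep the data for the next task in this JVM
            }
            for (Map.Entry<String, Long> e : SHARED.entrySet()) {
                EMITTED.add(e.getKey() + "=" + e.getValue());
            }
            SHARED.clear();
        }
    }
}
```

If the final cleanup call never sees isLastTask set to true, whatever is still
in the shared table is lost, which is why the flush is mandatory.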

Hopefully I've been somewhat clear. Does Hadoop 0.20.2 have an API
that tells the child JVM whether it is on its last map task?