MapReduce >> mail # user >> OutputFormat and Reduce Task


Re: OutputFormat and Reduce Task
Hi Dhruv,

Inline.

On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <[EMAIL PROTECTED]> wrote:
> I'm trying to optimize the performance of my OutputFormat's implementation.
> I'm doing things similar to HBase's TableOutputFormat--sending the reducer's
> output to a distributed k-v store. So, the context.write() call basically
> winds up doing a Put() on the store.
>
> Although I haven't profiled, a sequence of thread dumps on the reduce tasks
> reveals that the threads are RUNNABLE and hanging out in the put() and its
> subsequent method calls. So, I proceeded to decouple these two by
> implementing the producer (context.write()) consumer (RecordWriter.write())
> pattern using ExecutorService.

With HBase involved, this is only partly correct. The HTable API,
which the regular TableOutputFormat uses, provides an "AutoFlush" option.
When it is disabled, the client buffers writes to the regionservers
instead of flushing Puts/Deletes on every single invocation.

TableOutputFormat disables AutoFlush by default to get this buffering
behavior.

Read more on that at
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
and/or in Lars' book, "HBase: The Definitive Guide".
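
To make the effect of buffering concrete, here is a minimal sketch in plain
Java. It is not the HBase client: BufferedStoreClient, its count-based
bufferLimit, and the round-trip counter are all hypothetical names made up
for illustration, and a real client may size its buffer differently. It only
shows how turning auto-flush off reduces the number of remote calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a k-v store client. This is NOT the HBase API;
// it only models the effect of turning AutoFlush on or off.
class BufferedStoreClient {
    private final boolean autoFlush;
    private final int bufferLimit;               // puts held before a forced flush
    private final List<String> buffer = new ArrayList<>();
    private int roundTrips = 0;                  // simulated remote calls
    private int recordsSent = 0;

    BufferedStoreClient(boolean autoFlush, int bufferLimit) {
        this.autoFlush = autoFlush;
        this.bufferLimit = bufferLimit;
    }

    // With autoFlush on, every put is sent at once; otherwise it waits in the buffer.
    void put(String record) {
        buffer.add(record);
        if (autoFlush || buffer.size() >= bufferLimit) {
            flush();
        }
    }

    // Sends everything buffered so far as one simulated remote call.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        roundTrips++;
        recordsSent += buffer.size();
        buffer.clear();
    }

    int roundTrips() { return roundTrips; }
    int recordsSent() { return recordsSent; }
}

class AutoFlushDemo {
    public static void main(String[] args) {
        BufferedStoreClient eager = new BufferedStoreClient(true, 100);
        BufferedStoreClient buffered = new BufferedStoreClient(false, 100);
        for (int i = 0; i < 250; i++) {
            eager.put("row-" + i);
            buffered.put("row-" + i);
        }
        eager.flush();      // nothing left to send
        buffered.flush();   // sends the last partial batch, as a close() would
        System.out.println("AutoFlush on:  " + eager.roundTrips() + " round trips");
        System.out.println("AutoFlush off: " + buffered.roundTrips() + " round trips");
    }
}
```

In this toy model, 250 puts cost 250 round trips with auto-flush on and 3
with it off (two full batches of 100 plus the final flush of 50). The final
flush matters: anything still in the buffer when the writer goes away is lost
unless it is flushed on close.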

> My understanding is that Context.write() calls RecordWriter.write() and that
> these two are synchronous calls. The first will block until the second
> method completes. Each reduce phase blocks until the context.write()
> finishes, so the next reduce on the next key also blocks, making things run
> slow in my case. Is this correct?

Given the above explanation, this is not true when HBase's
TableOutputFormat is involved, since most writes only land in the
client-side buffer. It does hold for general OutputFormats that write
to a filesystem.
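
For a store whose client does not buffer, the producer/consumer decoupling
you describe could look roughly like the sketch below. This is plain Java
under stated assumptions, not a Hadoop RecordWriter: Sink and AsyncWriter are
hypothetical names, the sink stands in for your put(), and a single
background thread is assumed so that records reach the store in write order.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sink standing in for the slow put() on the remote store.
interface Sink {
    void put(String key, String value);
}

// Hypothetical writer that queues records for one background thread.
// It mirrors the producer/consumer idea only; it is not a Hadoop RecordWriter.
class AsyncWriter {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final Sink sink;

    AsyncWriter(Sink sink) {
        this.sink = sink;
    }

    // Producer side: returns once the record is queued, before put() has run.
    void write(String key, String value) {
        pool.execute(() -> sink.put(key, value));
    }

    // Waits for every queued put() to finish; skipping this would drop records.
    void close() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("writes still pending at close");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining writes", e);
        }
    }
}

class AsyncWriterDemo {
    public static void main(String[] args) {
        AtomicInteger stored = new AtomicInteger();
        AsyncWriter writer = new AsyncWriter((k, v) -> stored.incrementAndGet());
        for (int i = 0; i < 1000; i++) {
            writer.write("key-" + i, "value-" + i);
        }
        writer.close();
        System.out.println("records stored: " + stored.get());
    }
}
```

Two caveats with this sketch: the executor's queue is unbounded, so a fast
reducer can pile up records in memory, and a put() that throws is not
reported back to the caller of write(). A real implementation would want a
bounded queue and some way to surface failures before the task commits.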

> Does this mean that OutputFormat is
> instantiated once by the TaskTracker for the Job's reduce logic and all keys
> operated on by the reducers get the same instance of the OutputFormat? Or,
> is it that for each key operated on by the reducer, a new OutputFormat is
> instantiated?

The TaskTracker is a service daemon that does not execute any
user code. Only a single OutputFormat object is instantiated per
Task, and the RecordWriter it provides is likewise instantiated only
once per Task.
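
As a rough picture of that lifecycle, here is a toy model in plain Java. The
Mini* classes are hypothetical and are not Hadoop classes; they only mimic
the order of calls inside one reduce task: one writer obtained up front, one
write per key group, one close at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical miniature of a record writer; collects what it is given.
class MiniRecordWriter {
    final List<String> written = new ArrayList<>();
    boolean closed = false;

    void write(String key, int value) {
        written.add(key + "=" + value);
    }

    void close() {
        closed = true;
    }
}

// Hypothetical miniature of an output format; counts the writers it hands out.
class MiniOutputFormat {
    int writersCreated = 0;

    MiniRecordWriter getRecordWriter() {
        writersCreated++;
        return new MiniRecordWriter();
    }
}

class MiniReduceTask {
    // Runs every key group of one task through a single writer and returns it.
    static MiniRecordWriter run(MiniOutputFormat format, Map<String, List<Integer>> groups) {
        MiniRecordWriter writer = format.getRecordWriter();   // once per task
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;                                     // stands in for reduce()
            }
            writer.write(group.getKey(), sum);                // same writer for every key
        }
        writer.close();                                       // once per task
        return writer;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        groups.put("a", Arrays.asList(1, 2));
        groups.put("b", Arrays.asList(3, 4));
        MiniOutputFormat format = new MiniOutputFormat();
        MiniRecordWriter writer = run(format, groups);
        System.out.println("writers created: " + format.writersCreated);
        System.out.println("records written: " + writer.written);
    }
}
```

Because the writer lives for the whole task, any per-writer state such as a
connection, a buffer, or a background thread is set up once and can be shared
across all keys, then released in close().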

> Thanks,
> Dhruv

--
Harsh J