

Re: OutputFormat and Reduce Task
Hi Dhruv,

Inline.

On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <[EMAIL PROTECTED]> wrote:
> I'm trying to optimize the performance of my OutputFormat's implementation.
> I'm doing things similar to HBase's TableOutputFormat--sending the reducer's
> output to a distributed k-v store. So, the context.write() call basically
> winds up doing a Put() on the store.
>
> Although I haven't profiled, a sequence of thread dumps on the reduce tasks
> reveals that the threads are RUNNABLE and hanging out in the put() and its
> subsequent method calls. So, I proceeded to decouple these two by
> implementing the producer (context.write()) / consumer (RecordWriter.write())
> pattern using ExecutorService.

With HBase involved, this is only partly correct. The HTable API,
which the regular TableOutputFormat uses, provides an "AutoFlush" option
which, if disabled, buffers writes on the client instead of sending
Puts/Deletes to the regionservers on every single invocation.

TableOutputFormat disables AutoFlush by default, precisely to get
this buffering behavior.

Read more on that at
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
and/or in Lars' book, "HBase: The Definitive Guide".
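
To make the effect of buffering concrete, here is a minimal sketch in plain Java. It does not use the HBase client at all: BufferingClient, its count-based bufferLimit, and simulate() are hypothetical stand-ins invented for illustration (as far as I know the real HTable write buffer is sized in bytes, not in number of Puts). It only counts how many simulated round trips a run of puts would cost with the flag on versus off.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a k-v store client; this is NOT the HBase HTable API. */
class BufferingClient {
    private final boolean autoFlush;   // true: send every put immediately
    private final int bufferLimit;     // assumed count-based limit before a forced flush
    private final List<String> buffer = new ArrayList<>();
    private int rpcCount = 0;          // number of simulated round trips to the store

    BufferingClient(boolean autoFlush, int bufferLimit) {
        this.autoFlush = autoFlush;
        this.bufferLimit = bufferLimit;
    }

    /** Queues a row; flushes right away if autoFlush is on or the buffer is full. */
    void put(String row) {
        buffer.add(row);
        if (autoFlush || buffer.size() >= bufferLimit) {
            flushCommits();
        }
    }

    /** Sends everything buffered so far as one simulated round trip. */
    void flushCommits() {
        if (buffer.isEmpty()) {
            return;
        }
        rpcCount++;
        buffer.clear();
    }

    int rpcCount() {
        return rpcCount;
    }

    /** Writes `puts` rows, flushes the remainder, and returns the round-trip count. */
    static int simulate(boolean autoFlush, int puts, int bufferLimit) {
        BufferingClient client = new BufferingClient(autoFlush, bufferLimit);
        for (int i = 0; i < puts; i++) {
            client.put("row-" + i);
        }
        client.flushCommits(); // what a writer's close() would be expected to do
        return client.rpcCount();
    }
}
```

With 10 puts and a limit of 4, simulate(true, 10, 4) costs 10 round trips while simulate(false, 10, 4) costs 3. The final flushCommits() matters: with buffering on, anything still in the buffer is lost unless it is flushed when the writer is closed.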

> My understanding is that Context.write() calls RecordWriter.write() and that
> these two are synchronous calls. The first will block until the second
> method completes. Each reduce phase blocks until the context.write()
> finishes, so the next reduce on the next key also blocks, making things run
> slow in my case. Is this correct?

Given the above explanation, this is untrue if HBase's
TableOutputFormat is involved, but true otherwise for general
filesystem-backed OutputFormats.
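
For the synchronous case, the producer/consumer decoupling you describe could look roughly like the sketch below. It is plain Java and does not implement Hadoop's RecordWriter: Sink, AsyncWriter, and demo() are hypothetical names, and the one-minute timeout is an arbitrary assumption. write() only enqueues the record; a single background thread performs the slow put, so records reach the store in submission order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sink standing in for the store's put call. */
interface Sink {
    void put(String key, String value);
}

/** Hands each record to one background thread so that write() returns quickly. */
class AsyncWriter {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final Sink sink;

    AsyncWriter(Sink sink) {
        this.sink = sink;
    }

    /** Producer side: enqueue the record and return without waiting for the put. */
    void write(String key, String value) {
        pool.submit(() -> sink.put(key, value));
    }

    /** Drains pending writes; must be called before the task is allowed to finish. */
    void close() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pending writes did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining writes", e);
        }
    }

    /** Writes n records through the async writer and returns what the sink received. */
    static List<String> demo(int n) {
        List<String> stored = Collections.synchronizedList(new ArrayList<>());
        AsyncWriter writer = new AsyncWriter((k, v) -> stored.add(k + "=" + v));
        for (int i = 0; i < n; i++) {
            writer.write("k" + i, "v" + i);
        }
        writer.close();
        return stored;
    }
}
```

Two caveats with this sketch: the executor's queue is unbounded, so a reducer that produces faster than the store can absorb will grow the heap, and an exception thrown inside sink.put() is captured in the discarded Future rather than failing the task. A real implementation would want a bounded queue and would check for failures in close().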

> Does this mean that OutputFormat is
> instantiated once by the TaskTracker for the Job's reduce logic and all keys
> operated on by the reducers get the same instance of the OutputFormat? Or
> is it that for each key operated on by the reducer, a new OutputFormat is
> instantiated?

The TaskTracker is a service daemon that does not execute any
user code. Only a single OutputFormat object is instantiated in a
single Task. The RecordWriter it creates is likewise instantiated
only once per Task.
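
As a rough illustration of that lifecycle, the sketch below models a single task in plain Java. None of it is Hadoop code: TaskLifecycleSketch, Format, Writer, and runTask() are hypothetical names, and the loop is only an assumed simplification of how a task drives reduce() calls. The point is that the format and the writer are created once, before the loop over keys, and every write for every key goes through the same writer.

```java
/** Hypothetical model of one task; these are not Hadoop classes. */
class TaskLifecycleSketch {
    int formatsCreated = 0;
    int writersCreated = 0;
    int recordsWritten = 0;

    /** Stand-in for a RecordWriter: counts its own creation and each write. */
    class Writer {
        Writer() {
            writersCreated++;
        }

        void write(String key, String value) {
            recordsWritten++;
        }
    }

    /** Stand-in for an OutputFormat: counts its own creation. */
    class Format {
        Format() {
            formatsCreated++;
        }

        Writer getRecordWriter() {
            return new Writer();
        }
    }

    /** Sets up output once, then reuses the same writer for every key. */
    void runTask(String[] keys, int valuesPerKey) {
        Format format = new Format();             // once per task
        Writer writer = format.getRecordWriter(); // once per task
        for (String key : keys) {                 // one reduce() call per key
            for (int i = 0; i < valuesPerKey; i++) {
                writer.write(key, "v" + i);       // same writer every time
            }
        }
    }

    /** Runs a task over three keys and returns {formats, writers, records}. */
    static int[] demo() {
        TaskLifecycleSketch task = new TaskLifecycleSketch();
        task.runTask(new String[] {"a", "b", "c"}, 2);
        return new int[] {task.formatsCreated, task.writersCreated, task.recordsWritten};
    }
}
```

Because the writer is shared across all keys in the task, any state it holds (buffers, connections, background threads) lives for the whole task and should be released in its close step, not per key.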

> Thanks,
> Dhruv

--
Harsh J