MapReduce >> mail # user >> OutputFormat and Reduce Task


Re: OutputFormat and Reduce Task
Thanks, Harsh. Just to be clear: if I have a large key set and I run with
just one reducer, which is the default, will the OutputFormat and the
RecordWriter each be constructed only once?
On Thu, Nov 1, 2012 at 8:14 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Dhruv,
>
> Inline.
>
> On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <[EMAIL PROTECTED]> wrote:
> > I'm trying to optimize the performance of my OutputFormat's
> > implementation. I'm doing things similar to HBase's
> > TableOutputFormat--sending the reducer's output to a distributed k-v
> > store. So, the context.write() call basically winds up doing a Put()
> > on the store.
> >
> > Although I haven't profiled, a sequence of thread dumps on the reduce
> > tasks reveals that the threads are RUNNABLE and hanging out in the
> > put() and its subsequent method calls. So, I proceeded to decouple
> > these two by implementing the producer (context.write()) / consumer
> > (RecordWriter.write()) pattern using ExecutorService.
>
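For readers following along, the producer/consumer decoupling described above could look roughly like the sketch below. It is not Dhruv's code: AsyncWriter and Store are hypothetical names, the queue size of 1024 is an arbitrary assumption, and a plain in-memory list stands in for the remote k-v store. The write() call only enqueues work, and a single background thread performs the slow put().

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class AsyncWriterSketch {

    /** Hypothetical store client; put() stands in for the slow remote call. */
    interface Store {
        void put(String key, String value);
    }

    /** Hypothetical writer that hands each put to a background thread. */
    static final class AsyncWriter implements AutoCloseable {
        private final Store store;

        // One consumer thread and a bounded queue. When the queue is full,
        // CallerRunsPolicy makes the producer run the put itself, which gives
        // back-pressure instead of unbounded memory growth.
        private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1024),
                new ThreadPoolExecutor.CallerRunsPolicy());

        AsyncWriter(Store store) {
            this.store = store;
        }

        /** Producer side: returns as soon as the put is queued. */
        void write(String key, String value) {
            executor.execute(() -> store.put(key, value));
        }

        /** Drains all queued puts before returning, so no record is lost. */
        @Override
        public void close() throws InterruptedException {
            executor.shutdown();
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pending writes did not drain in time");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> stored = new CopyOnWriteArrayList<>();
        try (AsyncWriter writer = new AsyncWriter((k, v) -> stored.add(k + "=" + v))) {
            for (int i = 0; i < 100; i++) {
                writer.write("key" + i, "value" + i);
            }
        }
        System.out.println("records stored: " + stored.size());
    }
}
```

Two caveats with this design: an exception thrown by put() on the background thread is not reported back to the caller, and when the queue fills up the caller-runs fallback can reorder writes. A real RecordWriter would need to handle both.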
> With HBase involved, this is only partly correct. The HTable API,
> which regular TableOutputFormat uses, provides an "AutoFlush" option
> which, if disabled, begins to buffer writes to regionservers instead of
> doing a flush of Puts/Deletes at every single invocation.
>
> The TableOutputFormat by default does disable AutoFlush, to provide
> this behavior.
>
> Read more on that at
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
> and/or in Lars' book, "HBase: The Definitive Guide".
>
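To make the effect of disabling AutoFlush concrete, here is a small stand-alone sketch of client-side write buffering. It does not use the HBase API: BufferingWriter and CountingStore are hypothetical, and the buffer is bounded by row count, which is a simplification (HTable's write buffer is sized in bytes). It only shows why buffering reduces the number of round trips to the store.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedPutSketch {

    /** Hypothetical remote store that counts how many round trips it receives. */
    static final class CountingStore {
        int roundTrips = 0;
        int rowsStored = 0;

        void sendBatch(List<String> rows) {
            roundTrips++;
            rowsStored += rows.size();
        }
    }

    /** Hypothetical writer: flushes on every put, or only when the buffer fills. */
    static final class BufferingWriter implements AutoCloseable {
        private final CountingStore store;
        private final int bufferSize;
        private final boolean autoFlush;
        private final List<String> buffer = new ArrayList<>();

        BufferingWriter(CountingStore store, int bufferSize, boolean autoFlush) {
            this.store = store;
            this.bufferSize = bufferSize;
            this.autoFlush = autoFlush;
        }

        void put(String row) {
            buffer.add(row);
            // With autoFlush on, every put becomes its own round trip.
            if (autoFlush || buffer.size() >= bufferSize) {
                flush();
            }
        }

        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            store.sendBatch(new ArrayList<>(buffer));
            buffer.clear();
        }

        /** Sends whatever is still buffered so no rows are dropped. */
        @Override
        public void close() {
            flush();
        }
    }

    /** Writes the given number of rows and reports the round trips used. */
    static int roundTripsFor(boolean autoFlush, int rows, int bufferSize) {
        CountingStore store = new CountingStore();
        try (BufferingWriter writer = new BufferingWriter(store, bufferSize, autoFlush)) {
            for (int i = 0; i < rows; i++) {
                writer.put("row" + i);
            }
        }
        return store.roundTrips;
    }

    public static void main(String[] args) {
        System.out.println("autoFlush on:  " + roundTripsFor(true, 10, 4));  // 10 round trips
        System.out.println("autoFlush off: " + roundTripsFor(false, 10, 4)); // 3 round trips
    }
}
```

With buffering, most write() calls return without touching the network, so the reduce loop is not blocked on every record. The final flush in close() matters: without it, any rows still in the buffer when the task ends would be lost.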
> > My understanding is that Context.write() calls RecordWriter.write()
> > and that these two are synchronous calls. The first will block until
> > the second method completes. Each reduce phase blocks until the
> > context.write() finishes, so the next reduce on the next key also
> > blocks, making things run slow in my case. Is this correct?
>
> Given the above explanation, this is untrue if HBase's
> TableOutputFormat is involved, but true otherwise for general
> FS-interacting OutputFormats.
>
> > Does this mean that the OutputFormat is instantiated once by the
> > TaskTracker for the Job's reduce logic, and all keys operated on by
> > the reducers get the same instance of the OutputFormat? Or is a new
> > OutputFormat instantiated for each key operated on by the reducer?
>
> The TaskTracker is a service daemon that does not execute any
> user code. Only a single OutputFormat object is instantiated in a
> single Task. The RecordWriter wrapped in it is likewise instantiated
> only once per Task.
>
> > Thanks,
> > Dhruv
>
>
>
> --
> Harsh J
>