MapReduce, mail # user - OutputFormat and Reduce Task


Re: OutputFormat and Reduce Task
Dhruv 2012-11-02, 17:35
Thanks Harsh. Just to be clear: if I have a large key set and run with
just one reducer, which is the default, will the OutputFormat and the
RecordWriter be constructed only once?
On Thu, Nov 1, 2012 at 8:14 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Dhruv,
>
> Inline.
>
> On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <[EMAIL PROTECTED]> wrote:
> > I'm trying to optimize the performance of my OutputFormat
> > implementation. I'm doing things similar to HBase's
> > TableOutputFormat: sending the reducer's output to a distributed k-v
> > store. So, the context.write() call basically winds up doing a Put()
> > on the store.
> >
> > Although I haven't profiled, a sequence of thread dumps on the reduce
> > tasks reveals that the threads are RUNNABLE and hanging out in the
> > put() and its subsequent method calls. So, I proceeded to decouple
> > these two by implementing the producer (context.write()) / consumer
> > (RecordWriter.write()) pattern using ExecutorService.
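
The decoupling described above could look roughly like the following minimal sketch. It is not the poster's actual code: `KeyValueStore` and `AsyncStoreWriter` are hypothetical names, and a real version would extend Hadoop's RecordWriter and drain the queue in its close() method. The point is only that write() returns as soon as the record is queued, while a single background thread performs the put.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the k-v store client; not a Hadoop or HBase type.
interface KeyValueStore {
    void put(String key, String value);
}

// Hypothetical writer: the producer calls write(), one consumer thread calls put().
class AsyncStoreWriter implements AutoCloseable {
    private final KeyValueStore store;
    // A single worker thread keeps puts in submission order.
    private final ExecutorService consumer = Executors.newSingleThreadExecutor();

    AsyncStoreWriter(KeyValueStore store) {
        this.store = store;
    }

    // Returns once the record is queued; the put happens on the worker thread.
    void write(String key, String value) {
        consumer.execute(() -> store.put(key, value));
    }

    // Stops accepting records and waits for queued puts to finish.
    @Override
    public void close() {
        consumer.shutdown();
        try {
            if (!consumer.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pending puts did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining puts", e);
        }
    }
}
```

Note that the executor's queue here is unbounded, so a reducer that produces much faster than the store accepts writes will accumulate records in memory; a bounded queue would be the usual remedy.
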
>
> With HBase involved, this is only partly correct. The HTable API,
> which the regular TableOutputFormat uses, provides an "AutoFlush" option
> which, if disabled, buffers writes to the regionservers instead of
> flushing Puts/Deletes on every single invocation.
>
> The TableOutputFormat by default does disable AutoFlush, to provide
> this behavior.
>
> Read more on that at
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
> and/or in Lars' book, "HBase: The Definitive Guide".
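
To illustrate the buffering idea only, here is a minimal sketch of a client-side write buffer. `BatchSink` and `BufferedPutWriter` are hypothetical names, not HBase classes, and the buffer is bounded by record count purely for simplicity (an assumption of this sketch, not a description of how HTable sizes its buffer).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one round trip to the store.
interface BatchSink {
    void send(List<String> batch);
}

// Hypothetical writer that collects records and sends them in batches.
class BufferedPutWriter {
    private final BatchSink sink;
    private final int maxBuffered; // assumed count-based limit
    private final List<String> buffer = new ArrayList<>();

    BufferedPutWriter(BatchSink sink, int maxBuffered) {
        this.sink = sink;
        this.maxBuffered = maxBuffered;
    }

    // Buffers the record; only a full buffer triggers a round trip.
    void put(String record) {
        buffer.add(record);
        if (buffer.size() >= maxBuffered) {
            flush();
        }
    }

    // Sends whatever is buffered as one batch, then empties the buffer.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.send(new ArrayList<>(buffer));
        buffer.clear();
    }

    // Flushes the remainder so no buffered record is lost.
    void close() {
        flush();
    }
}
```

With a limit of 3, seven put() calls cause two round trips, and close() sends the last record in a third.
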
>
> > My understanding is that Context.write() calls RecordWriter.write()
> > and that these two are synchronous calls. The first will block until
> > the second method completes. Each reduce phase blocks until the
> > context.write() finishes, so the next reduce on the next key also
> > blocks, making things run slow in my case. Is this correct?
>
> Given the above explanation, this is untrue if HBase's
> TableOutputFormat is involved, but true otherwise for general
> filesystem-backed OutputFormats.
>
> > Does this mean that the OutputFormat is instantiated once by the
> > TaskTracker for the Job's reduce logic and all keys operated on by
> > the reducers get the same instance of the OutputFormat? Or is it that
> > for each key operated on by the reducer, a new OutputFormat is
> > instantiated?
>
> The TaskTracker is a service daemon that does not execute any
> user code. Only a single OutputFormat object is instantiated in a
> single Task. The RecordWriter it provides is likewise instantiated
> only once per Task.
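
As a simplified model of that lifecycle (hypothetical classes, not Hadoop's task code), the sketch below creates the writer once per task and reuses it for every key:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical writer that counts how often it is constructed and used.
class CountingWriter {
    static final AtomicInteger created = new AtomicInteger();
    int writes = 0;

    CountingWriter() {
        created.incrementAndGet();
    }

    void write(String key, int value) {
        writes++;
    }

    void close() {
        // Nothing to release in this model.
    }
}

// Hypothetical model of one reduce task processing all of its keys.
class ReduceTaskModel {
    static CountingWriter run(List<String> keys) {
        CountingWriter writer = new CountingWriter(); // constructed once per task
        for (String key : keys) {
            writer.write(key, key.length()); // same writer for every key
        }
        writer.close(); // closed once, after the last key
        return writer;
    }
}
```
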
>
> > Thanks,
> > Dhruv
>
>
>
> --
> Harsh J
>