NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
Pig >> mail # user >> reducer throttling?


Re: reducer throttling?
Can you describe your bulk insert technique in a bit more detail? And do
you also control the number of reducers by adding an artificial ORDER or
GROUP step?

Thanks!

On Thu, Mar 17, 2011 at 1:33 PM, Alex Rovner <[EMAIL PROTECTED]> wrote:

> We use a bulk insert technique after the job completes. You can control the
> size of each bulk insert by controlling the number of reducers.
>
> Sent from my iPhone
>
> On Mar 17, 2011, at 2:03 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
>
> > We do some processing in Hadoop and then, as the last step, write the
> > result to a database. The database is not good at handling hundreds of
> > concurrent connections and fast writes, so we need to throttle the number
> > of tasks that write to the DB. Since we have no control over the number
> > of mappers, we add an artificial reduce step to achieve that, using
> > either GROUP or ORDER, like this:
> >
> > sorted_data = ORDER data BY f1 PARALLEL 10;
> > -- then write sorted_data to DB
> >
> > or
> >
> > grouped_data = GROUP data BY f1 PARALLEL 10;
> > data_to_write = FOREACH grouped_data GENERATE FLATTEN($1);
> >
> > I feel neither is a good approach. Both just add unnecessary computing
> > time, especially the first one. And GROUP may produce bags that are too
> > large.
> >
> > Any better suggestions?
>
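
The thread does not spell out the "bulk insert technique" mentioned above, so the following is only a minimal sketch of one way it could look, under stated assumptions: the job writes tab-separated part files (the PigStorage default), and a single loader process inserts them in fixed-size batches after the job completes. The `results` table, its two columns, and the helper names `batches`, `bulk_insert`, and `parse_part_file` are hypothetical, and `sqlite3` stands in for whatever database is actually used.

```python
import sqlite3
from itertools import islice


def batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def parse_part_file(lines):
    """Split tab-separated lines into (f1, f2) tuples.

    Assumes two fields per line; the real schema is not given in the thread.
    """
    for line in lines:
        f1, f2 = line.rstrip("\n").split("\t")
        yield f1, f2


def bulk_insert(conn, rows, batch_size=1000):
    """Insert rows over ONE connection, committing once per batch.

    Returns the number of rows inserted. The table name and columns
    are hypothetical placeholders.
    """
    total = 0
    for chunk in batches(rows, batch_size):
        conn.executemany("INSERT INTO results (f1, f2) VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE results (f1 TEXT, f2 TEXT)")
    sample = ["a\t1\n", "b\t2\n", "c\t3\n"]  # stand-in for one part file
    print(bulk_insert(conn, parse_part_file(sample), batch_size=2))
```

With this shape, the number of concurrent database connections is set by how many loader processes are started rather than by the number of map or reduce tasks, and the number of part files only determines how the input is split up.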