Pig >> mail # user >> reducer throttling?

reducer throttling?
We do some processing in Hadoop, then as the last step we write the result
to a database. The database does not handle hundreds of concurrent
connections and fast writes well, so we need to throttle the number of tasks
that write to it. Since we have no control over the number of mappers, we add
an artificial reduce step to cap the number of writers, using either ORDER or
GROUP, like this:

sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB


grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE $1;

I feel neither is a good approach. Both just add unnecessary computing time,
especially the first one, and GROUP may produce bags that are too large.
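
To make the bag concern concrete: GROUP emits one record per distinct f1 value, with every matching row collected into a single bag, so $1 in the snippet above is a bag rather than individual rows. A minimal sketch of un-nesting it before the write, assuming data has already been loaded (the DESCRIBE line is only there to inspect the schema):

```pig
grouped_data = GROUP data BY f1 PARALLEL 10;
-- DESCRIBE shows two fields: group (the f1 value) and data (a bag of the original rows)
DESCRIBE grouped_data;
-- FLATTEN turns each bag back into one record per original row
data_to_write = FOREACH grouped_data GENERATE FLATTEN(data);
-- then write data_to_write to DB
```

FLATTEN only changes the shape of what gets written. The bags are still built on the reduce side, so a heavily skewed f1 can still produce a very large bag.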

Any better suggestions?