Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Could pig dynamic change the reduce number according the mapper task number ?


Copy link to this message
-
RE: Could pig dynamic change the reduce number according the mapper task  number ?
I was hoping that the cost based optimizer being developed by Ashutosh
and Dmitriy will address this issue.

Santhosh

-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 12, 2009 8:26 AM
To: [EMAIL PROTECTED]
Cc: David (Ciemo) Ciemiewicz
Subject: Re: Could pig dynamic change the reduce number according the
mapper task number ?

I agree that it would be very useful to have a dynamic number of
reducers.  However, I'm not sure how to accomplish it.  MapReduce
requires that we set the number of reducers up front in JobConf, when we
submit the job.  But we don't know the number of maps until getSplits is
called after job submission.  I don't think MR will allow us to set the
number of reducers once the job is started.

Others have suggested that we use the file size to specify the number of
reducers.  We cannot always assume the inputs are HDFS files (it could
be from HBase or something).  Also different storage formats (text,
sequence files, zebra) would need different ratios of bytes to reducers
since they store data at different compression rates.  Maybe this could
still work assuming, only in the HDFS case, with the assumption that the
user understands the compression ratios and thus can set the reducer
input accordingly.  But I'm not sure this will be simple enough to be
useful.

Thoughts?

Alan.
On Nov 12, 2009, at 12:12 AM, Jeff Zhang wrote:

> Hi all,
>
> Often, I will run one script on different data set. Sometimes small
> data set and sometimes large data set. And different size of data set
> require different number of reducers.
> I know that the default reduce number is 1, and users can change the
> reduce number in script by keywords parallel.
>
> But I do not want to be bothered to change reduce number in script
> each time I run script.
> So I have an idea that could pig provide some API that users can set
> the ratio between map task and reduce task. (and some new keyword in
> pig latin to set the ratio)
>
> e.g. If I set the ratio to be 2:1, then if I have 100 map tasks, it
> will have 50 reduce task accordingly.
>
> I think it will be convenient for pig users.
>
>
> Jeff Zhang