Re: cost model for MR programs
Jeff Hammerbacher 2009-08-29, 06:14
There has been some work in the research community on predicting the runtime
of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are not
always cost-based. At the University of Washington, they've worked on
progress indicators for Pig; see
ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the
University of California, Berkeley, they've worked on predicting multiple
metrics about query progress, first in NeoView, and subsequently in Hadoop;
see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf for
details. Some preliminary design work has been done in the Hive project for
collecting statistics on Hive tables. See
https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this
work would be much appreciated!
On Fri, Aug 28, 2009 at 7:37 PM, indoos <[EMAIL PROTECTED]> wrote:
> My suggestion would be that we should not be forcing database-style cost
> models onto Hadoop.
> However, here is something not probably even close to what you may require,
> but might be helpful-
> 1. Number of nodes - these are the parameters to look for -
> - average time taken by a single Map and Reduce task (available as part of
> the job history in the JobTracker web UI)
> - Max input file size vs block size. Let's take an example - a 64 GB input
> file with a 64 MB block size would require ~1000 Maps. The more of these
> 1000 Maps you want to run in parallel, the more nodes you need. A 10 node
> cluster with 10 Map slots per node would have to run them in ~10 sequential
> waves :-(
> - Ultimately it is the time vs cost trade-off that decides the number of
> nodes. In this example, if a map takes at least 2 minutes, the minimum job
> time would be 2*10=20 minutes. Less time would mean more nodes.
> - The number of jobs you decide to run at the same time would also affect
> the number of nodes. Effectively every individual job task (map/reduce)
> runs in a sequential kind of mode, waiting in the queue for the currently
> executing map/reduce tasks to finish. (Of course, we have some
> prioritization support - this does not, however, help to finish everything
> in parallel.)
> 2. RAM - a general rule of thumb is 1 GB RAM each for the NameNode,
> JobTracker, and Secondary NameNode on the master side. On the slave side,
> 1 GB RAM each for the TaskTracker and DataNode leaves practically not much
> for real computing on a commodity 8 GB machine. The remaining 5-6 GB can
> then be used for Map and Reduce tasks. So with our example of running 10
> Maps, each Map gets at most a 500-600 MB heap. Anything beyond this would
> require either fewer concurrent Maps or more RAM.
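The RAM budget in point 2 reduces to simple division; a minimal sketch (hypothetical helper, not a Hadoop setting):

```python
# Per-task heap budget on one slave node: subtract the daemon overhead
# (TaskTracker + DataNode) and split the rest across concurrent tasks.
def per_task_heap_gb(node_ram_gb, daemon_ram_gb, concurrent_tasks):
    usable = node_ram_gb - daemon_ram_gb
    return usable / concurrent_tasks

# 8 GB commodity node, 1 GB each for TaskTracker and DataNode, 10 map slots:
heap_gb = per_task_heap_gb(node_ram_gb=8, daemon_ram_gb=2, concurrent_tasks=10)
print(round(heap_gb * 1024))   # 614 (MB per task)
```

In practice some of that headroom goes to OS caches and JVM overhead, which is why a conservative 400-500 MB heap per task is a safer planning figure.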
> 3. Network speed - Hadoop recommends (I think I read it somewhere -
> apologies if otherwise) using at least 1 Gbit/s networks for the heavy
> data transfer. My experience with 100 Mbit/s, even in a dev environment,
> has been disastrous.
> 4. Hard disk - again a rule of thumb - only about 1/4 of the raw disk
> capacity is effectively available. So given a 4 TB hard disk, effectively
> only 1 TB can be used for real data, with 2 TB used for replication (3 is
> the ideal replication factor) and 1 TB for temp usage.
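The disk rule of thumb in point 4 can be written out the same way (a sketch under the stated assumptions of 3x replication and ~1/4 of raw capacity reserved for temp/intermediate data):

```python
# Usable HDFS capacity: reserve a fraction for temp/intermediate data,
# then divide what remains by the replication factor.
def usable_capacity_tb(raw_tb, replication=3, temp_fraction=0.25):
    temp_tb = raw_tb * temp_fraction
    return (raw_tb - temp_tb) / replication

print(usable_capacity_tb(4))   # 1.0 (TB of real data on a 4 TB disk)
```

The three replicas of that 1 TB account for the 3 TB of non-temp space, matching the 1 TB data / 2 TB replication / 1 TB temp split above.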
> bharath vissapragada-2 wrote:
> > Hi all,
> > Is there any general cost model that can be used to guess the run time
> > of a program (similar to page IO/s, selectivity factors in an RDBMS) in
> > terms of config aspects such as number of nodes, page IO/s, etc.?
> Sent from the Hadoop core-user mailing list archive at Nabble.com.