

Re: cost model for MR programs
Hey Bharath,
There has been some work in the research community on predicting the runtime
of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are not
always cost-based. At the University of Washington, they've worked on
progress indicators for Pig; see
ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the
University of California, Berkeley, they've worked on predicting multiple
metrics about query progress, first in NeoView, and subsequently in Hadoop;
see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf for the
NeoView results.

Some preliminary design work has been done in the Hive project for
collecting statistics on Hive tables. See
https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this
work would be much appreciated!


On Fri, Aug 28, 2009 at 7:37 PM, indoos <[EMAIL PROTECTED]> wrote:

> Hi,
> My suggestion would be that we should not be compelling ourselves to
> compare
> databases with Hadoop.
> However, here is something not probably even close to what you may require,
> but might be helpful-
> 1. Number of nodes - these are the parameters to look for:
> - average time taken by a single Map and Reduce task (available as part of
> history analytics),
> - max input file size vs. block size. Let's take an example: a 6 GB input
> file with a 64 MB block size would require ~96 Maps (one per block). The
> more of these Maps you want to run in parallel, the more nodes you need. A
> 10-node cluster with 10 Map slots would have to run them in ~10 sequential
> waves :-(
> - ultimately it is the time-vs-cost trade-off that decides the number of
> nodes. So for this example, if a Map takes at least 2 minutes, the minimum
> runtime would be 2*10 = 20 minutes. Less time would mean more nodes.
> - The number of jobs you decide to run at the same time also affects the
> number of nodes. Effectively, every individual job task (map/reduce) waits
> in the queue for the currently executing map/reduce slot to finish. (Of
> course, there is some prioritization support; this does not, however, help
> everything finish in parallel.)
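The wave arithmetic above can be sketched in a few lines. This is a rough illustrative model, not a Hadoop API; `estimate_map_runtime` and its parameters are names chosen here for the example.

```python
import math

def estimate_map_runtime(input_size_mb, block_size_mb, map_slots, avg_map_minutes):
    """Lower-bound map-phase runtime: one Map per block, run in waves of
    `map_slots` concurrent tasks. Ignores scheduling and shuffle overhead."""
    num_maps = math.ceil(input_size_mb / block_size_mb)  # one Map per block
    waves = math.ceil(num_maps / map_slots)              # sequential waves
    return num_maps, waves, waves * avg_map_minutes

# Example from the text: 6 GB input, 64 MB blocks, 10 Map slots, 2 min per Map
maps, waves, minutes = estimate_map_runtime(6 * 1024, 64, 10, 2)
# -> 96 Maps, 10 waves, 20 minutes minimum
```

Halving the runtime target would mean doubling the number of concurrent Map slots, i.e. roughly doubling the nodes.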
> 2. RAM - a general rule of thumb is 1 GB of RAM each for the NameNode,
> JobTracker, and Secondary NameNode on the master side. On the slave side,
> 1 GB each for the TaskTracker and DataNode, which leaves little headroom
> for real computing on a commodity 8 GB machine. The remaining 5-6 GB can
> then be used for Map and Reduce tasks. So with our example of running 10
> Maps, each Map gets at most a 400-500 MB heap. Anything beyond this would
> require either reducing the number of Maps or increasing the RAM.
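The per-task heap budget above is just division; a sketch, assuming 1 GB each for the TaskTracker and DataNode as the text suggests (`heap_per_task` is an illustrative name, not a Hadoop setting):

```python
def heap_per_task(total_ram_gb, daemon_ram_gb, concurrent_tasks):
    """MB of heap available per task after the slave daemons take their
    share. Does not subtract OS overhead, so treat it as an upper bound."""
    usable_mb = (total_ram_gb - daemon_ram_gb) * 1024
    return usable_mb // concurrent_tasks

# 8 GB slave, 2 GB reserved for TaskTracker + DataNode, 10 concurrent Maps
per_task = heap_per_task(8, 2, 10)  # ~614 MB upper bound per Map
```

Subtracting OS and JVM overhead from that upper bound lands close to the 400-500 MB working figure quoted above.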
> 3. Network speed - Hadoop recommends (I think I read it somewhere;
> apologies if otherwise) using at least gigabit (1 Gb/s) networking for the
> heavy data transfer. My experience with 100 Mb/s links, even in a dev
> environment, has been disastrous.
> 4. Hard disk - again a rule of thumb: only about 1/4 of the disk is
> effectively available. So given a 4 TB disk, only ~1 TB can be used for
> real data, with 2 TB consumed by replication (3 is the ideal replication
> factor) and 1 TB reserved for temp usage.
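That 1/4 rule falls out of reserving temp space and dividing by the replication factor; a minimal sketch (the function name and the 25% temp fraction are assumptions drawn from the example above):

```python
def effective_capacity_tb(raw_tb, replication=3, temp_fraction=0.25):
    """Usable capacity for unique data: reserve a fraction for temp/shuffle
    space, then divide the rest by the HDFS replication factor."""
    after_temp = raw_tb * (1 - temp_fraction)
    return after_temp / replication

cap = effective_capacity_tb(4)  # 4 TB raw -> 1.0 TB of unique data
```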
> Regards,
> Sanjay
> bharath vissapragada-2 wrote:
> >
> > Hi all,
> >
> > Is there any general cost model that can be used to estimate the run
> > time of a program (similar to page I/Os and selectivity factors in an
> > RDBMS) in terms of configuration aspects such as the number of nodes,
> > page I/Os, etc.?
> >
> >
> --
> View this message in context:
> http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.