

Re: cost model for MR programs
@sanjay

Thanks for your reply. I too was thinking along the same lines.

@Jeff

I found those links very useful... I am working on similar things. Thanks
for the reply! :)

On Sat, Aug 29, 2009 at 11:44 AM, Jeff Hammerbacher <[EMAIL PROTECTED]> wrote:

> Hey Bharath,
> There has been some work in the research community on predicting the
> runtime of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are not
> always cost-based. At the University of Washington, they've worked on
> progress indicators for Pig; see
> ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the
> University of California, Berkeley, they've worked on predicting multiple
> metrics about query progress, first in NeoView, and subsequently in Hadoop;
> see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf for the
> NeoView results.
>
> Some preliminary design work has been done in the Hive project for
> collecting statistics on Hive tables. See
> https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this
> work would be much appreciated!
>
> Regards,
> Jeff
>
> On Fri, Aug 28, 2009 at 7:37 PM, indoos <[EMAIL PROTECTED]> wrote:
>
> >
> > Hi,
> > My suggestion would be that we should not compel ourselves to compare
> > databases with Hadoop.
> > However, here is something that is probably not even close to what you
> > may require, but might be helpful:
> > 1. Number of nodes - these are the parameters to look for:
> > - average time taken by a single Map and Reduce task (available as part
> > of the job-history analytics),
> > - max input file size vs. block size. Let's take an example: a 6 GB input
> > file with a 64 MB block size would ideally require ~100 Maps. The more of
> > these ~100 Maps you want to run in parallel, the more nodes you need. A
> > 10-node cluster running 10 Maps at a time would have to run ~10 waves in
> > a kind of sequential mode :-(
> > - ultimately it is the time vs. cost trade-off that decides the number of
> > nodes. So for this example, if a Map takes at least 2 minutes, the
> > ~minimum time would be 2*10 = 20 minutes. Less time would mean more nodes.
> > - The number of Jobs that you might decide to run at the same time would
> > also affect the number of nodes. Effectively every individual job task
> > (map/reduce) runs in a sequential kind of mode, waiting in the queue for
> > the executing map/reduce slots to finish. (Of course, we have some
> > prioritization support - this does not, however, help to finish
> > everything in parallel.)
> > 2. RAM - a general rule of thumb is 1 GB RAM each for the Name Node, Job
> > Tracker, and Secondary Name Node on the master side. On the slave side,
> > 1 GB RAM each for the task tracker and data node, which leaves not much
> > for real computing on a commodity 8 GB machine. The remaining 5-6 GB can
> > then be used for Map and Reduce tasks. So with our example of running 10
> > Maps on a node, each Map would get at most 400-500 MB of heap. Anything
> > beyond this would require either the number of Maps to be reduced or the
> > RAM to be increased.
> > 3. Network speed - Hadoop recommends (I think I read it somewhere -
> > apologies if otherwise) using at least 1 Gb/s networks for the heavy data
> > transfer. My experience with 100 Mb/s even in a dev environment has been
> > disastrous.
> > 4. Hard disk - again a rule of thumb: only about 1/4 of the raw disk is
> > effectively available for your own data. So given a 4 TB hard disk,
> > effectively only 1 TB can be used for real data, with 2 TB used for
> > replication (3 being the typical replication factor) and 1 TB for temp
> > usage.
> > Regards,
> > Sanjay
> >
> >
> > bharath vissapragada-2 wrote:
> > >
> > > Hi all,
> > >
> > > Is there any general cost model that can be used to estimate the run
> > > time of a program (similar to page I/Os and selectivity factors in an
> > > RDBMS) in terms of config aspects such as the number of nodes, page
> > > I/Os, etc.?
> > >
> > >
> >
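For anyone who wants to plug in their own numbers, here is a minimal back-of-envelope sketch (in Java) of the sizing arithmetic from Sanjay's points 1, 2, and 4 above: map count and waves from input size and block size, a rough map-phase runtime, per-task heap on a slave node, and usable HDFS capacity per disk. All the figures (6 GB input, 64 MB blocks, 10 nodes, one map slot per node, 2 minutes per map, 8 GB RAM, 4 TB disk, replication factor 3) are illustrative assumptions drawn from or consistent with the examples in this thread, not recommendations.

// Back-of-envelope MapReduce sizing sketch based on the heuristics above.
// All constants are illustrative assumptions, not measured values.
public class MrCostEstimate {

    public static void main(String[] args) {
        // --- Point 1: map count, waves, and rough map-phase runtime ---
        long inputBytes      = 6L * 1024 * 1024 * 1024;  // 6 GB input file
        long blockBytes      = 64L * 1024 * 1024;        // 64 MB HDFS block size
        int  nodes           = 10;                       // cluster size
        int  mapSlotsPerNode = 1;                         // concurrent maps per node (assumed)
        double avgMapMinutes = 2.0;                       // average map task time (from job history)

        long maps  = ceilDiv(inputBytes, blockBytes);      // ~96 map tasks
        long slots = (long) nodes * mapSlotsPerNode;       // concurrent map capacity
        long waves = ceilDiv(maps, slots);                 // sequential "waves" of maps
        double mapPhaseMinutes = waves * avgMapMinutes;    // ignores shuffle/reduce time

        System.out.printf("maps=%d, waves=%d, ~map phase=%.0f min%n",
                maps, waves, mapPhaseMinutes);

        // --- Point 2: heap available per task on a slave node ---
        double nodeRamGb    = 8.0;   // commodity slave node
        double daemonRamGb  = 2.0;   // ~1 GB each for DataNode and TaskTracker
        int    tasksPerNode = 10;    // concurrent map/reduce tasks on the node (assumed)
        double heapPerTaskGb = (nodeRamGb - daemonRamGb) / tasksPerNode;  // ~0.6 GB, before OS overhead
        System.out.printf("heap per task ~%.1f GB%n", heapPerTaskGb);

        // --- Point 4: usable HDFS data capacity per node ---
        double rawDiskTb    = 4.0;
        double tempFraction = 0.25;  // reserve ~1/4 of the disk for intermediate/temp data
        int    replication  = 3;
        double usableDataTb = rawDiskTb * (1 - tempFraction) / replication;  // ~1 TB
        System.out.printf("usable data capacity ~%.1f TB per node%n", usableDataTb);
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }
}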