MapReduce >> mail # user >> Re: Understanding of the hadoop distribution system (tuning)


Re: Understanding of the hadoop distribution system (tuning)
Hi,

Responses inline to some points.

On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm new to Hadoop and I've just played around with MapReduce.
> I would like to check whether my understanding of Hadoop is correct, and I
> would appreciate it if anyone could correct me where I'm wrong.
>
> I have a dataset of around 518 MB, and I wrote an MR program to process it.
> Here are some of my settings in mapred-site.xml:
> ---------------------------------------------------------------
> mapred.tasktracker.map.tasks.maximum = 20
> mapred.tasktracker.reduce.tasks.maximum = 20
> ---------------------------------------------------------------
>

These two configurations tell each tasktracker that it can
run up to 20 maps and 20 reduces in parallel on its machine. Is this what you
intended? (Generally, the sum of these two values should be about equal to the
number of cores on your tasktracker node, or a little more.)
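
For reference, here is a sketch of how those two properties would look inside mapred-site.xml; the values shown are the ones from your mail, included only to illustrate the format, not as recommendations:

```xml
<!-- mapred-site.xml (illustrative values, not recommendations) -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value> <!-- max concurrent map tasks per tasktracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>20</value> <!-- max concurrent reduce tasks per tasktracker -->
  </property>
</configuration>
```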

Also, it would help if you could tell us your cluster size, i.e. the number of
slaves.
> My block size is the default, 64 MB.
> With my data size = 518 MB, I guess setting the maximum for MR tasks to 20
> is far more than enough (518/64 ≈ 8); did I understand correctly?
>
>
I suppose what you want is to run all the maps in parallel. For that, the
number of map slots in your cluster should be at least the number of maps
in your job (assuming there's a single job running). If the number of slots
is less than the number of maps, the maps will be scheduled in multiple waves.
On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
you the total slots available in your cluster.
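
To make the "waves" idea concrete, here is a small sketch with hypothetical numbers (8 maps for the job, a cluster map capacity of 3 slots; neither number comes from your mail):

```python
import math

# Hypothetical numbers: a job with 8 map tasks, and a cluster whose
# "Map Task Capacity" (total map slots across all tasktrackers) is 3.
num_maps = 8
map_slot_capacity = 3

# With fewer slots than maps, the maps run in successive scheduling
# waves: 3 maps, then 3 maps, then the remaining 2.
waves = math.ceil(num_maps / map_slot_capacity)
print(waves)  # -> 3
```

With a capacity of 8 or more slots, all maps would run in a single wave, which is what you'd want for minimum wall-clock time on a single job.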

> When I run the MR program, I can see on the Map/Reduce Administration
> page that the number of Maps Total = 8, so I assume that everything is
> going well here; once again, if I'm wrong please correct me.
> (Sometimes it shows only Maps Total = 3.)
>
>
This value tells us the number of maps that will run for the job.
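
As a side note, 518/64 is actually 8.09, yet you see exactly 8 maps. That is consistent with FileInputFormat's default behavior: the last chunk is folded into the previous split when it is within a slop factor (1.1 by default) of the split size. A rough sketch of that logic, simplified to ignore min/max split-size settings and multi-file inputs:

```python
# Rough sketch of FileInputFormat's split computation.
# Assumes the default SPLIT_SLOP of 1.1; simplified (single file,
# split size == block size, min/max split settings ignored).

def count_splits(file_size_mb, block_size_mb, slop=1.1):
    splits = 0
    remaining = file_size_mb
    # Keep carving off full-size splits while the remainder is
    # more than slop * block size.
    while remaining / block_size_mb > slop:
        splits += 1
        remaining -= block_size_mb
    if remaining > 0:
        splits += 1  # the leftover bytes form the last split
    return splits

# 518 MB with 64 MB blocks: after 7 full splits, 70 MB remains,
# and 70/64 = 1.09 <= 1.1, so it stays one split -> 8 maps total.
print(count_splits(518, 64))  # -> 8
```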
> There's one thing I'm uncertain about regarding Hadoop's distribution.
> Does Maps Total = 8 mean that there are 8 map tasks split among all
> the datanodes (tasktrackers)?
> Is there any way I can check whether all the tasks are shared among the
> datanodes (where the tasktrackers are running)?
>

There's no easy aggregate view of this. The task page for each task shows the
attempts that ran for that task and where they ran, under the 'Machine'
column.
> When I click on each link under a Task Id, I can see "Input
> Split Locations" stated in each task's details. If the inputs are
> split between datanodes, does that mean that everything is working
> well?
>
>
I think this is just the location of the splits, including the replicas.
What you could check is whether enough data-local maps ran, which means the
tasks mostly got their input from datanodes running on the same machine as
themselves. This is given by the counter "Data-local map tasks" on the job
UI page.
> I need to make sure I have everything running well, because my MR job took
> around 6 hours to finish even though the input size is small. (Well, I know
> Hadoop is not meant for small data.) I'm not sure whether it's my
> configuration that is wrong or whether Hadoop is just not suitable for my case.
> I'm actually running a Mahout k-means analysis.
>
> Thank you for your time.
>
>
>
>
>