MapReduce, mail # user - Hadoop processing

Re: Hadoop processing
Michael Segel 2012-11-08, 15:03
To go back to the OP's initial question: two new nodes where data hasn't yet been balanced.

First, that's a small window of time.

But to answer your question...

The JT will attempt to schedule tasks where the data is. If you're using 3x replication, each block resides on 3 nodes, so you can calculate the odds that one of those 3 nodes has an open slot and the task runs local to its data.
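As a back-of-the-envelope sketch of those odds (just the combinatorics, not Hadoop code, and assuming open slots are spread uniformly at random, which a real scheduler does not guarantee): if k of the cluster's N nodes currently have an open slot and a block lives on r replica nodes, the chance that at least one replica node has an open slot is 1 - C(N-r, k)/C(N, k). The class and method names here are made up for illustration:

```java
public class LocalityOdds {
    // C(n, k) computed in doubles; fine for realistic cluster sizes.
    static double choose(int n, int k) {
        if (k < 0 || k > n) return 0;
        double c = 1;
        for (int i = 1; i <= k; i++) {
            c = c * (n - k + i) / i;
        }
        return c;
    }

    // Probability that at least one of the r replica-holding nodes
    // is among the k nodes that currently have an open slot,
    // assuming those k nodes are chosen uniformly at random.
    static double localityOdds(int nodes, int replicas, int openSlots) {
        return 1.0 - choose(nodes - replicas, openSlots) / choose(nodes, openSlots);
    }

    public static void main(String[] args) {
        // Example: 20-node cluster, 3x replication, 5 nodes with a free slot.
        System.out.println(localityOdds(20, 3, 5));
    }
}
```

With only one free slot on a 10-node cluster the odds come out to exactly r/N = 0.3, which matches the intuition that a single random free slot is local with probability 3/10.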

However, if there is an open slot on a node where the data is not located, you will still process the data in that open slot. You lose data locality, and that smaller chunk of data is shipped to and processed on that node. If you look at the job tracker web page for your completed job, you will see what percentage of the map tasks ran data-local. Hadoop is pretty good in that respect.
If you know that the processing time is a couple of orders of magnitude longer than the time it takes to ship the data to a node, you can override the normal behavior and force the processing to be done remotely. (We've done this, and there is a paper on it on InfoQ.) [We were bored and didn't like the fact that our Ganglia maps were not all red. We are evil in that way ;-) ] We really don't recommend doing this unless you are either insane or really know what you are doing.



On Nov 8, 2012, at 8:49 AM, Jay Vyas <[EMAIL PROTECTED]> wrote:

> Hmm this is interesting.  I think that:
> 1) For the map phase, Hadoop is smart enough to try to run mappers locally, but I think you could force these DNs to actively participate in a map job by decreasing the input split size. That yields many more mappers, some of which would be forced to run against blocks that are not local; in this scenario, the new DNs don't yet hold any of the input blocks.
> 2) For the reduce phase: since the reducers copy mapper outputs from all over the cluster, one would expect that your new data nodes would naturally take part in this portion of the job if the number of reducers was set high enough.
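The two knobs above can be sketched as a per-job configuration using the classic MR1-era property names that were current when this thread was written (`mapred.max.split.size` caps the input split size so more mappers are created; `mapred.reduce.tasks` sets the reducer count). The values here are illustrative, not recommendations:

```xml
<!-- Per-job configuration fragment, classic MR1 property names. -->
<configuration>
  <!-- Cap splits at 32 MB so, e.g., a 1 GB input yields ~32 mappers
       instead of ~8 with a 128 MB block size; some of the extra
       mappers will land on non-local slots. -->
  <property>
    <name>mapred.max.split.size</name>
    <value>33554432</value>
  </property>
  <!-- Run enough reducers that new nodes take part in the shuffle. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
</configuration>
```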
> On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <[EMAIL PROTECTED]> wrote:
> Hadoopers,
> “Hadoop ships the code to the data instead of sending the data to the code.”
> Say you added two DNs/TTs to the cluster. They have no data at this point, i.e. you have not run the balancer.
> In view of the quoted statement above, will these two nodes not participate in the MapReduce job until you have balanced some data onto them? Please kindly elaborate.
> Rgds,
> AK47
> --
> Jay Vyas
> http://jayunit100.blogspot.com