Can I suggest an answer of "Yes, but you probably don't want to"?

As a "typical user" of Hadoop you would not do this. Hadoop already chooses
the best server to do the work based on the location of the data (a server
that is available to do work and also has the data locally will generally be
assigned to do that work). There are a couple of mechanisms for which you
can do this. Neither of which I'm terribly familiar with so I'll just
provide a brief introduction and you can research more deeply and ask more
pointed questions.

I believe there is some ability to "suggest" a good location to run a
particular task via the InputFormat; if you extended, say, FileInputFormat
you could inject some kind of recommendation, but it wouldn't force Hadoop
to do one thing or another, it would just be a recommendation. A rough
sketch of the idea follows below.
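
To make that concrete, here is a minimal sketch (new MapReduce API) of what such a hint might look like. The class name, the chooseHostsFor() helper and the host name are placeholders I made up for illustration; the real mechanism is only that the hosts carried by each split are used by the framework as locality hints when it schedules the map tasks.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class LocationHintInputFormat extends TextInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> hinted = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      // Re-wrap each split with the host(s) we would prefer it to run on.
      // chooseHostsFor() is a hypothetical helper; Hadoop treats the result
      // only as a locality hint, not as a placement guarantee.
      String[] preferredHosts = chooseHostsFor(fs.getPath());
      hinted.add(new FileSplit(fs.getPath(), fs.getStart(), fs.getLength(), preferredHosts));
    }
    return hinted;
  }

  private String[] chooseHostsFor(Path path) {
    // Placeholder: map each file to whichever node you want to favor.
    return new String[] { "node1.example.com" };
  }
}

You would then set it on the job with job.setInputFormatClass(LocationHintInputFormat.class). Again, this only biases scheduling; it does not pin a file to a node.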

The next place I'd look is at the scheduler, but you're gonna really get
your hands dirty by digging in here and I doubt, from the tone of your
email, that you'll have interest in digging to this level.

But mostly, I would suggest you explain your use case more thoroughly and I
bet you'll just be directed down a more logical path to accomplish your
goals.

David
-----Original Message-----
From: Fan Bai [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 25, 2013 5:24 AM
To: [EMAIL PROTECTED]
Subject:
Dear Sir,

I have a question about Hadoop: when I use Hadoop and MapReduce to run a
job (only one job here), can I control which node each file is processed on?

For example, I have only one job, and this job has 10 files (so 10 mappers
need to run). On my servers, I have one head node and four worker nodes. My
question is: can I control which node each of those 10 files is processed on?
For example: file No. 1 on node1, file No. 3 on node2, file No. 5 on node3,
and file No. 8 on node4.

If I can do this, that means I can control the tasks. Does that also mean I
can control where a file is processed in the next round (I have a loop on
the head node, so I can run another MapReduce job)? For example, I could
have file No. 5 processed on node3 in the 1st round and on node2 in the 2nd
round.

If I cannot, does that mean that for Hadoop the assignment of files to nodes
is like a "black box": the user cannot control which node a file is processed
on, because the user is not expected to need that control and should just let
HDFS handle the parallel work?
In that case, Hadoop would not let me control the tasks within one job, but
only control multiple jobs.

Thank you so much!

Fan Bai
PhD Candidate
Computer Science Department
Georgia State University
Atlanta, GA 30303