Ted Dunning 2013-03-25, 07:26
I would agree with David that this is not normally a good idea.
There are situations, however, where you do need to control the location of
data and where the computation occurs. These requirements normally only
come up in real-time or low-latency situations.
Ordinary Hadoop does not address those needs, by design. This allows
Hadoop to have a much simpler implementation and to handle a varied,
batch-oriented workload with pretty high efficiency.
If you really need to handle real-time file updates and access and to
control file locations, then you need to look beyond Hadoop to extensions
such as MapR, which allow this control and have the required real-time file
On Mon, Mar 25, 2013 at 8:16 AM, David Parks <[EMAIL PROTECTED]> wrote:
> Can I suggest an answer of "Yes, but you probably don't want to"?
> As a "typical user" of Hadoop you would not do this. Hadoop already chooses
> the best server to do the work based on the location of the data (a server
> that is available to do work and also has the data locally will generally
> be assigned to do that work). There are a couple of mechanisms by which you
> can influence this. I'm not terribly familiar with either, so I'll just
> provide a brief introduction and you can research more deeply and ask more
> pointed questions.
> I believe there is some ability to "suggest" a good location to run a
> particular task in the InputFormat. If you extended, say,
> FileInputFormat, you could inject some kind of recommendation, but it
> wouldn't force Hadoop to do one thing or another; it would just be a
> The next place I'd look is at the scheduler, but you're gonna really get
> your hands dirty by digging in here and I doubt, from the tone of your
> email, that you'll have interest in digging to this level.
> But mostly, I would suggest you explain your use case more thoroughly and I
> bet you'll just be directed down a more logical path to accomplish your
> -----Original Message-----
> From: Fan Bai [mailto:[EMAIL PROTECTED]]
> Sent: Monday, March 25, 2013 5:24 AM
> To: [EMAIL PROTECTED]
> Dear Sir,
> I have a question about Hadoop. When I use Hadoop and MapReduce to run a
> job (only one job here), can I control which node each file is processed on?
> For example, I have one job with 10 files (so 10 mappers need to run).
> My cluster has one head node and four worker nodes. My question is: can I
> control which node each of those 10 files is processed on? For example,
> file No. 1 on node1, file No. 3 on node2, file No. 5 on node3,
> and file No. 8 on node4.
> If I can do this, that means I can control the task. Does that mean I can
> also control where the file goes in the next round? (I have a loop on the
> head node, so I can run another MapReduce job.) For example, could I have
> file No. 5 processed on node3 in the 1st round and on node2 in the 2nd round?
> If I cannot, does that mean that, in Hadoop, the choice of node for each
> file is a "black box"? That is, the user cannot control which node a file
> is processed on, because the user is assumed not to need that control and
> should just let HDFS handle the parallel work.
> Therefore, Hadoop cannot control the tasks within one job, but can control
> multiple jobs.
> Thank you so much!
> Fan Bai
> PhD Candidate
> Computer Science Department
> Georgia State University
> Atlanta, GA 30303
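
To illustrate David's point that a location recommendation is only a hint, here is a minimal sketch in plain Java. It is a toy model and not Hadoop's scheduler: the class LocalityHintDemo, the Split type, and the assign method are all hypothetical names made up for this example. In Hadoop itself, the corresponding hint is the list of hosts an InputSplit reports from getLocations(). The sketch assumes each node has a fixed number of free slots and that the scheduler prefers a node holding the data but falls back to any free node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Hadoop's scheduler and none of these names are Hadoop APIs.
final class LocalityHintDemo {

    // A unit of work plus the nodes that hold its data (the location "hint").
    static final class Split {
        final String file;
        final List<String> preferredHosts;

        Split(String file, String... preferredHosts) {
            this.file = file;
            this.preferredHosts = Arrays.asList(preferredHosts);
        }
    }

    // Assign each split to a node. A preferred node with a free slot wins;
    // otherwise the first node with a free slot is used, so the hint is not binding.
    static Map<String, String> assign(List<Split> splits, Map<String, Integer> freeSlots) {
        Map<String, Integer> slots = new LinkedHashMap<>(freeSlots); // leave the caller's map untouched
        Map<String, String> placement = new LinkedHashMap<>();
        for (Split split : splits) {
            String chosen = null;
            for (String host : split.preferredHosts) {
                if (slots.getOrDefault(host, 0) > 0) {
                    chosen = host;
                    break;
                }
            }
            if (chosen == null) {
                for (Map.Entry<String, Integer> entry : slots.entrySet()) {
                    if (entry.getValue() > 0) {
                        chosen = entry.getKey();
                        break;
                    }
                }
            }
            if (chosen == null) {
                // Simplification: a real scheduler would queue the task instead of failing.
                throw new IllegalStateException("no free slot for " + split.file);
            }
            slots.merge(chosen, -1, Integer::sum);
            placement.put(split.file, chosen);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, Integer> freeSlots = new LinkedHashMap<>();
        freeSlots.put("node1", 1);
        freeSlots.put("node2", 1);

        List<Split> splits = new ArrayList<>();
        splits.add(new Split("file1", "node1"));
        splits.add(new Split("file3", "node1")); // node1 will already be full for this one

        System.out.println(assign(splits, freeSlots)); // prints {file1=node1, file3=node2}
    }
}
```

Both files ask for node1, but only one slot is free there, so the second file lands on node2. That is the behaviour described above: the recommendation biases placement without forcing it.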