-
What's the basic idea of pseudo-distributed Hadoop ?
Jason Yang 2012-09-14, 06:03
Hi, all
I have a question about how a pseudo-distributed Hadoop cluster works:
When many map tasks are submitted to a pseudo-distributed Hadoop cluster, does Hadoop run the mappers one after another, or does it run them in separate threads or in some other parallel fashion?
-- YANG, Lin
-
Re: What's the basic idea of pseudo-distributed Hadoop ?
Kai Voigt 2012-09-14, 06:08
Hello.
Pseudo-distributed mode is a one-node cluster. A namenode, a jobtracker, a single datanode, and a single tasktracker all run on the same machine. You can verify this with the "jps" command.
By default, a tasktracker can run up to two map tasks and two reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum), so you will actually see some concurrency on your one machine.
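To make the slot limit concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop code: a fixed-size thread pool stands in for the tasktracker's map slots, whereas a real tasktracker launches tasks as separate processes. The class and method names (SlotSimulation, peakConcurrency) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: a fixed-size thread pool stands in for a tasktracker's map slots.
// SlotSimulation and peakConcurrency are hypothetical names, not Hadoop APIs.
class SlotSimulation {

    // Runs `tasks` dummy map tasks on `slots` workers and returns the highest
    // number of tasks that were observed running at the same moment.
    static int peakConcurrency(int tasks, int slots) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(50); // pretend to process an input split
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // Six queued tasks, two slots: the peak never exceeds the slot count.
        System.out.println("peak concurrent tasks: " + peakConcurrency(6, 2));
    }
}
```

With six tasks and two slots, the remaining tasks wait in the queue until a slot frees up, which is roughly the behavior the two settings above control.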
Kai
-- Kai Voigt [EMAIL PROTECTED]
-
Re: What's the basic idea of pseudo-distributed Hadoop ?
Jason Yang 2012-09-14, 06:28
Hey, Kai
Thanks for your reply.
I was wondering what the difference is between pseudo-distributed and fully-distributed Hadoop, apart from the maximum number of map/reduce tasks.
And if an MR program works fine on a pseudo-distributed cluster, will it work just as well on a fully-distributed cluster?
-- YANG, Lin
-
Re: What's the basic idea of pseudo-distributed Hadoop ?
Harsh J 2012-09-14, 07:24
Hi Jason,
I think you're confusing standalone mode with pseudo-distributed mode. The former is a limited mode of MR where no daemons need to be deployed and the tasks run in a single JVM (via threads).
A pseudo-distributed cluster is one where all daemons run on a single node. Hence it is not "distributed" in the multi-node sense (no network gear is involved), but the daemons communicate in the same way (RPC, etc.) as they would between nodes in a fully-distributed cluster.
If an MR program works fine in pseudo-distributed mode, it "should" (no guarantee) work fine in fully-distributed mode, provided all nodes have the same arch/OS, the same JVM, and the same job-specific configurations. This is because tasks execute on various nodes and may be affected by a node whose behavior or setup differs from the others, and that's something you'd have to detect or know about if that node exhibits more failures than the rest.
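One way to spot such differences is to log a small platform fingerprint on each node and compare the output. The following is only a sketch using standard JVM system properties; the class name NodeFingerprint is made up, and it does not cover job-specific configuration.

```java
// Hypothetical helper: prints the platform details that should match across nodes.
class NodeFingerprint {

    // Builds a one-line summary from standard JVM system properties.
    static String describe() {
        return String.join(" | ",
                "os.name=" + System.getProperty("os.name"),
                "os.arch=" + System.getProperty("os.arch"),
                "java.version=" + System.getProperty("java.version"),
                "java.vendor=" + System.getProperty("java.vendor"));
    }

    public static void main(String[] args) {
        // Run this on every node (or log it from a task) and diff the lines.
        System.out.println(describe());
    }
}
```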
-- Harsh J
-
Re: What's the basic idea of pseudo-distributed Hadoop ?
Bertrand Dechoux 2012-09-14, 07:31
The only difference between pseudo-distributed and fully distributed would be scale. You could say that code that runs fine on the former also runs fine on the latter. But that does not necessarily mean the performance will scale the same way (e.g. if you keep a list of elements in memory, at a bigger scale you could hit an OutOfMemoryError).
Of course, as previous answers implied, you can't say the same about standalone mode. In that mode, you could use global mutable static state and think it's fine, without caring about distribution between the nodes. In that case, the same code launched in pseudo-distributed mode will fail to reproduce the same results.
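The static-state pitfall can be sketched in plain Java. This is a simulation under stated assumptions, not Hadoop code: "each task gets a fresh JVM" is modeled simply by resetting the static field before each task, and all names (StaticStateDemo, recordsSeen, runTask) are invented for the example.

```java
import java.util.Arrays;

// Simulation only: a static field plays the role of global mutable state,
// and a reset before each task stands in for starting a fresh task JVM.
class StaticStateDemo {

    // Global mutable static state: exactly one copy per JVM.
    static int recordsSeen = 0;

    // Stand-in for a map task that counts its records in the static field.
    static void runTask(int records) {
        recordsSeen += records;
    }

    // Standalone-like: every task runs in the same JVM and shares the field.
    static int sharedJvmTotal(int[] splits) {
        recordsSeen = 0;
        for (int records : splits) {
            runTask(records);
        }
        return recordsSeen;
    }

    // Distributed-like: each task starts with fresh static state, so it only
    // ever sees its own records. Returns the value each task ends up with.
    static int[] perTaskJvmTotals(int[] splits) {
        int[] seen = new int[splits.length];
        for (int i = 0; i < splits.length; i++) {
            recordsSeen = 0; // models a newly started JVM
            runTask(splits[i]);
            seen[i] = recordsSeen;
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] splits = {3, 4, 5};
        System.out.println("shared JVM total: " + sharedJvmTotal(splits));       // prints 12
        System.out.println("per-task JVM totals: "
                + Arrays.toString(perTaskJvmTotals(splits)));                     // prints [3, 4, 5]
    }
}
```

In the shared case the counter reaches the grand total; in the per-task case no single task ever sees it, which is why such results need to flow through the framework (for example counters or job output) rather than through static fields.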
Regards
Bertrand
-- Bertrand Dechoux
-
Re: What's the basic idea of pseudo-distributed Hadoop ?
Jason Yang 2012-09-14, 07:34
All right, I got it.
Thanks to all of you.
-- YANG, Lin
-
Re: What's the basic idea of pseudo-distributed Hadoop ?
Hemanth Yamijala 2012-09-14, 08:21
One thing to be careful about is the paths of dependent libraries or executables, such as streaming binaries. In pseudo-distributed mode, since all processes run on the same machine, they are likely to find paths that really exist only on the machine the job is launched from. When you start running the same jobs in a truly distributed environment, they will start failing unless these files are packaged and distributed to the cluster in some way.
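A cheap safeguard is to check such a path at task start-up and fail with a clear message instead of a confusing error later. The sketch below uses only java.nio.file; the class name DependencyCheck and the example path are hypothetical, and the check does not replace shipping the file with the job (for example through the distributed cache).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: verifies that a local dependency is usable on this node.
class DependencyCheck {

    // Returns a description of the problem, or null if the path is usable.
    static String check(String path) {
        Path p = Paths.get(path);
        if (!Files.exists(p)) {
            return "missing on this node: " + p.toAbsolutePath();
        }
        if (!Files.isReadable(p)) {
            return "not readable on this node: " + p.toAbsolutePath();
        }
        return null;
    }

    public static void main(String[] args) {
        // The default path is a made-up example; pass a real one as an argument.
        String path = args.length > 0 ? args[0] : "/opt/tools/hypothetical-mapper.bin";
        String problem = check(path);
        System.out.println(problem == null ? "ok: " + path : problem);
    }
}
```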
Thanks,
Hemanth