MapReduce, mail # user - What's the basic idea of pseudo-distributed Hadoop ?

Re: What's the basic idea of pseudo-distributed Hadoop ?
Bertrand Dechoux 2012-09-14, 07:31
The only difference between pseudo-distributed and fully distributed mode is
scale. Code that runs correctly on the former will generally run correctly
on the latter too. But that does not necessarily mean the performance will
scale the same way (e.g. if you keep a list of elements in memory, at bigger
scale you could hit an OutOfMemoryError).
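
To make the memory point concrete, here is a rough sketch in plain Python
(not Hadoop code; the function names are made up for illustration). A
reduce-style function that buffers every value needs memory proportional to
its input, while one that keeps running totals does not. Both return the
same result on a small input, which is why the problem only appears at scale.

```python
from typing import Iterable


def mean_buffered(values: Iterable[float]) -> float:
    """Collects every value first: memory grows with the number of values."""
    buffered = list(values)
    return sum(buffered) / len(buffered)


def mean_streaming(values: Iterable[float]) -> float:
    """Keeps only a running total and a count: memory stays constant."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    return total / count


if __name__ == "__main__":
    small_input = range(1, 1001)
    # Same answer on a small input; only the memory footprint differs.
    print(mean_buffered(small_input))
    print(mean_streaming(small_input))
```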

Of course, as implied in previous answers, you can't say the
same of standalone mode. In that mode, you could rely on global mutable
static state and think it's fine, without caring about distribution between
the nodes. In that case, the same code launched in pseudo-distributed mode
will fail to reproduce the same results.
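
A toy model of that failure, in plain Python rather than Hadoop (all names
here are hypothetical): threads inside one process stand in for standalone
mode's single JVM, and separate Python processes stand in for tasks that
each run in their own process. The global list plays the role of a Java
static field.

```python
import subprocess
import sys
import threading

seen_records = []          # global mutable state, like a Java static field
_lock = threading.Lock()


def map_task(records):
    """Hypothetical task that leaks its records into global state."""
    with _lock:
        seen_records.extend(records)


def run_in_one_process(splits):
    """All tasks are threads in one process, so they share seen_records."""
    seen_records.clear()
    threads = [threading.Thread(target=map_task, args=(s,)) for s in splits]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(seen_records)


# Source of a child task: it has its own, initially empty, global list.
CHILD_SOURCE = """
import sys
seen_records = []
seen_records.extend(sys.argv[1:])
print(len(seen_records))
"""


def run_in_separate_processes(splits):
    """Each task is its own process and only ever sees its own records."""
    counts = []
    for split in splits:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE, *split],
            capture_output=True, text=True, check=True,
        )
        counts.append(int(result.stdout))
    return counts


if __name__ == "__main__":
    splits = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(run_in_one_process(splits))         # one shared total
    print(run_in_separate_processes(splits))  # one partial count per task
```

With shared state the single process sees all six records; with separate
processes no task ever sees more than its own split, so any logic that
relied on the global total silently changes its result.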



On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Jason,
> I think you're confusing the standalone mode with a pseudo-distributed
> mode. The former is a limited mode of MR where no daemons need to be
> deployed and the tasks run in a single JVM (via threads).
> A pseudo distributed cluster is a cluster where all daemons are
> running on one node itself. Hence, not "distributed" in the sense of
> multi-nodes (no use of any network gear) but works in the same way
> between nodes (RPC, etc.) as a fully-distributed one.
> If an MR program works fine in a pseudo-distributed mode, it "should"
> work (no guarantee) fine in a fully-distributed mode iff all nodes
> have the same arch/OS, same JVM, and job-specific configurations. This
> is because tasks execute on various nodes and may be affected by the
> node's behavior or setup that is different from others - and that's
> something you'd have to detect/know about if it exhibits failures more
> than others.
> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <[EMAIL PROTECTED]>
> wrote:
> > 2012/9/14 Kai Voigt <[EMAIL PROTECTED]>
> >> The default setting is that a tasktracker can run up to two map and two reduce
> >> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
> >> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
> >> concurrency on your one machine.
> >
Bertrand Dechoux
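
The slot limits quoted above (mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum) can be pictured with a small Python
sketch. This is only a toy model of "at most N tasks at once" using a thread
pool, not how the tasktracker is implemented; run_tasks and MAP_SLOTS are
names invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAP_SLOTS = 2  # mirrors the default of two map slots mentioned in the thread


def run_tasks(num_tasks: int, slots: int) -> int:
    """Runs dummy tasks on `slots` workers and returns the peak concurrency.

    num_tasks must be a multiple of slots, because each task waits at a
    barrier until `slots` tasks are running together.
    """
    lock = threading.Lock()
    barrier = threading.Barrier(slots)
    active = 0
    peak = 0

    def task(_):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        # Hold the slot until every slot is occupied, so the peak is stable.
        barrier.wait(timeout=5)
        with lock:
            active -= 1

    with ThreadPoolExecutor(max_workers=slots) as pool:
        list(pool.map(task, range(num_tasks)))
    return peak


if __name__ == "__main__":
    # Four tasks, two slots: never more than two tasks run at the same time.
    print(run_tasks(4, MAP_SLOTS))
```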