Hadoop, mail # user - Problem with cluster


Re: Problem with cluster
Ravi Prakash 2012-05-03, 23:07
Hi Pat,

0.20.205 is the stable version before 1.0, and 1.0 is not substantially different
from 0.20. Any reason you don't want to use it?

I don't think "occasional HDFS corruption" is a known issue. That would be,
umm... let's just say, pretty severe. Are you sure you've configured it
properly?
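If you want to double-check the filesystem's health, fsck will report missing or
corrupt blocks. Something along these lines (exact flags may differ slightly by
version):

   hadoop fsck / -files -blocks

If that comes back healthy, the "corruption" is probably a configuration problem
rather than real block corruption.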

Your task is killing the Hadoop daemons? :-o You might want to check with the
developers of Mahout / Bixo whether that is a known issue. Obviously it should
not happen. Hadoop daemons are known to be quite long-lived (many months at
least), and there are ways you can set up security to prevent tasks from doing
that (but since you only have 2 nodes, maybe you don't want to invest in that).
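For reference, the security piece I mean is the LinuxTaskController, which runs
tasks as the submitting user instead of the daemon user. A rough sketch, assuming
the 1.0-style property name (check the security docs for your exact version):

   <!-- mapred-site.xml -->
   <property>
     <name>mapred.task.tracker.task-controller</name>
     <value>org.apache.hadoop.mapred.LinuxTaskController</value>
   </property>

It also needs the setuid task-controller binary and a taskcontroller.cfg, so it
is a fair bit of setup for a two-node cluster.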

The message is logged when the DN is trying to shut down but cannot, because it
is still waiting on some (apparently one) thread to exit.
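If you want to see which thread it is stuck waiting on, a thread dump is the
easiest way. As a sketch (assuming jps/jstack from the same JDK are on your
path):

   jps | grep DataNode    # find the DataNode pid
   jstack <pid>           # dump all threads and look for the one still running

A "kill -QUIT <pid>" would also dump the stacks into the DataNode's .out file.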

HTH
Ravi

On Thu, May 3, 2012 at 12:09 PM, Pat Ferrel <[EMAIL PROTECTED]> wrote:

> I'm trying to use a small cluster to make sure I understand the setup and
> have my code running before going to a big cluster. I have two machines.
> I've followed the tutorial here:
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
> I have been using 0.20.203 -- is this the most stable version of pre-1.0
> code?
>
> The cluster seemed fine for some time except for the occasional HDFS
> corruption, a known issue. I have mostly run Mahout code unaltered with
> success.
>
> However, I am now getting some consistent errors with Mahout and Bixo (which
> I only recently started using). When I start a job from the master, say a
> command-line Mahout job, the slave dies pretty quickly. It looks like spawned
> threads never complete and they kill the slave. Hadoop may or may not
> recover, depending on what it is doing.
>
> In any case, when I go to the slave and run ps -e, I get a huge list of
>
>   "fuser <defunct>" with a long list of pids.
>
>
> The datanode logs on the slave have this warning:
>
>   pat@occam:~$ tail -f hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
>   2012-05-03 08:39:39,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
>   2012-05-03 08:39:40,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
>   2012-05-03 08:39:41,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
>   2012-05-03 08:39:42,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
>   etc....
>
> So far I have removed the slave from the master's config and set
> replication to 1, and everything works, just slower.
>
> Any ideas? And should I upgrade to a newer version?
>