Hadoop, mail # user - Socket closed Exception

Re: Socket closed Exception
lohit 2009-03-30, 15:50

Thanks Koji.
Looking at the code, the NameNode (RPC server) seems to tear down idle connections. Did you see a 'Socket closed' exception instead of 'timed out waiting for socket'?
We seem to hit the 'Socket closed' exception: clients do not time out, but get a 'Socket closed' exception back when they make RPC calls for create/open/getFileInfo.

I will give this a try. Thanks again,

From: Koji Noguchi <[EMAIL PROTECTED]>
Sent: Sunday, March 29, 2009 11:44:29 PM

Hi Lohit,

My initial guess would be

When this happened on our 0.17 cluster, all of our (task) clients were
using the max idle time of 1 hour due to this bug instead of the
configured value of a few seconds.
Thus each client kept the connection up much longer than we expected.
(Not sure if this applies to your 0.15 cluster, but it sounds similar to
what we observed.)

This worked until namenode started hitting the max limit of '

  <description>Defines the threshold number of connections after which
               connections will be inspected for idleness.

When inspecting for idleness, namenode uses

  <description>Defines the maximum idle time for a connected client
               after which it may be disconnected.

As a result, many connections got disconnected at once.
Clients only see the timeouts when they next try to re-use those sockets
and wait for 1 minute.  That's why the failures do not occur at exactly the
same time, but at *almost* the same time.
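
The behaviour described above can be sketched as a small simulation. This is only a simplified model, not Hadoop's actual RPC server code: the names `Connection`, `sweep_idle`, `idle_threshold` and `max_idle_time` are hypothetical, and it assumes the server inspects connections only once their count exceeds the threshold, then closes every connection that has been idle longer than the maximum idle time.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    client: str          # hypothetical client label
    last_contact: float  # time of the last RPC on this connection, in seconds


def sweep_idle(connections, now, idle_threshold, max_idle_time):
    """Return (kept, closed) after one idle sweep.

    Assumption: no inspection happens while the number of connections
    is at or below idle_threshold.
    """
    if len(connections) <= idle_threshold:
        return list(connections), []
    kept, closed = [], []
    for conn in connections:
        if now - conn.last_contact > max_idle_time:
            closed.append(conn)  # idle too long: server drops it
        else:
            kept.append(conn)
    return kept, closed


# Five clients that have all been idle for 60 seconds.
conns = [Connection(f"task-{i}", last_contact=0.0) for i in range(5)]

# Below the threshold nothing is inspected, so idle connections survive.
kept, closed = sweep_idle(conns, now=60.0, idle_threshold=10, max_idle_time=10.0)

# Once the threshold is exceeded, all idle connections are closed in one sweep.
kept2, closed2 = sweep_idle(conns, now=60.0, idle_threshold=4, max_idle_time=10.0)
```

In this model the idle connections pile up unnoticed until the threshold is crossed, and then they are all dropped in the same sweep, which matches the pattern of many clients failing at nearly the same time.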
# If this solves your problem, Raghu should get the credit.
  He spent so many hours solving this mystery for us. :)

From: lohit [mailto:[EMAIL PROTECTED]]
Sent: Sunday, March 29, 2009 11:56 AM

Recently we have been seeing a lot of 'Socket closed' exceptions in our cluster.
Many tasks' open/create/getFileInfo calls get back a 'SocketException'
with the message 'Socket closed'. Many tasks seem to fail with the same
error at around the same time. There are no warning or info messages in the
NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.) Are there cases
where the NameNode closes sockets due to heavy load or during contention
for resources of any kind?