Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
Did you increase the open-file limit (nofile) in
/etc/security/limits.conf on your systems?
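
A rough way to check this on a node is sketched below in Python. It lists the nofile lines found in limits.conf-style text and prints the limit that actually applies to the current process, since limits.conf is only read at login and an already-running daemon may still have the old value. The helper names and the sample user and value in the comment are illustrative assumptions, not settings taken from this cluster.

```python
import resource


def nofile_entries(limits_text):
    """Yield (domain, type, value) for each nofile line in limits.conf-style text."""
    for line in limits_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        # limits.conf lines have four fields: <domain> <type> <item> <value>
        # e.g. "hbase - nofile 32768" (user and value are only an illustration)
        if len(fields) == 4 and fields[2] == "nofile":
            yield fields[0], fields[1], fields[3]


def effective_nofile():
    """Return the (soft, hard) open-file limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    try:
        with open("/etc/security/limits.conf") as fh:
            for domain, kind, value in nofile_entries(fh.read()):
                print(f"limits.conf: {domain} {kind} nofile {value}")
    except OSError as exc:
        print(f"could not read limits.conf: {exc}")
    soft, hard = effective_nofile()
    print(f"effective nofile for this process: soft={soft} hard={hard}")
```

To be meaningful, this would need to run as the same user that runs the Region Server and DataNode processes.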

On 02/09/2013 09:17 PM, David Koch wrote:
> Hello,
>
> Thank you for your reply, I checked the HDFS log for error messages that
> are indicative of "xciever" problems but could not find any. The settings
> suggested here:
> http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ have been
> applied on our cluster.
>
> I did a grep "File does not exist: /hbase/<table_name>/"
> /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-NAMENODE-big* | wc
>
> on the namenode logs and there are millions of such lines for one table only.
> The count is 0 for all other tables, even though they may be reported as
> inconsistent by hbck.
>
> This seems less like a performance issue and more like a stale
> "where to find what data" problem, possibly related to Zookeeper. I
> remember there being some kind of procedure for clearing ZK, but I
> cannot recall the steps involved.
>
> Any further help would be appreciated,
>
> Thanks,
>
> /David
>
> On Sun, Feb 10, 2013 at 2:24 AM, Dhaval Shah <[EMAIL PROTECTED]> wrote:
>
>> Looking at your error messages, it seems you need to increase the limit on
>> the number of xceivers in the HDFS config.
>>
>>
>> ------------------------------
>> On Sun 10 Feb, 2013 6:37 AM IST David Koch wrote:
>>
>>> Hello,
>>>
>>> Lately, we have been having issues with Region Servers crashing in
>>> our cluster. This happens in particular while running Map/Reduce jobs over
>>> HBase tables, but also spontaneously when the cluster is seemingly idle.
>>>
>>> Restarting the Region Servers or even HBase entirely as well as HDFS and
>>> Map/Reduce services does not fix the problem and jobs will fail during the
>>> next attempt citing "Region not served" exceptions. It is not always the
>>> same nodes that crash.
>>>
>>> The log data during the minutes leading up to the crash contain many "File
>>> does not exist /hbase/<table_name>/..." error messages, which change to "Too
>>> many open files" messages. Finally, there are a few "Failed to renew lease
>>> for DFSClient" messages, followed by several "FATAL" messages about HLog not
>>> being able to sync and, immediately afterwards, a terminal "ABORTING region
>>> server".
>>>
>>> You can find an extract of a Region Server log here:
>>> http://pastebin.com/G39LQyQT.
>>>
>>> Running "hbase hbck" reveals inconsistencies in some tables, but attempting
>>> a repair with "hbase hbck -repair" stalls due to some regions being in
>>> transition, see here: http://pastebin.com/JAbcQ4cc.
>>>
>>> The setup contains 30 machines with 26GB RAM each. The services are managed
>>> using CDH4, so the HBase version is 0.92.x. We did not tweak any of the
>>> default configuration settings; however, table scans are done with sensible
>>> scan/batch/filter settings.
>>>
>>> Data intake is about 100GB/day, added at a time when no Map/Reduce
>>> jobs are running. Tables have between 100 * 10^6 and 2 * 10^9 rows, with an
>>> average of 10 KVs of about 1 KB each. Very few rows exceed 10^6 KVs.
>>>
>>> What can we do to fix these issues? Are they symptomatic of a misconfigured
>>> setup or of some critical threshold level being reached? The cluster used to
>>> be stable.
>>>
>>> Thank you,
>>>
>>> /David
>>
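
The grep | wc check quoted above gives one total per table name. A small Python sketch like the following could break the "File does not exist" lines down by table in one pass over the namenode logs. The regular expression assumes the message has the form used in that grep pattern, and the function name is made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumes lines containing "File does not exist: /hbase/<table_name>/..."
MISSING = re.compile(r"File does not exist: /hbase/([^/\s]+)/")


def count_missing_by_table(lines):
    """Count 'File does not exist' log lines per HBase table name."""
    counts = Counter()
    for line in lines:
        match = MISSING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Log file paths are passed as command-line arguments.
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            totals.update(count_missing_by_table(fh))
    for table, n in totals.most_common():
        print(f"{n}\t{table}")
```

A count concentrated on a single table, as reported above, would point at that table's regions or store files rather than at a cluster-wide limit.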

--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>