Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
Dhaval Shah 2013-02-10, 01:24
Looking at your error messages, it seems you need to increase the limit on the number of xceivers in the HDFS config.
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
David Koch 2013-02-10, 02:17
Hello,

Thank you for your reply. I checked the HDFS log for error messages indicative of "xciever" problems but could not find any. The settings suggested here: http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ have been applied on our cluster.

I ran

grep "File does not exist: /hbase/<table_name>/" /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-NAMENODE-big* | wc

on the namenode logs and there are millions of such lines for one table only. The count is 0 for all other tables, even though they may be reported as inconsistent by hbck.

It seems like this is less a performance issue than some stale "where to find what data" problem, possibly related to ZooKeeper? I remember there being some kind of procedure for clearing ZK, though I cannot recall the steps involved.

Any further help would be appreciated.

Thanks,

/David
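The per-table count from the grep above can also be collected in a single pass over the log. This is only a minimal sketch: it assumes the namenode log lines contain the literal text `File does not exist: /hbase/<table_name>/...` as in the grep pattern, and the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Captures the table name in messages such as
# "File does not exist: /hbase/<table_name>/<region>/...".
# The message format is taken from the grep pattern in the post;
# real namenode log lines may look different.
PATTERN = re.compile(r"File does not exist: /hbase/([^/\s]+)/")


def count_missing_files_by_table(lines):
    """Return a Counter mapping table name -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines standing in for real namenode log content.
sample_lines = [
    "ERROR File does not exist: /hbase/events/abc123/cf/42",
    "ERROR File does not exist: /hbase/events/def456/cf/43",
    "ERROR File does not exist: /hbase/users/0a1b2c/cf/7",
    "INFO some unrelated message",
]
counts = count_missing_files_by_table(sample_lines)
for table, n in counts.most_common():
    print(f"{n}\t{table}")
```

Feeding it the lines of the real log files (for example via an open file object) would give one count per table instead of one grep run per table.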
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
Marcos Ortiz 2013-02-10, 03:22
Did you increase the number of open files in /etc/security/limits.conf on your system?

--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
David Koch 2013-02-10, 12:51
Yes, the limit is at 65535.

/David
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
shashwat shriparv 2013-02-10, 14:53
Increase the ulimit, at the OS level, for the user that starts the Hadoop and HBase services.

∞
Shashwat Shriparv
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
David Koch 2013-02-10, 20:11
Like I said, the maximum permissible number of file handles is set to 65535 for the users hbase (the one who starts HBase), mapred and hdfs.

The "too many open files" warning occurs on the region servers but not on the HDFS namenode.

/David
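A limit configured per user can differ from the limit a running daemon actually has, for example when the service was not started through a PAM session. On Linux the effective values can be read from `/proc/<pid>/limits`. The sketch below only parses text in that format and counts entries in `/proc/<pid>/fd`; the function names are made up for illustration, the sample values are invented, and nothing here is specific to HBase or CDH.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    Each value is an int, or None where the kernel reports "unlimited".
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Line layout: "Max open files  <soft>  <hard>  files"
            soft, hard = line[len(prefix):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def open_fd_count(pid):
    """Count descriptors currently open in a process (Linux only).

    Reading another user's /proc/<pid>/fd needs sufficient privileges.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


# Sample text in the format of /proc/<pid>/limits (values are made up).
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""
soft_limit, hard_limit = parse_max_open_files(SAMPLE)
print(soft_limit, hard_limit)
```

On a region server one would read the real file for the HBase process ID and compare the soft limit with `open_fd_count(pid)` to see how close the process is to running out of descriptors.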
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
ramkrishna vasudevan 2013-02-11, 03:58
Hi David,

Have you changed anything in the configuration related to compactions?

If more store files are created and compactions do not run frequently, we end up with this problem. At least there will be a steady increase in the file handle count.

Could you run compactions manually to see if it helps?

Regards
Ram
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
David Koch 2013-02-11, 15:24
Hello,

No, we did not change anything, so compactions should run automatically (I guess once a day). However, I don't know to what extent jobs running on the cluster have impeded compactions, if this is even a possibility.

/David
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
ramkrishna vasudevan 2013-02-11, 16:50
From the UI, can you figure out how many store files are present? Also, if you check the logs, they will tell you whether compactions were happening. I may be wrong without checking your cluster; these are just some inputs based on what we faced some time back.

Regards
Ram
-
Re: Region Servers crashing following: "File does not exist", "Too many open files" exceptions
David Koch 2013-02-11, 22:14
Hello,

Thank you for your replies.

In the end we dropped the concerned tables and are in the process of re-importing data. Looking through the mailing list, it seems like this issue [1] may be identical to what we are experiencing.

TL;DR: Region splits fail when there is a lack of disk space, leaving orphan references to non-existent regions which HBase tries to access viciously, exhausting file handles in the process and thereby degrading Region Server performance. There is a JIRA for this [2].

We looked for references to said files and deleted them, but we must have missed something because hbase hbck -repair still stalls. In any case, our bad for letting the cluster get to the point where there was hardly any disk space.

If someone reading this has experienced the same problem but managed to restore order without resorting to drastic measures such as dropping a table, I'd be curious to know about the steps that were taken.

Thank you,

/David

[1] http://mail-archives.apache.org/mod_mbox/hbase-user/201212.mbox/%3CCAO=qdPQ1jJaaXCt2CVpHZev7q-QHR1x4D+[EMAIL PROTECTED]%3E
[2] https://issues.apache.org/jira/browse/HBASE-7335
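To make the idea of an orphan reference concrete, here is a rough sketch of how candidates could be listed from a file listing of a table directory. It rests on assumptions that should be checked against the HBase version in use: that store files live at `/hbase/<table>/<region>/<family>/<file>`, and that a split reference is a file named `<store_file>.<parent_region>` pointing at that store file in the parent region. The function name and sample paths are hypothetical. It only reports candidates and is not a repair tool.

```python
def find_dangling_references(paths):
    """Return reference-file paths whose referenced parent store file is absent.

    `paths` is an iterable of HDFS file paths. A file named
    "<store_file>.<parent_region>" under /hbase/<table>/<region>/<family>/
    is treated as a reference to
    /hbase/<table>/<parent_region>/<family>/<store_file> (assumed layout).
    """
    existing = set(paths)
    dangling = []
    for path in sorted(existing):
        parts = path.split("/")
        # Expect ['', 'hbase', table, region, family, file]
        if len(parts) != 6 or parts[1] != "hbase":
            continue
        table, family, name = parts[2], parts[4], parts[5]
        # Skip hidden directories and files such as .tmp
        if family.startswith(".") or name.startswith("."):
            continue
        store_file, sep, parent_region = name.partition(".")
        if not sep or not parent_region:
            continue  # a plain store file, not a reference
        target = "/".join(["", "hbase", table, parent_region, family, store_file])
        if target not in existing:
            dangling.append(path)
    return dangling


# Made-up listing: one valid reference and one whose parent is missing.
sample_paths = [
    "/hbase/t1/parentA/cf/123",
    "/hbase/t1/daughter1/cf/123.parentA",
    "/hbase/t1/daughter2/cf/456.parentB",
    "/hbase/t1/daughter2/cf/789",
    "/hbase/t1/daughter2/.regioninfo",
]
orphans = find_dangling_references(sample_paths)
print(orphans)
```

The input could come from a recursive listing of the table directory in HDFS. Anything it reports would still need to be confirmed by hand before deleting files under `/hbase`.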
-
Region Servers crashing following: "File does not exist", "Too many open files" exceptions
David Koch 2013-02-10, 01:07
Hello,

Lately we have been having issues with Region Servers crashing in our cluster. This happens in particular while running Map/Reduce jobs over HBase tables, but also spontaneously when the cluster is seemingly idle.

Restarting the Region Servers, or even HBase entirely as well as the HDFS and Map/Reduce services, does not fix the problem, and jobs fail during the next attempt citing "Region not served" exceptions. It is not always the same nodes that crash.

The log data during the minutes leading up to the crash contain many "File does not exist /hbase/<table_name>/..." error messages, which change to "Too many open files" messages. Finally, there are a few "Failed to renew lease for DFSClient" messages, followed by several "FATAL" messages about HLog not being able to sync and, immediately afterwards, a terminal "ABORTING region server".

You can find an extract of a Region Server log here: http://pastebin.com/G39LQyQT.

Running "hbase hbck" reveals inconsistencies in some tables, but attempting a repair with "hbase hbck -repair" stalls due to some regions being in transition, see here: http://pastebin.com/JAbcQ4cc.

The setup contains 30 machines with 26 GB RAM each. The services are managed using CDH4, so the HBase version is 0.92.x. We did not tweak any of the default configuration settings; however, table scans are done with sensible scan/batch/filter settings.

Data intake is about 100 GB/day, added at a time when no Map/Reduce jobs are running. Tables have between 100 * 10^6 and 2 * 10^9 rows, with an average of 10 KVs of about 1 KB each. Very few rows exceed 10^6 KVs.

What can we do to fix these issues? Are they symptomatic of a misconfigured setup or of some critical threshold being reached? The cluster used to be stable.

Thank you,

/David