-
Determining the cause of a tablet server failure
Mike Hugo 2013-02-27, 17:10
After running an ingest process via map reduce for about an hour or so, one of our tservers fails. It happens pretty consistently; we're able to replicate it without too much difficulty.

I'm looking in the $ACCUMULO_HOME/logs directory for clues as to why the tserver fails, but I'm not seeing much that points to a cause of the tserver going offline. One minute it's there, the next it's offline. There are some warnings about the swappiness as well as a large row that cannot be split, but other than that, not much else to go on.

Is there anything that could help me figure out *why* the tserver died? I'm guessing it's something in our client code or a config that's not correct on the server, but it'd be really nice to have a hint before we start randomly changing things to see what will fix it.

Thanks,

Mike
-
Re: Determining the cause of a tablet server failure
John Vines 2013-02-27, 17:12
Check the .out and .err files. Out of Memory exceptions aren't caught by log4j and instead go to those files.
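A quick way to act on this is to search the .out and .err files directly. The sketch below assumes the files sit in the log directory and that the JVM's message contains the string `java.lang.OutOfMemoryError`; the function name `find_oom` and the default directory are made up for illustration.

```shell
# find_oom: list .out/.err files in a log directory that mention an
# OutOfMemoryError. The function name and the default directory are
# illustrative assumptions, not part of Accumulo.
find_oom() {
    log_dir="${1:-${ACCUMULO_HOME:-.}/logs}"
    # -l prints only the names of matching files; -s silences errors
    # for missing files (for example when a glob matches nothing).
    grep -ls 'java.lang.OutOfMemoryError' "$log_dir"/*.out "$log_dir"/*.err
}
```

Calling `find_oom "$ACCUMULO_HOME/logs"` prints nothing and returns a non-zero status when no file mentions the error.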
-
Re: Determining the cause of a tablet server failure
Eric Newton 2013-02-27, 17:15
Also, check the "gc" lines from the debug logs:

$ grep -a 'gc' logs/tserver*.debug.log

They should come about one per second. You may see pauses due to swapping out or low memory.

-Eric
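Spotting a pause by eye is tedious in a long log, so the check can be roughly automated by looking for gaps between consecutive "gc" lines. This is only a sketch: it assumes every line starts with a timestamp of the form `YYYY-MM-DD HH:MM:SS,mmm` (adjust the parsing if your log4j pattern differs), it ignores gaps that span midnight, and the name `gc_gaps` and the threshold are arbitrary choices.

```shell
# gc_gaps MAX_GAP FILE...: print "gc" log lines that arrive more than
# MAX_GAP seconds after the previous "gc" line.
# Assumption: field 2 of each line looks like HH:MM:SS,mmm.
gc_gaps() {
    max_gap="$1"; shift
    grep -ah 'gc' "$@" | awk -v max="$max_gap" '
        {
            # Split HH:MM:SS,mmm on ":" and "," and convert to seconds of day.
            split($2, t, /[:,]/)
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR > 1 && now - prev > max)
                printf "%d second gap before: %s\n", now - prev, $0
            prev = now
        }'
}
```

For example, `gc_gaps 5 logs/tserver*.debug.log` would list every "gc" line that followed a silence of more than five seconds. When several files are passed they are read in glob order, so gaps reported at file boundaries may not be meaningful.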
-
Re: Determining the cause of a tablet server failure
Adam Fuchs 2013-02-27, 17:24
There are a few primary reasons why your tablet server would die:

1. Lost lock in Zookeeper. If the tablet server and zookeeper can't communicate with each other then the lock will time out and the tablet server will kill itself. This should show up as several messages in the tserver log. If this happens when a tablet server is really busy (lots of threads doing stuff) then the log message about the lost lock can be pretty far back in the queue. Java garbage collection can cause long pauses that inhibit the tserver/zookeeper messages. Zookeeper can also get overwhelmed and behave poorly if the server it's running on swaps it out.
2. Problems talking with the master. If a tablet server is too slow in communicating with the master then the master will try to kill it. This should show up in the master log, and also will be noted in the tserver log.
3. Out of memory. If the tserver JVM runs out of memory it will terminate. As John mentioned, this will be in the .err or .out files in the log directory.

Adam
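The three causes above map onto three places to look, which can be sketched as a small triage script. The search patterns (`lost.*lock`, `halt`, `OutOfMemoryError`) and the file globs are guesses at what the relevant messages and file names look like, not confirmed Accumulo log text, and the function names are hypothetical; substitute whatever wording your version actually logs.

```shell
# report LABEL PATTERN FILE...: say whether any FILE matches PATTERN
# (case-insensitive; missing files are silently skipped).
report() {
    label="$1"; pattern="$2"; shift 2
    if grep -aiqs -e "$pattern" "$@"; then
        echo "$label: possible evidence"
    else
        echo "$label: nothing found"
    fi
}

# triage_tserver LOG_DIR: check each of the three causes in turn.
# All patterns below are assumptions about the log wording.
triage_tserver() {
    d="$1"
    report "1. lost ZooKeeper lock" 'lost.*lock' "$d"/tserver*.log
    report "2. killed by master" 'halt' "$d"/master*.log "$d"/tserver*.log
    report "3. out of memory" 'OutOfMemoryError' "$d"/*.err "$d"/*.out
}
```

Running `triage_tserver "$ACCUMULO_HOME/logs"` prints one line per cause. A "possible evidence" line is only a pointer to where to read more closely, not a diagnosis.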
-
Re: Determining the cause of a tablet server failure
Mike Hugo 2013-02-27, 20:17
I'm chalking this up to a mis-configured server. It looks like during the install on this server the accumulo-env.sh file was copied from the examples, but rather than editing it to set the JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME, the entire file contents were replaced with those env variables.

I'm assuming this caused us to pick up the default (?) _OPTS settings rather than the correct ones we should have been getting based on our server memory capacity from the examples. So we had a bunch of accumulo related java processes all running with memory settings that were way out of whack from what they should have been.

To solve it I copied in the files from the conf/examples directory again and made sure everything was set up correctly and restarted everything.

We never did see anything in our log files or .out / .err logs indicating the source of the problem, but the above is my best guess as to what was going on.

Thanks again for all the tips and pointers!

Mike
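A stripped-down accumulo-env.sh like the one described here can be caught with a simple check that the file still assigns every variable you expect. The sketch below is generic: the function name `check_env_file` is hypothetical, and the variable names you pass in are up to you (the `_OPTS` name in the usage note is an assumption about what the example file defines).

```shell
# check_env_file FILE VAR...: report each VAR that FILE never assigns.
# Returns 0 when all variables are assigned somewhere in FILE, 1 otherwise.
check_env_file() {
    file="$1"; shift
    missing=0
    for var in "$@"; do
        # Look for "VAR=" at the start of a line or after whitespace,
        # which also covers lines such as "export VAR=...".
        if ! grep -Eq "(^|[[:space:]])${var}=" "$file"; then
            echo "not set in $file: $var"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible invocation is `check_env_file "$ACCUMULO_HOME/conf/accumulo-env.sh" JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME ACCUMULO_TSERVER_OPTS`. It only confirms that an assignment exists, not that the memory values suit the machine.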
-
Re: Determining the cause of a tablet server failure
Adam Fuchs 2013-02-27, 20:27
So, question for the community: inside bin/accumulo we have:

-XX:OnOutOfMemoryError="kill -9 %p"

Should this also append a log message? Something like:

-XX:OnOutOfMemoryError="kill -9 %p; echo 'ran out of memory' >> logfilename"

Is this necessary, or should the OutOfMemoryError still find its way to the regular log?

Adam
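One variation on this idea, offered only as a sketch, is to have the JVM call a small helper that writes a note to its own marker file (separate from any log4j-managed file) and then kills the process. Everything here is hypothetical: the function name, the marker file, and the script path in the usage note. The note is written before the kill on the assumption that anything listed after a kill of the JVM itself may never run.

```shell
# note_oom_and_kill PID MARKER_FILE [SIGNAL]
# Append a timestamped note to MARKER_FILE, then signal PID.
# Hypothetical helper; not part of Accumulo or the JVM.
note_oom_and_kill() {
    pid="$1"; marker="$2"; sig="${3:-KILL}"
    # Record the event first so the note exists even though the
    # process is about to be killed.
    printf '%s pid %s ran out of memory\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$pid" >> "$marker"
    kill -s "$sig" "$pid"
}
```

If this function were wrapped in a script, the flag might look like `-XX:OnOutOfMemoryError="/path/to/oom-handler.sh %p"`, where the script path is a placeholder and the script passes its first argument and a marker file name to `note_oom_and_kill`.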
-
Re: Determining the cause of a tablet server failure
John Vines 2013-02-27, 20:32
I don't like the idea of blending manual logging with log4j in a single file. It's in the .err file already; I don't think anything else is necessary.
-
Re: Determining the cause of a tablet server failure
Christopher 2013-02-27, 22:46
I agree with John Vines.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
-
Re: Determining the cause of a tablet server failure
Josh Elser 2013-02-27, 23:23
Ditto. I don't like the idea of sprinkling additional stuff into log4j, but I am in favor of trying to make it easier to recognize when tservers die due to OOMs if there are more suggestions.