We are running Cassandra 3.x.
Recently our production database has been unstable. One of our nodes keeps going down, anywhere from once every two days to a few times a day. The node goes down because the JVM runs out of memory. Based on my investigation, I suspect this is related to writing and/or compacting the large partitions in some of our large tables. Here is what might have happened:
1. The node went OOM because, under some conditions, it could not deserialize or compact some large partitions within its memory constraints.
2. Once we restarted it, usually a few hours later, the other nodes in the cluster started hinted handoff to the restarted node to replay the data it had missed. From then on, the node had to handle the handoff on top of its normal load, which made it even busier.
3. The node was not able to complete the handoff and went down again.
4. This repeated again and again.
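To get a rough feel for step 2, the hint backlog and its replay time can be estimated with simple arithmetic. The Python sketch below is only a back-of-the-envelope model, not a measurement: the function name and the example numbers are hypothetical, and the defaults assume the stock cassandra.yaml values of max_hint_window_in_ms (3 hours) and hinted_handoff_throttle_in_kb (1024). The effective throttle on a real cluster may be lower, so the actual settings should be checked.

```python
def hint_replay_estimate(downtime_s, write_rate_kb_s,
                         hint_window_s=3 * 3600, throttle_kb_s=1024):
    """Estimate hint backlog (KB) and replay time (s) for one restarted node.

    downtime_s      -- how long the node was down, in seconds
    write_rate_kb_s -- assumed write traffic destined for that node, in KB/s
    hint_window_s   -- hints stop being stored after this long (assumed default)
    throttle_kb_s   -- assumed hint delivery rate towards the node, in KB/s
    """
    # Hints are only collected while the outage is inside the hint window.
    hinted_seconds = min(downtime_s, hint_window_s)
    backlog_kb = hinted_seconds * write_rate_kb_s
    # Replay is limited by the throttle; normal writes arrive on top of it.
    replay_s = backlog_kb / throttle_kb_s
    return backlog_kb, replay_s


# Hypothetical numbers: 4 hours down, 500 KB/s of writes owned by the node.
backlog_kb, replay_s = hint_replay_estimate(4 * 3600, 500)
print(f"backlog: {backlog_kb / 1024 / 1024:.2f} GB, "
      f"replay: {replay_s / 60:.0f} min")
```

If the estimated replay time is long compared with the time the node usually survives after a restart, that would be consistent with the loop described in steps 2 to 4.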
This is not the first time we have seen this issue. Last time, we fixed it by manually stopping some of our aggregation jobs for a whole night so that the node could complete the handoff. We are not sure about the root cause yet, and we have no explanation for why this happens to only one node. I investigated the issue and found two related Cassandra JIRAs: https://issues.apache.org/jira/browse/CASSANDRA-8269
Both JIRAs mention that this might only be the case with Cassandra 2.x.
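One way to test the large-partition suspicion would be to count the large-partition warnings in system.log per table on the affected node. The Python sketch below assumes the warning text starts with "Writing large partition" or "Compacting large partition" followed by keyspace/table:key. That wording should be verified against the real log of the running version, and the sample lines and names here are made up for illustration.

```python
import re
from collections import Counter

# Assumed warning format: "... large partition <keyspace>/<table>:<key> (...)".
LARGE_PARTITION = re.compile(
    r"(?:Writing|Compacting) large partition "
    r"(?P<ks>[^/\s]+)/(?P<table>[^:\s]+):"
)


def count_large_partition_warnings(lines):
    """Return a Counter of (keyspace, table) -> number of warnings seen."""
    counts = Counter()
    for line in lines:
        match = LARGE_PARTITION.search(line)
        if match:
            counts[(match.group("ks"), match.group("table"))] += 1
    return counts


# Hypothetical log lines; a real run would iterate over the system.log file.
sample = [
    "WARN  [CompactionExecutor:1] Writing large partition "
    "metrics/events_by_day:2017-01-01 (1234567890 bytes)",
    "INFO  [main] Starting up",
    "WARN  [CompactionExecutor:2] Writing large partition "
    "metrics/events_by_day:2017-01-02 (2234567890 bytes)",
]
for (ks, table), n in count_large_partition_warnings(sample).most_common():
    print(f"{ks}.{table}: {n} warning(s)")
```

Comparing the counts from the failing node with those from a healthy node might also hint at why only one node is affected.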
Engineer - IT
Cisco Systems, Inc.