I am having issues with tablet servers going down due to poor contact times between nodes (my hypothesis, at least). In the past I have had stability success with smaller clouds (20-40 nodes), but have run into issues with a larger number of nodes (150+). Each node is a datanode, nodemanager, and tablet server. There is a master node running the hadoop namenode, hadoop resource manager, and accumulo master, monitor, etc. There are three zookeeper nodes. All nodes are VMs. This same setup is used on the smaller, stable clouds as well.
I do not believe memory allocation is an issue, as I have given hadoop/yarn (2.2.0) and accumulo (1.5.1) less than half of the available memory. The FATAL errors I have seen are:
Lost tablet server lock (reason = SESSION_EXPIRED), exiting
Lost ability to monitor tablet server lock, exiting
Other than bumping up the rpc timeout (which I have done, but I would rather find the root cause of the problem than rely on that), I have run out of ideas on how to solve this issue.
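For reference, the knobs in question live in accumulo-site.xml; roughly the following (values are illustrative, not recommendations; the 1.5 defaults are 30s and 120s). Note that ZooKeeper itself caps sessions at 20x tickTime (40s with the default 2000 ms tick), so raising instance.zookeeper.timeout past that also means raising maxSessionTimeout in zoo.cfg.

    <!-- accumulo-site.xml: illustrative values only -->
    <property>
      <name>instance.zookeeper.timeout</name>
      <value>60s</value>   <!-- session behind the tserver lock; default 30s -->
    </property>
    <property>
      <name>general.rpc.timeout</name>
      <value>240s</value>  <!-- default 120s -->
    </property>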
Does anyone have any insight into why I would be seeing such bad response times between nodes? Are there any configuration parameters I can play with to fix this?
I realize this is a very general question, so let me know if there is any information I can provide to help clarify the issue.
You are hitting the zookeeper session timeout (default 30s, I believe). You said you are not oversubscribed for memory, but what about CPU? Are you running YARN processes on the same nodes as the tablet servers? Is the tablet server being pushed into swap or starved of CPU?
Thank you for the responses. The number of CPUs is something I have considered. The worker nodes only have 4 CPUs each, and the YARN processes are running on the same nodes as the tablet servers.
On another cloud with 8 CPUs per worker, we have been able to run 10 YARN processes with 2 GB of memory each. Even though this configuration thrashes the workers (I have seen OS loads over 20), the tablet servers stay up.
I was worried about how many connections would be open on the larger cloud, so I significantly reduced the number of YARN processes. Side question: does each worker node have a connection to every other node? If so, my guess is that there would be significantly more open connections on a 150+ node cloud than on a 40-node cloud. For that reason, I only run 2 YARN processes with 2 GB of memory each on the larger cloud that is seeing the issues. My thought was that each YARN process needs a core, the tablet server needs a core, and OS tasks could probably use a core.
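To answer the side question with data instead of guessing, I figure something like this run on a worker during the job would show the real connection count (ss ships with iproute2):

    # total established TCP connections on this worker
    ss -tn state established | wc -l
    # group by peer host to see which nodes this worker actually talks to
    ss -tn state established | awk 'NR>1 {sub(/:[0-9]+$/, "", $4); print $4}' | sort | uniq -c | sort -rn | head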
Is there a more elegant way to see if the tablet server is being pushed into swap or starved of CPU other than just watching top during the YARN job?
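The best I have come up with so far is sampling the tserver process directly (sketch assumes sysstat's pidstat is installed and one tserver per node):

    # pid of the tablet server (assumes a single tserver per node)
    TSERVER_PID=$(pgrep -f 'accumulo.start.Main tserver')

    # non-zero VmSwap means part of the process has been swapped out
    grep VmSwap /proc/$TSERVER_PID/status

    # sample CPU (-u) and memory/paging (-r) every 5 seconds during the job;
    # sustained majflt/s points at swapping, low %CPU under load at starvation
    pidstat -u -r -p "$TSERVER_PID" 5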
I did look into zookeeper load a little bit, but I would be a little surprised to see issues there, as the zookeeper nodes on the big cloud (1 CPU, 8 GB RAM) have significantly more RAM than the zookeepers on the smaller cloud (1 CPU, 1 GB RAM). I did raise the memory limit for the Accumulo gc process, as I was seeing issues there early on.
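For reference, ZooKeeper's four-letter commands give a quick read of server load without extra tooling (mntr requires ZooKeeper 3.4+; zk1 is a placeholder hostname):

    # per-server latency, backlog, and connection counts
    echo mntr | nc zk1 2181 | grep -E 'latency|outstanding_requests|num_alive_connections'
    # 'stat' works on older releases too
    echo stat | nc zk1 2181 | head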
What is the best way to check the iowait times for the ZK transaction log?
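The rough approach I know of, assuming sysstat is installed on the ZK nodes (the path and device name below are placeholders), is:

    # find the device backing the transaction log dir (dataLogDir in zoo.cfg)
    df /path/to/zookeeper/dataLogDir

    # extended per-device stats every 5s; the fsync-heavy txn log suffers
    # when await (ms per request) climbs or %util pins near 100
    iostat -x 5 sda

Is there something better?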
Increasing the timeout settings helped a little, but when I tried to increase the number of map tasks on the workers, I ran into instability issues.
After re-reading my original post, I think I left out some important details. The type of job I am trying to run is a map reduce ingest that uses batch writers to populate an accumulo table. On previous, smaller clouds, I had control of disk allocation and made sure to assign a disk per worker to avoid write contention. On this larger cloud, the disk management is transparent to me, but I believe the physical disks backing the VMs are pooled into one large virtual volume. Write times on the big, unstable cloud are very fast, 3-4 times those of our smaller clouds, but that is measured when I dd a file on just one vm. I think that when all 150+ nodes are writing to disk, more than one node will try to write to the same physical disk and drive iowait to problematic levels (20-50% at least).
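To test that theory, I think a fleet-wide dd would reproduce it (rough sketch; assumes passwordless ssh and a workers.txt host list, and writes 2 GB of throwaway data per node). Compare the reported throughput and per-node iowait against the single-vm numbers:

    # kick off a direct-I/O write on every worker at the same time
    while read host; do
      ssh "$host" 'dd if=/dev/zero of=/tmp/ddtest bs=1M count=2048 oflag=direct 2>&1 | tail -1' &
    done < workers.txt
    wait

    # clean up afterwards
    while read host; do ssh "$host" 'rm -f /tmp/ddtest' & done < workers.txt
    wait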
So, given my situation, what is the best way to configure accumulo knowing that the workers share disks and will contend for writes? Do I just dial resources down during ingest for stability, then ramp them back up for non-ingest jobs?
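One idea I am considering for dialing ingest down: since YARN schedules by memory, inflating each map task's memory ask caps per-node map concurrency without touching cluster config (jar and class names below are made up; assumes the job uses ToolRunner so -D properties apply):

    # with yarn.nodemanager.resource.memory-mb=8192 on each worker,
    # a 4096 MB ask limits YARN to 2 concurrent map tasks per node
    hadoop jar my-ingest.jar com.example.IngestJob \
        -Dmapreduce.map.memory.mb=4096 \
        -Dmapreduce.map.java.opts=-Xmx3072m \
        /input /output

Shrinking the batch writers' buffer and write threads (BatchWriterConfig in 1.5) seems like another way to smooth the write bursts.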