My name is Claudiu Soroiu. I am new to HBase/Hadoop, but not new to distributed computing in FT/HA environments, and I see there are a lot of issues reported related to region server failures.
The main problem I see is the recovery time in case of a node failure, and distributed log splitting in particular. After some tuning I managed to reduce it to 8 seconds in total, which fits our needs for the moment.
I have one question: *Why is there only one WAL file per region server, and not one WAL per region itself?* I haven't found the exact answer anywhere, which is why I'm asking on this list - please point me in the right direction if I missed it.
My point is that eliminating the need to split a log in case of failure reduces the downtime for the regions; the only remaining delay would be transferring data over the network to the region servers that take over the failed regions. This is feasible only if having multiple WALs per region server does not hurt overall write performance.
I do not think it's a good idea to have one WAL file per region. The whole single-WAL design is based on the assumption that writing data sequentially reduces average latency and increases total throughput. That is no longer the case with one WAL file per region: you may have hundreds of active regions per RS, all those sequential writes become random ones, and random I/O on rotational media is very bad, very bad.
On the other hand, 95% of HBase users don't actually configure HDFS to fsync() every edit. Given that, the random writes aren't actually going to cause one seek per write -- they'll get buffered up and written back periodically in a much more efficient fashion.
Plus, in some small number of years, I believe SSDs will be available on most server machines (in a hybrid configuration) so the seeks will cost less even with fsync on.
-Todd
Todd Lipcon, Software Engineer, Cloudera
*On the other hand, 95% of HBase users don't actually configure HDFS to fsync() every edit. Given that, the random writes aren't actually going to cause one seek per write -- they'll get buffered up and written back periodically in a much more efficient fashion.*
Todd, that is the theory; reality is different. One writer is definitely more efficient than a hundred. This won't scale well.
I'd actually disagree. 100 is probably significantly faster than 1, given that most machines have 12 spindles. So, yes, you'd be multiplexing 8 or so logs per spindle, but even 100 logs only require a few hundred MB worth of buffer cache in order to get good coalescing of writes into large physical IOs.
If memory is really constrained on your machine, you'll probably see some throughput collapse as you hit really inefficient dirty-page throttling, but as long as you leave a few GB unallocated, I bet the reality is much closer to what I said than you might think.
It makes sense to have as many WALs as (number of spindles / replication factor) per machine. This should be decoupled from the number of regions on a region server. So for a cluster with 12 spindles we should likely have at least 4 WALs (12 spindles / 3x replication), and we need experiments to see whether going to 8 or some higher number makes sense (the new WAL uses a disruptor pattern which avoids much of the contention on individual writes). With your example, your 1000 regions would get sharded into the 4 WALs, which would maximize IO throughput and disk utilization, and reduce recovery time in the face of failure.
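For what it's worth, a sketch of what that configuration could look like once the multi-WAL work (HBASE-5699) lands - treat the property names as hypothetical until it ships:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MultiWalSketch {
  public static Configuration sketch() {
    Configuration conf = HBaseConfiguration.create();
    // Group regions into a handful of WALs instead of one per region server.
    conf.set("hbase.wal.provider", "multiwal");
    // One WAL group per (spindles / replication factor), e.g. 12 / 3 = 4.
    conf.setInt("hbase.wal.regiongrouping.numgroups", 4);
    return conf;
  }
}

In practice these would go in hbase-site.xml; the point is just that the WAL count becomes a knob independent of the region count.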
In an SSD world, it makes more sense to have one WAL per node once we have decent HSM support in HDFS. The key win here will be recovery time -- if any RS goes down we only have to replay a region's edits, and not have to split or demux different regions' edits.
Jon.
// Jonathan Hsieh (shay) // HBase Tech Lead, Software Engineer, Cloudera // @jmhsieh
Thanks for catching that -- it was a typo; I meant one WAL per region.
// Jonathan Hsieh (shay) // HBase Tech Lead, Software Engineer, Cloudera // @jmhsieh
# of WALs as roughly spindles / replication factor seems intuitive. It would be interesting to benchmark.
As for one WAL per region: IIRC the BigTable paper says they didn't do it because of concerns about the number of seeks in the filesystems underlying GFS, and because it would reduce the effectiveness of the group-commit throughput optimization. If WALs are backed by SSD, the first consideration certainly no longer holds. We also have a global HDFS file limit to contend with. I know HDFS is incrementally improving the scalability of a namespace, but this is still an active consideration. (Or we could try partitioning a deploy over a federated namespace? Could be "interesting". Has anyone tried that? I haven't heard.)
Best regards,
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
In general I have never seen nor heard of federated namespaces in the wild, so I would be hesitant to go down that path. But, you know, for "Science" I would be interested in seeing how that worked out. Would we be looking at 32 WALs per region? For a large cluster with 1,000 nodes, 100 regions per node, and a WAL per region (I like easy math):
1000 * 100 * 32 = 3.2 million files for WALs. This is not ideal, but it is not horrible if we are using 128 MB block sizes, etc.
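As a rough back-of-envelope for what that costs the NameNode (assuming the common rule of thumb of ~150 bytes of NN heap per inode and per block object - real numbers vary by version and replication):

public class NnHeapEstimate {
  public static void main(String[] args) {
    long files = 1000L * 100 * 32;   // the 3.2M WAL files from above
    long objects = files * 2;        // roughly one inode + one block per small WAL
    long bytesPerObject = 150;       // rule-of-thumb cost per namespace object
    System.out.printf("~%d MB of NN heap%n", objects * bytesPerObject / (1024 * 1024));
    // Prints ~915 MB; actual usage lands higher once replica locations and
    // JVM overhead are counted, which is how you get into the multi-GB range.
  }
}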
I feel like I am missing something above, though. Thoughts?
Kevin O'Dell, Systems Engineer, Cloudera
You'd probably know better than I, Kevin, but I'd worry about the 1000 * 1000 * 32 case, where HDFS is as (over)committed as the HBase tier.
Best regards,
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
I agree, there is definitely a chance HDFS doesn't have an extra 3 GB of NN heap to squeeze out for HBase. It would be interesting to check in with the Flurry guys and see what their NN pressure looks like. As clusters become more multi-tenant, HDFS pressure could become a real concern. That said, I have not seen too many clusters that have a ton of files and are choking the NN into large GC pauses; usually the end user is doing something wrong, and we can use something like HAR to help clean up the FS.
Kevin O'Dell, Systems Engineer, Cloudera
**How about 300 regions with 3x replication? Or 1000 regions? That is going to be 3000 files on HDFS, per one RS.**
Now I see that the trade-off is how to reduce recovery time without affecting the overall performance of the cluster. Having too many WALs hurts write performance, so while multiple WALs might improve the process, their number should stay relatively small.
Would it be feasible to know ahead of time where a region would be activated in case of a failure, and to keep for each region server a second WAL file containing backup edits? E.g. if machine B crashes, then one region gets assigned to node A, one to node C, etc. Viewed the other way around: server A backs up one region from server B in case it crashes, one region from server C, and so on. Basically this second WAL would contain the data needed to fast-recover a crashed node. It adds extra redundancy and some degree of complexity to the solution, but it ensures data locality in case of a crash and faster recovery; a rough sketch of the routing is below.
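To make the idea concrete, the write path might look something like this - a purely hypothetical sketch, none of these classes exist in HBase:

import java.io.IOException;
import java.util.Map;

/** Hypothetical: every edit goes to the local WAL and is mirrored to a
 *  backup WAL on the server pre-assigned to take the region over, so a
 *  failover needs no log splitting, only a replay of already-local edits. */
public class BackupWalRouter {
  interface Wal { void append(String region, byte[] edit) throws IOException; }

  private final Wal localWal;
  private final Map<String, Wal> backupWalByRegion; // region -> WAL on its failover target

  public BackupWalRouter(Wal localWal, Map<String, Wal> backupWalByRegion) {
    this.localWal = localWal;
    this.backupWalByRegion = backupWalByRegion;
  }

  public void append(String region, byte[] edit) throws IOException {
    localWal.append(region, edit);                        // normal write path
    backupWalByRegion.get(region).append(region, edit);   // mirrored backup edit
  }
}

The obvious cost is that every edit is written twice (on top of HDFS replication underneath), which is exactly the duplication concern raised below.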
**What did you do, Claudiu, to get the time down?**
I decreased the HDFS block size to 64 MB for now and enabled the settings that avoid stale HDFS nodes. The cluster I tested on was relatively small - 10 machines. I tuned the ZooKeeper session timeout to keep the heartbeat at 5 seconds for the moment, and I plan to decrease this value. At this point dfs.heartbeat.interval is at the default 3 seconds, but I plan to decrease this too and run a more intensive test. (Decreasing these times is based on experience with our current system, configured at 1.2 seconds, which hasn't had any issues even under heavy load; obviously stop-the-world GC pauses should be shorter than the heartbeat interval.) I also remember changing the client reconnect intervals to let the client reconnect to a region as fast as possible. I am at an early stage of experimenting with HBase and there are a lot of things left to test/check... The settings involved are sketched below.
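For reference, the knobs I touched map to roughly these properties (a sketch of my test setup, not a recommendation - these normally live in hdfs-site.xml / hbase-site.xml, shown programmatically just to list them in one place):

import org.apache.hadoop.conf.Configuration;

public class RecoveryTuningSketch {
  public static Configuration sketch() {
    Configuration conf = new Configuration();
    // 64 MB HDFS block size (hdfs-site.xml)
    conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
    // Skip stale datanodes for reads and writes (hdfs-site.xml)
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    // Datanode heartbeat, still the 3s default for now (hdfs-site.xml)
    conf.setLong("dfs.heartbeat.interval", 3);
    // 5s ZooKeeper session timeout so a dead RS is noticed quickly (hbase-site.xml)
    conf.setInt("zookeeper.session.timeout", 5000);
    // Shorter client retry pause for faster reconnects (client-side hbase-site.xml)
    conf.setLong("hbase.client.pause", 100);
    return conf;
  }
}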
Would the second WAL contain the same contents as the first?
We already have code that adds an interceptor on calls to namenode#getBlockLocations so that blocks on the same DN as the dead RS are placed at the end of the priority queue. See addLocationsOrderInterceptor() in hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java.
This is for faster recovery in case the region server is deployed on the same box as the datanode, since the replicas on that box are likely unreachable as well. The reordering it performs is sketched below.
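The core reordering idea looks roughly like this - a sketch of the concept, not the actual HBase implementation:

import java.util.Arrays;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

public class DeadNodeDeprioritizer {
  /** Reorder each block's replica locations in place so that replicas on the
   *  dead region server's host sort last and recovery reads hit healthy DNs first. */
  static void deprioritize(LocatedBlocks blocks, String deadHost) {
    for (LocatedBlock lb : blocks.getLocatedBlocks()) {
      DatanodeInfo[] locations = lb.getLocations();
      // false sorts before true, so locations matching deadHost move to the end.
      Arrays.sort(locations, (a, b) -> Boolean.compare(
          deadHost.equals(a.getHostName()),
          deadHost.equals(b.getHostName())));
    }
  }
}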
Yes, overall the second WAL would contain the same data, just distributed differently: a server's second WAL would hold the data from the regions it would take over if they failed. It is just an idea, though, as duplicating the data across the cluster might not be a good thing.
What you described sounds like the favored nodes feature, but there are still some open (and stale...) JIRAs there: HBASE-9116 and co. You may also want to look at the hbase.master.distributed.log.replay option, as it allows writes during recovery. And for the client there is hbase.status.published, which can help...
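Enabling those two looks like this (a sketch; both are hbase-site.xml settings, shown programmatically for brevity):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RecoveryOptionsSketch {
  public static Configuration sketch() {
    Configuration conf = HBaseConfiguration.create();
    // Replay WAL edits directly onto the newly assigned RS, so regions
    // can accept writes while recovery is still in progress.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    // Have the master publish cluster status (e.g. dead servers) so
    // clients stop retrying against a dead RS sooner.
    conf.setBoolean("hbase.status.published", true);
    return conf;
  }
}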
This sounds like what I called Shadow Memstores. It depends on HDFS file affinity groups (favored nodes could help, but aren't guaranteed) and could be used for super-fast edit recovery. See this thread and JIRA; here's a link to a doc I posted on the HBASE-10070 JIRA. This requires some simplifications on the master side and should be compatible with the current approach in HBASE-10070.
Thanks for the hints. I will take a look and explore the idea.
Claudiu