Varun Sharma
2012-10-30, 17:41
Jean-Marc Spaggiari
2012-10-30, 17:53
Varun Sharma
2012-10-30, 18:03
Marcos Ortiz
2012-10-30, 20:20
Varun Sharma
2012-11-01, 08:01
Jeremy Carroll
2012-11-01, 16:31
Marcos Ortiz Valmaseda
2012-11-01, 11:17
Leonid Fedotov
2012-11-01, 17:09
Patrick Angeles
2012-11-01, 19:11
Patrick Angeles
2012-11-01, 19:20
Stack
2012-11-01, 18:59
Kevin O'dell
2012-10-30, 19:15
Kevin O'dell
2012-10-30, 19:16
-
Hbase cluster for serving real time site traffic
Varun Sharma 2012-10-30, 17:41
Hi,

We are planning to experiment with a cluster for serving production traffic using hbase for pinterest. We are starting off with a 10 region server + 1 master cluster on Amazon EMR version 0.92. I had some very naive questions (primarily around points of failure):

1) It seems hbase starts only one zookeeper on the master node - which is critical for operation - how many zookeepers should I use and can I run those on the region servers?
2) How many masters to use - does hbase support multiple masters (primary and secondary) within the same cluster? From my understanding, master availability is not critical for operation.
3) NameNode - We are running hadoop 0.8 - I have read that NameNode is a single point of failure and we should really be running two name node(s) so we can failover. Is it fine to run these on the region servers?
4) Our current application involves long row/column - 24-32 bytes with 0-1 bytes of values. Should we be using a different key encoding than the default encoding? What advantages could it buy us?

We are currently using amazon EMR for testing purposes which runs hbase 0.92. If it works well, we would like to configure our own cluster with probably the latest version of hbase which appears to be 0.94 at the moment.

Thanks
Varun
-
Re: Hbase cluster for serving real time site traffic
Jean-Marc Spaggiari 2012-10-30, 17:53
My 2¢.

1) You need an odd number of ZooKeeper nodes, so 3 is the minimum recommended for production.
2) Yes, you have Master and SecondaryMaster, and it's also recommended to have one of each. And the master is critical: if you lose it, you lose your cluster.
3) NameNode is hadoop, not hbase. You should follow the hadoop recommendations. Just as you have a secondarymaster, you have a secondarynamenode. So I think you should have as many secondarynamenodes as you have secondarymasters (on the same machine?).
4) I'm not sure I understand this question. Keys are binary: arrays of bytes. So 32 0-1 bytes is a 3 bytes long array. It's not a lot. This will only give you 2^32 different rows. You will have to pre-split them, or you will end up with almost all of them on the same regionserver.

JM
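As an illustration of the pre-splitting advice above (not part of the original message): a minimal Python sketch that computes evenly spaced split keys over a fixed-width binary key space. It assumes row keys are roughly uniformly distributed (for example, hashed), which the thread does not confirm; the function name `split_points` and the 4-byte key width are hypothetical choices made only for the example.

```python
def split_points(num_regions: int, key_bytes: int = 4) -> list[bytes]:
    """Return num_regions - 1 evenly spaced split keys.

    Assumption: keys are fixed-width binary values spread uniformly
    over the whole key space (e.g. hashed keys).
    """
    if num_regions < 2:
        return []  # a single region needs no split keys
    space = 256 ** key_bytes  # number of distinct keys of this width
    return [
        (i * space // num_regions).to_bytes(key_bytes, "big")
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    # One region per region server in a 10-server cluster (illustrative).
    for key in split_points(10):
        print(key.hex())
```

The resulting byte strings could then be handed to whatever table-creation call accepts explicit split keys. If the real keys are not uniformly distributed, the boundaries would need to come from a sample of the actual data instead.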
-
Re: Hbase cluster for serving real time site traffic
Varun Sharma 2012-10-30, 18:03
Thanks for the tips.

So, yes, secondary NameNode is probably more critical than the secondary master, since the master is only responsible for metadata changes/region splits/table creation etc. and not for writes/reads.

Regarding the keys question - I meant that the (row + column) length is 24-32 bytes and the value length is 0-1 bytes. Currently, we have a cluster running with all the data loaded into hbase but it all runs with default settings.

Thanks
Varun
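To put the key and value sizes above in perspective (an illustration, not part of the original message): a rough Python estimate of how much of each stored cell is actual value. It assumes the classic HBase KeyValue layout, in which every cell carries a 4-byte key length, a 4-byte value length, a 2-byte row length, a 1-byte family length, an 8-byte timestamp and a 1-byte type in addition to the row, family, qualifier and value bytes. The 16/16 split between row and column and the 1-byte family name are hypothetical.

```python
# Fixed per-cell bytes in the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_len: int, qualifier_len: int, value_len: int,
              family_len: int = 1) -> int:
    """Uncompressed size in bytes of one cell under the assumed layout."""
    return FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len


def payload_fraction(row_len: int, qualifier_len: int, value_len: int,
                     family_len: int = 1) -> float:
    """Share of the cell that is value rather than key or bookkeeping."""
    return value_len / cell_size(row_len, qualifier_len, value_len, family_len)


if __name__ == "__main__":
    # Hypothetical split of a 32-byte (row + column) key with a 1-byte value.
    size = cell_size(row_len=16, qualifier_len=16, value_len=1)
    share = payload_fraction(16, 16, 1)
    print(f"{size} bytes per cell, {share:.1%} of it is value")
```

Under these assumptions nearly all stored bytes are key material, which is why block compression and key layout tend to matter more than value handling for a table shaped like this.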
-
Re: Hbase cluster for serving real time site traffic
Marcos Ortiz 2012-10-30, 20:20
Regards, Varun, answers in line

On 10/30/2012 01:03 PM, Varun Sharma wrote:
> Thanks for the tips.
>
> So, yes, secondary NameNode is probably more critical than the secondary
> master - since the master is only responsible for metadata changes/region
> splits/table creation etc and not for writes/reads.

Exactly, you have to create a good HA strategy for these nodes (Master and Secondary Master).

> Regarding the keys question - i meant that the (row + column) length is
> 24-32 bytes and the value length is 0-1 bytes. Currently, we have a cluster
> running with all the data loaded into hbase but it all runs with default
> settings.

There are many areas that you can optimize in an HBase cluster:
- Write operations
- Compaction and split optimization
- Region server size
- Snappy compression
- Schema design
- Use of block caching for scan optimization
- Use of asynchronous clients for HBase operations (asynchbase, for example [1])
etc.

Lars George's excellent book "HBase: The Definitive Guide" has a complete chapter on this tricky topic (Chapter 11).

Some additional resources:

[1] https://github.com/stumbleupon/asynchbase
https://github.com/twitter/finagle
http://gbif.blogspot.com/2012/02/performance-evaluation-of-hbase.html
http://gbif.blogspot.com/2012/02/monitoring-hadoop-and-hbase.html
http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/

Look at all the tagged presentations on SlideShare from the last HBaseCon, for example Benoit's talk "Lessons learned from OpenTSDB" and Lars Hofhansl's "HBase Schema Design":
http://www.slideshare.net/cloudera/tag/hbasecon-2012

Best wishes

Marcos Luis Ortíz Valmaseda
about.me/marcosortiz <http://about.me/marcosortiz>
@marcosluis2186 <http://twitter.com/marcosluis2186>
-
Re: Hbase cluster for serving real time site traffic
Varun Sharma 2012-11-01, 08:01
Thanks all for the helpful comments. I read up on HA and was wondering if there are good tools for setting up an HA HDFS + Hbase cluster on EC2 quickly. From my reading, it appears that tools like Whirr still have issues with bringing up the secondary NN on a different machine etc. Also, for availability, would Master-Slave replication or Master-Master replication be a substitute for having the secondary NN?

For zookeeper, should the servers be running ZK only or is it fine to share with other services like the master? Also, is it better to have a dedicated zookeeper cluster per hbase cluster?

Thanks
Varun
-
Re: Hbase cluster for serving real time site traffic
Jeremy Carroll 2012-11-01, 16:31
In production you would want 3, 5, 7, etc. ZKs (an odd number) for quorum reasons. They should run on dedicated machines, but those do not have to be very big. Updates to ZK are written to disk before they are applied in memory, for recoverability, so faster disks help once you start getting more ZK traffic. Once you go to production, 3 nodes should be fine (http://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html#Implementation).
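The "odd number for quorum reasons" advice can be checked with a little arithmetic (an illustration, not part of the original message). The sketch below assumes plain majority quorums, where an ensemble stays available only while more than half of its members are up; the function names are made up for the example.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} servers: majority {majority(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 servers raises the majority from 2 to 3 without raising the number of failures survived, so an even-sized ensemble adds a machine that can fail without adding any tolerance for failure.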
-
Re: Hbase cluster for serving real time site traffic
Marcos Ortiz Valmaseda 2012-11-01, 11:17
Regards, Varun.

1- I think that you should take a look at Cloudera Manager for CDH 4.1 to create an HA HDFS environment. Remember that version 2.0.x is not ready for production yet. The stable version is Hadoop 1.0.4 with HBase 0.94.2.

2- Yes, a recommended practice is to keep a Zookeeper ensemble (three, five or seven are good sizes for the ensemble) separate from your NN and HB Master. For example:
- 1 NN/HB Master, JT
- 5 DN, HR Servers, TT
- 3 nodes for the Zookeeper quorum.

Best wishes.
-
Re: Hbase cluster for serving real time site traffic
Leonid Fedotov 2012-11-01, 17:09
Varun,

for HA NameNode you may want to look at the Hortonworks HDP 1.1 release. It is supported on vSphere and on RedHat HA cluster. HDP 1.1 is based on Hadoop 1.0.3 and is fully certified for production environments. Do not forget, Hadoop 2.0 is still in the alpha testing stage and cannot be recommended for production systems.

As for ZK nodes: depending on the amount of ZK traffic, you may not need to put them on separate nodes; they could easily coexist with DN.

However, it is better to split NN and HB Master onto separate nodes, for example NN on one node and HB Master and JT on another node.

Thank you!

Sincerely,
Leonid Fedotov
Technical Support Engineer
-
Re: Hbase cluster for serving real time site traffic
Patrick Angeles 2012-11-01, 19:11
On Thu, Nov 1, 2012 at 1:09 PM, Leonid Fedotov <[EMAIL PROTECTED]>wrote:
> Varun,
> for HA NameNode you may want to look at Hortonworks HDP 1.1 release. It
> supported on vSphere and on RedHat HA cluster.
> HDP 1.1 based on Hadoop 1.0.3 and fully certified for production
> environments.
> Do not forget, Hadoop 2.0 is still in alpha testing stage and a can not be
> recommended for production systems.

HA NameNode is actually running in a number of HBase production systems.

> As of ZK nodes:
> depending on the amount of ZK traffic, you may not need to put it to the
> separate nodes, it could easily coexist with DN .

This is a very bad idea. You should never co-locate ZK on a worker node, as it can be starved of CPU or IOPS and time out (thereby causing cascading failures). This can happen, for example, when someone submits an MR job.

> However, it is better to split NN and HBmaster to separate nodes. Like NN
> on one node and HB Master and JT on other node.

Why? The HMaster exerts very little load on the host. If you have three masters and want HA, you can have the following config:

Host 1: Primary NN, HMaster1, ZK1
Host 2: Standby NN, HMaster2, ZK2
Host 3: JT, HMaster3, ZK3
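A minimal sketch of how the three-host layout above could be checked mechanically. The host names and role labels (`host1`, `NN-primary`, `RS`, and so on) are hypothetical and not taken from any real tool. It flags any host that runs ZooKeeper next to a worker role and counts how many hosts carry a given role:

```python
# Hypothetical role labels; the hosts mirror the three-host layout in the reply.
WORKER_ROLES = {"DN", "RS", "TT"}  # DataNode, RegionServer, TaskTracker

LAYOUT = {
    "host1": {"NN-primary", "HMaster", "ZK"},
    "host2": {"NN-standby", "HMaster", "ZK"},
    "host3": {"JT", "HMaster", "ZK"},
}


def zk_on_worker_hosts(layout):
    """Return the hosts that run ZooKeeper together with any worker role."""
    return sorted(
        host for host, roles in layout.items()
        if "ZK" in roles and roles & WORKER_ROLES
    )


def role_count(layout, role):
    """Count how many hosts carry the given role."""
    return sum(1 for roles in layout.values() if role in roles)


if __name__ == "__main__":
    print(zk_on_worker_hosts(LAYOUT))
    print(role_count(LAYOUT, "ZK"), role_count(LAYOUT, "HMaster"))
```

Adding a worker host that also carries `ZK` to the mapping makes `zk_on_worker_hosts` report it, which is the co-location the reply warns against.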
Re: Hbase cluster for serving real time site traffic
Patrick Angeles 2012-11-01, 19:20
I should have added that if you have one host for all the master roles (NN, JT, HMaster), then you may as well go with a single ZK node (quorum 1) on that same server.
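To make the trade-off concrete: ZooKeeper stays available only while a strict majority of the ensemble is up, so a single node (quorum 1) tolerates no failures at all. A small sketch of that arithmetic, with illustrative function names:

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must have at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


if __name__ == "__main__":
    for size in (1, 3, 4, 5):
        print(size, majority(size), tolerated_failures(size))
```

A four-node ensemble tolerates one failure, the same as three nodes, which is why odd ensemble sizes are the usual recommendation.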
Re: Hbase cluster for serving real time site traffic
Stack 2012-11-01, 18:59
On Thu, Nov 1, 2012 at 10:09 AM, Leonid Fedotov <[EMAIL PROTECTED]> wrote:

> Varun,
> for HA NameNode you may want to look at Hortonworks HDP 1.1 release. It
> supported on vSphere and on RedHat HA cluster.
> HDP 1.1 based on Hadoop 1.0.3 and fully certified for production
> environments.
> Do not forget, Hadoop 2.0 is still in alpha testing stage and a can not be
> recommended for production systems.

Hey Leonid:

We try to keep these mailing lists sales pitch free. Please refrain from posting commercial pitches like the above. If we start to slip at all on this rule, all vendors will think they have license to start dumping the merits of their wares here and this list will fail to pass my spam filter. I'd like to avoid that.

Thanks,
St.Ack
Re: Hbase cluster for serving real time site traffic
Kevin O'dell 2012-10-30, 19:15
Varun,

I will take a shot at answering this:

1) It seems hbase starts only one zookeeper on the master node - which is critical for operation - how many zookeepers should I use and can I run those on the region servers ?
<-- 3, and they should be on dedicated servers for a real production environment.

2) How many masters to use - does hbase support multiple masters (primary and secondary) within the same cluster ? From my understanding, master availability is not critical for operation.
<-- 2. If you lose the master, you lose HBase. The Master is VERY critical.

3) NameNode - We are running hadoop 0.8 - I have read that NameNode is a single point of failure and we should really be running two name node(s) so we can failover. Is it fine to run these on the region servers ?
<-- 2. You will want to use HA for a real production workload. The SNN (Secondary Name Node) is a very misleading name.

So, yes, secondary NameNode is probably more critical than the secondary master - since the master is only responsible for metadata changes/region splits/table creation etc and not for writes/reads.
<-- This is not correct. The Secondary Name Node is not a failover node. You will want to use a release that has HA to guarantee availability at the NN level. The master is in charge of metadata operations, but also, without the Master the RS will not just continue to work. It is very important to have two masters.

I will defer to Jean-Marc on the schema design.

Kevin O'Dell
Customer Operations Engineer, Cloudera
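As a rough illustration of the three-node ZooKeeper recommendation, the sketch below renders the `hbase.zookeeper.quorum` property for `hbase-site.xml`. The host names are placeholders, and the odd-size check simply encodes the advice given in this thread rather than anything HBase itself enforces:

```python
import xml.etree.ElementTree as ET


def quorum_property(hosts):
    """Render an hbase-site.xml <property> element listing the ZK hosts.

    Assumption: an odd, non-empty host list is required here only because
    that is the recommendation in this thread.
    """
    if not hosts or len(hosts) % 2 == 0:
        raise ValueError("expected an odd number of ZooKeeper hosts")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hbase.zookeeper.quorum"
    ET.SubElement(prop, "value").text = ",".join(hosts)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Placeholder host names for three dedicated ZooKeeper servers.
    print(quorum_property(["zk1.example.com", "zk2.example.com", "zk3.example.com"]))
```

The returned fragment would sit inside the `<configuration>` element of `hbase-site.xml` on each HBase node.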
Re: Hbase cluster for serving real time site traffic
Kevin O'dell 2012-10-30, 19:16
Sorry, I also forgot: do not run your NN and failover node alongside other services.

Kevin O'Dell
Customer Operations Engineer, Cloudera