|
jaxzin
2010-03-09, 15:45
Barney Frank
2010-03-09, 16:13
jaxzin
2010-03-09, 16:21
Gary Helmling
2010-03-09, 17:05
jaxzin
2010-03-09, 17:29
Charles Woerner
2010-03-09, 18:20
jaxzin
2010-03-09, 19:56
jaxzin
2010-03-09, 20:08
Jonathan Gray
2010-03-09, 22:08
Ryan Rawson
2010-03-09, 22:29
Charles Woerner
2010-03-09, 23:12
Ryan Rawson
2010-03-09, 23:34
charleswoerner@...
2010-03-09, 23:40
Andrew Purtell
2010-03-10, 00:12
Amandeep Khurana
2010-03-10, 00:20
Ryan Rawson
2010-03-10, 00:41
Charles Woerner
2010-03-10, 01:49
Wade Arnold
2010-03-10, 05:02
Hua Su
2010-03-10, 09:01
Andrew Purtell
2010-03-10, 09:36
Hua Su
2010-03-10, 09:57
|
-
Use cases of HBase jaxzin 2010-03-09, 15:45
Hi all, I've got a question about how everyone is using HBase. Is anyone using it as an online data store to directly back a web service?

The textbook example of a weblink HBase table suggests there would be an associated web front-end to display the information in that HBase table (e.g. a search results page), but I'm having trouble finding evidence that anyone is servicing web traffic backed directly by an HBase instance in practice.

I'm evaluating whether HBase would be the right tool to provide a few things for a large-scale web service we want to develop at ESPN, and I'd really like to get opinions and experience from people who have already been down this path. No need to reinvent the wheel, right?

I can tell you a little about the project goals if it helps give you an idea of what I'm trying to design for:

1) Highly available (it would be a central service and an outage would take down everything)
2) Low latency (1-2 ms; less is better, more isn't acceptable)
3) High throughput (5-10k req/sec at worst-case peak)
4) Unstable traffic (e.g. Sunday afternoons during football season)
5) Small data...for now (< 10 GB of total data currently, but HBase could allow us to design differently and store more online)

The reason I'm looking at HBase is that we've solved many of our scaling issues with the same basic concepts as HBase (sharding, flattening data to fit in one row, throwing away ACID, etc.) but with home-grown software. I'd like to adopt an active open-source project if it makes sense.

Alternatives I'm also looking at: RDBMS fronted with WebSphere eXtreme Scale, RDBMS fronted with Hibernate/ehcache, or (the option I understand the least right now) memcached.

Thanks,
Brian

--
View this message in context: http://old.nabble.com/Use-cases-of-HBase-tp27837470p27837470.html
Sent from the HBase User mailing list archive at Nabble.com.
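As a rough back-of-the-envelope reading of goals 2 and 3 above, Little's law (average requests in flight = arrival rate x time per request) gives a feel for the concurrency involved. This is only a sketch: the function name is made up for the example, and the inputs are simply the figures quoted in the list.

```python
def in_flight_requests(req_per_sec: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return req_per_sec * (latency_ms / 1000.0)


# Figures taken from the goals above: 5-10k req/sec, 1-2 ms per request.
for rate in (5_000, 10_000):
    for latency in (1.0, 2.0):
        concurrency = in_flight_requests(rate, latency)
        print(f"{rate} req/s at {latency} ms -> {concurrency:.0f} requests in flight")
```

Even at the peak figures, only a couple of dozen requests are in flight at once on average, so the hard part of these goals is the per-request latency bound rather than raw concurrency.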
-
Re: Use cases of HBase Barney Frank 2010-03-09, 16:13
I am using HBase to store visitor-level, clickstream-like data. At the beginning of a visitor session I retrieve all the previous session data from HBase, massage it a little within my app server, and serve it to the consumer via web services.

Where I think you will run into the most problems is your latency requirement.

Just my 2 cents from a user.
-
Re: Use cases of HBase jaxzin 2010-03-09, 16:21
This is exactly the kind of feedback I'm looking for. Thanks, Barney.

So it sounds like you cache the data you get from HBase in session-based memory? Are you using a Java EE HttpSession? (I'm less familiar with the django/rails equivalents, but I'm assuming they exist.) Or are you using a memory cache provider like ehcache or memcache(d)?

Can you tell me more about your experience with latency and why you say that?
-
Re: Use cases of HBase Gary Helmling 2010-03-09, 17:05
Hey Brian,

We use HBase to complement MySQL in serving activity-stream type data here at Meetup. It's handling real-time requests involved in 20-25% of our page views, but our latency requirements aren't as strict as yours. For what it's worth, I did a presentation on our setup which will hopefully fill in some details: http://www.slideshare.net/ghelmling/hbase-at-meetup

There are also some great presentations by Ryan Rawson and Jonathan Gray on how they've used HBase for realtime serving on their sites. See the presentations wiki page: http://wiki.apache.org/hadoop/HBase/HBasePresentations

Like Barney, I suspect where you'll hit some issues will be in your latency requirements. Depending on how you lay out your data and configure your column families, your average latency may be good, but you will hit some pauses, as I believe reads block at times during region splits or compactions and memstore flushes (unless you have a fairly static data set). Others here should be able to fill in more details.

With a relatively small dataset, you may want to look at the "in memory" configuration option for your column families.

What's your expected workload -- writes vs. reads? What types of reads will you be doing: random access vs. sequential? There are a lot of knowledgeable folks here to offer advice if you can give us some more insight into what you're trying to build.

--gh
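To illustrate the point above that average latency can look good while individual reads still stall, here is a small sketch with made-up numbers. The 1 ms and 50 ms figures and the 1% stall rate are assumptions chosen for the example, not measurements from any HBase cluster.

```python
import math
import statistics

# Assumed numbers for illustration only: 990 fast reads at 1 ms and
# 10 reads that stall for 50 ms (say, behind a flush or compaction).
latencies_ms = [1.0] * 990 + [50.0] * 10


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(n * pct / 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
p995_ms = percentile(latencies_ms, 99.5)

print(f"mean={mean_ms:.2f} ms median={median_ms} ms p99={p99_ms} ms p99.5={p995_ms} ms")
```

With these assumed inputs the mean stays inside a 1-2 ms budget, yet one read in a hundred takes 50 ms. A requirement phrased as "more isn't acceptable" is therefore a statement about the tail, not the average.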
-
Re: Use cases of HBase jaxzin 2010-03-09, 17:29
Thanks Gary, this is great!

I'm designing a central store/service for all user data for the fantasy section of ESPN.com (profile, preferences, record of activity, you name it). The record of activity wouldn't be at page-view granularity but more like "created a league" or "won a trophy" type activities. I expect it will be much more read-heavy, at least for the core column families. And since it's user data, I expect it to be randomly accessed, keyed on our internal user IDs.

I expect it could be fronted by a public RESTful service that browsers might access directly via Ajax, but our initial usage pattern will most likely be server-side inclusion of the data on the hosts responsible for rendering pages. But even if it's only exposed internally, I don't want each client of the data to be aware it's backed by HBase, so the store will be fronted by a web or TCP-based service to manage that abstraction layer. Ideally it would be a RESTful service, but if I can't get that to perform I'd be willing to use a higher-performance protocol like Thrift, Google protobuf, etc.

If that's not enough info for guiding me, I'll gladly volunteer more. Thanks again.

Also, to give you some background on what I know already: the reason I'm asking this publicly is that I spoke with an engineer who did a proof of concept with HBase, and he found the cluster would tip over if you had more than 4 clients connecting to a regionserver for reads or 1 client/node for writes, and that a failed region server corrupted the table in an unrecoverable way. These issues sounded like blockers to me for using HBase in an online, mission-critical way, so I figure I'm missing something big.
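The abstraction layer described above (clients look up user data by user ID and never learn what backs the store) might be sketched as follows. Everything here is hypothetical: the `UserStore` interface, the family names, and the in-memory backend are invented for illustration, and a real backend would wrap whichever HBase client or wire protocol is eventually chosen.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class UserStore(ABC):
    """What clients see: lookups keyed on user ID, no mention of HBase."""

    @abstractmethod
    def get(self, user_id: str, family: str) -> Optional[Dict[str, str]]:
        ...

    @abstractmethod
    def put(self, user_id: str, family: str, values: Dict[str, str]) -> None:
        ...


class InMemoryUserStore(UserStore):
    """Stand-in backend for the sketch; a real one would talk to the data store."""

    def __init__(self) -> None:
        # user_id -> family -> {qualifier: value}
        self._rows: Dict[str, Dict[str, Dict[str, str]]] = {}

    def get(self, user_id: str, family: str) -> Optional[Dict[str, str]]:
        return self._rows.get(user_id, {}).get(family)

    def put(self, user_id: str, family: str, values: Dict[str, str]) -> None:
        self._rows.setdefault(user_id, {}).setdefault(family, {}).update(values)


store: UserStore = InMemoryUserStore()
store.put("user-42", "activity", {"2010-03-09": "created a league"})
store.put("user-42", "profile", {"nickname": "example"})
print(store.get("user-42", "activity"))
```

Because callers depend only on `UserStore`, the backend could be swapped between HBase, a cache, or an RDBMS without touching the page-rendering hosts, which is the property the message is after.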
-
Re: Use cases of HBase Charles Woerner 2010-03-09, 18:20
Slightly off topic, but we have similar requirements to yours and NDBD is working great. As far as latency goes, you can definitely see millisecond-or-less response times using the NDB API. Your throughput requirements should be a piece of cake as well.

10GB is definitely not "big data" and 1-2 ms is pretty low latency, as others have mentioned, so your use case isn't really in the HBase "sweet spot". Not to say that it wouldn't work.

Thanks,
Charles Woerner
-
Re: Use cases of HBase jaxzin 2010-03-09, 19:56
Thanks Charles, I definitely realize that HBase might be the wrong fit because of our current data size, but I'm still interested in the other benefits it provides. And I'm hoping that if we use HBase it can be the paradigm shift that lets us keep more data around, since we want all that "record of activity" stuff I mentioned, which we don't currently have. I expect that will be a lot of data, and I want to ensure we can continue to grow our userbase without scaling issues.

Thanks for your input on the latency; it seems like it will need to be my focus during my own research.
-
Re: Use cases of HBase jaxzin 2010-03-09, 20:08
Gary, I looked at your presentation and it was very helpful. But I do have a few unanswered questions from it, if you wouldn't mind answering them. How big is/was your cluster that handled 3k req/sec? And what were the specs on each node (RAM/CPU)?

When you say latency can be good, what do you mean? Is it even in the ballpark of 1 ms? We already deal with the GC and don't expect perfect real-time behavior, so that might be okay with me.

P.S. I was at Hadoop World NYC and saw Ryan and Jonathan's presentation there but somehow mentally blocked it. Thanks for the reminder.
-
RE: Use cases of HBase Jonathan Gray 2010-03-09, 22:08
Brian,

I would just reiterate what others have said. If your goal is a consistent 1-2ms read latency and your dataset is on the order of 10GB... HBase is not a good match. It's more than what you need and you'll take unnecessary performance hits.

I would look at some of the simpler KV-style stores out there like Tokyo Cabinet, Memcached, or BerkeleyDB, or the in-memory ones like Redis.

JG
-
Re: Use cases of HBaseRyan Rawson 2010-03-09, 22:29
One thing to note is that 10GB is half the memory of a reasonably sized machine. In fact I have seen 128 GB memcache boxes out there.

As for performance, I obviously feel HBase can be performant for real-time queries. To get a consistent response you absolutely have to have 95%+ caching in RAM. There is no way to achieve 1-2ms responses from disk. Throw enough RAM at the problem and I think HBase solves this nicely, and you won't have to maintain multiple architectures.

-ryan
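To put rough numbers on the caching point above, here is a minimal sketch of average read latency as a function of cache hit ratio. The 0.5 ms (RAM) and 10 ms (disk) figures are assumptions chosen for illustration, not measurements from HBase or from anyone in this thread, and `expected_latency_ms` is a hypothetical helper.

```python
def expected_latency_ms(hit_ratio, mem_ms, disk_ms):
    """Average read latency when hit_ratio of reads are served from RAM."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * mem_ms + (1.0 - hit_ratio) * disk_ms


# Assumed figures, for illustration only.
MEM_MS = 0.5    # a read served from RAM
DISK_MS = 10.0  # a read that has to go to disk

for ratio in (0.50, 0.90, 0.95, 0.99, 1.00):
    avg = expected_latency_ms(ratio, MEM_MS, DISK_MS)
    print(f"{ratio:.0%} cached -> {avg:.2f} ms average")
```

Under these assumed figures the average only gets down to about 1 ms once roughly 95% of reads come from RAM, and every miss still pays the full disk cost, so the tail stays well above the 1-2 ms target.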
-
Re: Use cases of HBase
Charles Woerner 2010-03-09, 23:12
Ryan, your confidence has me interested in exploring HBase a bit further for some real-time functionality that we're building out. One question about the mem-caching functionality in HBase: is it write-through or write-back, such that all frequently written items are likely in memory, or is it pull-through via a client query? Or would I be relying on lower-level caching features of the OS and underlying filesystem?

In other words, where there is a high number of both reads and writes, and where 90% of all the reads are on recently (within 5 minutes) written data, would the HBase architecture help ensure that the most recently written data is already in the cache?

Thanks,
Charles Woerner
-
Re: Use cases of HBase
Ryan Rawson 2010-03-09, 23:34
HBase operates more like a write-through cache. Recent writes are in memory (aka the memstore). Older data is in the block cache (by default 20% of Xmx). While you can rely on OS buffering, you also want a generous helping of block caching directly in HBase's regionserver. We are seeing great performance, and our 95th percentiles seem to be related to GC pauses.

So to answer your use case below, the answer is most decidedly 'yes'. Recent values are kept in memory and are read from memory as well.

-ryan

On Tue, Mar 9, 2010 at 3:12 PM, Charles Woerner <[EMAIL PROTECTED]> wrote:
> Ryan, your confidence has me interested in exploring HBase a bit further for
> some real-time functionality that we're building out. One question about the
> mem-caching functionality in HBase... Is it write-through or write-back such
> that all frequently written items are likely in memory, or is it
> pull-through via a client query? Or would I be relying on lower level
> caching features of the OS and underlying filesystem? In other words, where
> there are a high number of both reads and writes, and where 90% of all the
> reads are on recently (5 minutes) written datums would the HBase
> architecture help ensure that the most recently written data is already in
> the cache?
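The read path Ryan describes (memstore first, then block cache, then disk) can be sketched as a toy model. This is not HBase code: `ToyRegionStore` and its methods are hypothetical names, the cache holds one key per "block" for simplicity, and the 64 KB block size is an assumption. Only the 20%-of-heap default comes from the message above.

```python
from collections import OrderedDict


class ToyRegionStore:
    """Toy model of the read path: memstore, LRU block cache, then 'disk'."""

    def __init__(self, heap_bytes, cache_fraction=0.20, block_bytes=64 * 1024):
        # Number of blocks the cache can hold, derived from the heap fraction.
        self.cache_capacity = int(heap_bytes * cache_fraction) // block_bytes
        self.memstore = {}                # recent writes, not yet flushed
        self.block_cache = OrderedDict()  # LRU cache of data read from disk
        self.disk = {}                    # stands in for flushed store files

    def put(self, key, value):
        """Writes always land in the memstore first."""
        self.memstore[key] = value

    def flush(self):
        """Move memstore contents to 'disk' and empty the memstore."""
        self.disk.update(self.memstore)
        for key in self.memstore:
            # Drop cached copies that the flush has just made stale.
            self.block_cache.pop(key, None)
        self.memstore.clear()

    def get(self, key):
        """Return (value, source), where source says which tier answered."""
        if key in self.memstore:
            return self.memstore[key], "memstore"
        if key in self.block_cache:
            self.block_cache.move_to_end(key)  # mark as recently used
            return self.block_cache[key], "block cache"
        value = self.disk[key]  # raises KeyError if the key was never written
        self.block_cache[key] = value
        if len(self.block_cache) > self.cache_capacity:
            self.block_cache.popitem(last=False)  # evict least recently used
        return value, "disk"


store = ToyRegionStore(heap_bytes=4 * 1024**3)  # assumed 4 GB heap
store.put("row1", "v1")
print(store.get("row1"))  # answered by the memstore
store.flush()
print(store.get("row1"))  # first read after the flush goes to 'disk'
print(store.get("row1"))  # repeat read is answered by the block cache
```

In this model a read of recently written data never touches disk, which matches the answer to Charles's question: the data is in memory because it was just written, not because a reader pulled it in.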
-
Re: Re: Use cases of HBase
charleswoerner@... 2010-03-09, 23:40
That's awesome.
-
Re: Use cases of HBase
Andrew Purtell 2010-03-10, 00:12
I came to this discussion late.

Ryan and J-D's use case is clearly successful.

In addition to what others have said, I think another case where HBase really excels is supporting analytics over Big Data (which I define as on the order of a petabyte). Some of the best performance numbers are put up by scanners. There is tight integration with the Hadoop MapReduce framework, not only in terms of API support but also with respect to efficient task distribution over the cluster -- moving computation to data -- and there is a favorable interaction with HDFS's location-aware data placement. Moving computation to data like that is one major reason why analytics using the MapReduce paradigm can put conventional RDBMS/data warehouses to shame for substantially less cost. Since 0.20.0, results of analytic computations over the data can be materialized and served out in real time in response to queries. This is a complete solution.

- Andy
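One way to read "materialized and served out in real time" is sketched below: a batch step computes aggregates once and stores them under row keys, and the serving path is then a single keyed lookup. This is an assumed interpretation, not Andy's own explanation. The function names, the `page|day` key layout, and the plain dict standing in for a table are all hypothetical.

```python
from collections import Counter


def materialize_page_views(events):
    """Batch step: aggregate raw (page, day) events into per-key totals.

    In a real deployment this would be a MapReduce job writing to a table;
    here a dict keyed by 'page|day' stands in for that table.
    """
    counts = Counter()
    for page, day in events:
        counts[f"{page}|{day}"] += 1
    return dict(counts)


def serve_views(summary_table, page, day):
    """Serving step: one keyed lookup, no scan over the raw events."""
    return summary_table.get(f"{page}|{day}", 0)


raw_events = [
    ("/scores", "2010-03-09"),
    ("/scores", "2010-03-09"),
    ("/news", "2010-03-09"),
    ("/scores", "2010-03-10"),
]
summary = materialize_page_views(raw_events)
print(serve_views(summary, "/scores", "2010-03-09"))
```

The design point is that the expensive work happens in the batch step, so the online query cost does not grow with the size of the raw data.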
-
Re: Use cases of HBase
Amandeep Khurana 2010-03-10, 00:20
Quite a few cases have been discussed already but I'll share my experience as well.

HBase does an "ok" job of storing adjacency lists for large graphs. However, processing on the stored graph does not necessarily benefit from data locality, since different vertices in a vertex's adjacency list could reside on different physical nodes. You can partition your graph intelligently, though. HBase offers the ability to work on large graphs since it can scale more than other graph databases or graph processing engines. At some point we were considering building an RDF triple store over HBase (there is still some steam there but not enough to take it up yet).

But as Jonathan said, if you are looking at a data set on the order of 10GB, HBase isn't your best bet.

-Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
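The locality concern with adjacency lists can be illustrated with a small sketch: one row per vertex, rows assigned to key ranges by sorted split points, and a count of how many edges cross from one range to another. Everything here is an assumption for illustration (the helper names, the single split key `"m"`, and the tiny graph); it models range partitioning in general rather than any HBase API.

```python
import bisect


def build_adjacency_rows(edges):
    """One row per vertex: row key -> set of neighbor keys."""
    rows = {}
    for src, dst in edges:
        rows.setdefault(src, set()).add(dst)
        rows.setdefault(dst, set())  # make sure sink vertices get a row too
    return rows


def region_for(row_key, split_keys):
    """Index of the key range holding row_key, given sorted split points."""
    return bisect.bisect_right(split_keys, row_key)


def remote_neighbor_fraction(rows, split_keys):
    """Fraction of edges whose two endpoints fall in different key ranges."""
    total = remote = 0
    for vertex, neighbors in rows.items():
        home = region_for(vertex, split_keys)
        for neighbor in neighbors:
            total += 1
            if region_for(neighbor, split_keys) != home:
                remote += 1
    return remote / total if total else 0.0


edges = [("a", "b"), ("a", "x"), ("b", "c"), ("x", "y")]
rows = build_adjacency_rows(edges)
print(remote_neighbor_fraction(rows, split_keys=["m"]))
```

Choosing row keys so that connected vertices sort near each other lowers this fraction, which is one form the "partition your graph intelligently" advice could take.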
-
Re: Use cases of HBase
Ryan Rawson 2010-03-10, 00:41
Thanks for that one, Andrew. I think a great story is unifying both analytics and real time on a single platform. This makes dev and ops so much easier. In fact the Bigtable paper alludes to this strength. A single data platform for most of your needs is powerful.

Of course some super-specialty needs might require additional platforms, e.g. MySQL for highly relational data, Memcache for high-read data, and so on. But it is important from an architecture point of view to keep the count of distinct systems low.
-
Re: Use cases of HBase
Charles Woerner 2010-03-10, 01:49
As someone working in the clickstream analytics space right now, I strongly second this.

Thanks,
Charles Woerner
-
Re: Use cases of HBase
Wade Arnold 2010-03-10, 05:02
+1
HBase is part of the Hadoop project for a reason, even if we are HDFS's ugly stepchild. Hive and HBase integration is changing how we solve user UI analytics. We used to do massive exports, analytics via map/reduce or Pig, and imports from and to HBase. Now that Hive and HBase tables can be used together we are looking to push most of our batch analytics "online" with simple Hive queries.

http://wiki.apache.org/hadoop/Hive/HBaseIntegration
-
Re: Use cases of HBase
Hua Su 2010-03-10, 09:01
Hi Purtel,
What do you mean by "Since 0.20.0, results of analytic computations over the data can be materialized and served out in real time in response to queries."? What exactly does "materialized" mean here? Would you kindly give more details?

Thanks!

- Hua
-
Re: Use cases of HBase Andrew Purtell 2010-03-10, 09:36
> What exactly does "materialized" mean here? Would you
> kindly give more details?

Basically what I am saying is that the analytic computation can produce a table holding a set of answers to questions which may be asked at some future time. Since HBase 0.20.0, random access to table data has low enough latency to host that information directly. So, typically, a user-written batch process runs over the raw data using TableInputFormat and writes the cooked results via TableOutputFormat into a table, which then answers queries in real time.

Depending on the use case this is usually called either precomputation or materialization. Precomputation is a generic term. Materialization (as in "materialized views") I believe was coined by Oracle. These terms are used interchangeably to refer to the process of computing answers to a set of possible queries in advance. To be pedantic, I should have said precomputation instead of materialization, because the latter implies occasional automatic update of the cached data by the database engine. Of course HBase does not do that.

Hope that helped,

- Andy
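The precomputation pattern described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python dictionaries stand in for the raw table and the answer table, and the names RAW_EVENTS, precompute_view_counts, lookup and ANSWERS are hypothetical. It does not use the HBase API, TableInputFormat or TableOutputFormat.

```python
from collections import defaultdict

# Hypothetical raw data: (user_id, page) page-view events.
# In the setup described above this would be the raw HBase table.
RAW_EVENTS = [
    ("alice", "/scores"),
    ("bob", "/scores"),
    ("alice", "/news"),
    ("alice", "/scores"),
]


def precompute_view_counts(events):
    """Batch step: scan every raw event once and build an answer table."""
    table = defaultdict(int)
    for user_id, page in events:
        # The composite key plays the role of a row key in the answer table.
        table[(user_id, page)] += 1
    return dict(table)


def lookup(table, user_id, page):
    """Serving step: one keyed read, with no aggregation at query time."""
    return table.get((user_id, page), 0)


# Run the batch step ahead of time; queries then only touch ANSWERS.
ANSWERS = precompute_view_counts(RAW_EVENTS)
```

A query such as lookup(ANSWERS, "alice", "/scores") is then a single read against the prepared table. The answers stay as they were until the batch step is run again, which matches the point above that nothing refreshes them automatically.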
-
Re: Use cases of HBase Hua Su 2010-03-10, 09:57
Thank you for your kind explanation!