|
Em
2012-05-28, 17:50
Michel Segel
2012-05-29, 10:39
shashwat shriparv
2012-05-29, 10:49
Em
2012-05-29, 14:28
Ian Varley
2012-05-29, 15:08
Em
2012-05-29, 15:54
N Keywal
2012-05-29, 16:19
Ian Varley
2012-05-29, 17:49
Em
2012-05-29, 18:24
Ian Varley
2012-05-29, 19:26
Em
2012-05-29, 20:25
Ian Varley
2012-05-29, 23:30
|
-
HBase (BigTable) many to many with students and coursesEm 2012-05-28, 17:50
Hello list,
I have some time now to try out HBase and want to use it for a private project. Questions like "How to I transfer one-to-many or many-to-many relations from my RDBMS's schema to HBase?" seem to be common. I hope we can throw all the best practices that are out there in this thread. As the wiki states: One should create two tables. One for students, another for courses. Within the students' table, one should add one column per selected course with the course_id besides some columns for the student itself (name, birthday, sex etc.). On the other hand one fills the courses table with one column per student_id besides some columns which describe the course itself (name, teacher, begin, end, year, location etc.). So far, so good. How do I access these tables efficiently? A common case would be to show all courses per student. To do so, one has to access the student-table and get all the student's courses-columns. Let's say their names are prefixed ids. One has to remove the prefix and then one accesses the courses-table to get all the courses and their metadata (name, teacher, location etc.). How do I do this kind of operation efficiently? The naive and brute force approach seems to be using a Get-object per course and fetch the neccessary data. Another approach seems to be using the HTable-class and unleash the power of "multigets" by using the batch()-method. All of the information above is theoretically, since I did not used it in code (I currently learn more about the fundamentals of HBase). That's why I give the question to you: How do you do this kind of operation by using HBase? Kind regards, Em +
Em 2012-05-28, 17:50
-
Re: HBase (BigTable) many to many with students and coursesMichel Segel 2012-05-29, 10:39
Depends...
Try looking at a hierarchical model rather than a relational model... One thing to remember is that joins are expensive in HBase. Sent from a remote device. Please excuse any typos... Mike Segel On May 28, 2012, at 12:50 PM, Em <[EMAIL PROTECTED]> wrote: > Hello list, > > I have some time now to try out HBase and want to use it for a private > project. > > Questions like "How to I transfer one-to-many or many-to-many relations > from my RDBMS's schema to HBase?" seem to be common. > > I hope we can throw all the best practices that are out there in this > thread. > > As the wiki states: > One should create two tables. > One for students, another for courses. > > Within the students' table, one should add one column per selected > course with the course_id besides some columns for the student itself > (name, birthday, sex etc.). > > On the other hand one fills the courses table with one column per > student_id besides some columns which describe the course itself (name, > teacher, begin, end, year, location etc.). > > So far, so good. > > How do I access these tables efficiently? > > A common case would be to show all courses per student. > > To do so, one has to access the student-table and get all the student's > courses-columns. > Let's say their names are prefixed ids. One has to remove the prefix and > then one accesses the courses-table to get all the courses and their > metadata (name, teacher, location etc.). > > How do I do this kind of operation efficiently? > The naive and brute force approach seems to be using a Get-object per > course and fetch the neccessary data. > Another approach seems to be using the HTable-class and unleash the > power of "multigets" by using the batch()-method. > > All of the information above is theoretically, since I did not used it > in code (I currently learn more about the fundamentals of HBase). > > That's why I give the question to you: How do you do this kind of > operation by using HBase? > > Kind regards, > Em > +
Michel Segel 2012-05-29, 10:39
-
Re: HBase (BigTable) many to many with students and coursesshashwat shriparv 2012-05-29, 10:49
Check out this link may be it will help you somewhat:
http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies On Tue, May 29, 2012 at 4:09 PM, Michel Segel <[EMAIL PROTECTED]>wrote: > Depends... > Try looking at a hierarchical model rather than a relational model... > > One thing to remember is that joins are expensive in HBase. > > > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On May 28, 2012, at 12:50 PM, Em <[EMAIL PROTECTED]> wrote: > > > Hello list, > > > > I have some time now to try out HBase and want to use it for a private > > project. > > > > Questions like "How to I transfer one-to-many or many-to-many relations > > from my RDBMS's schema to HBase?" seem to be common. > > > > I hope we can throw all the best practices that are out there in this > > thread. > > > > As the wiki states: > > One should create two tables. > > One for students, another for courses. > > > > Within the students' table, one should add one column per selected > > course with the course_id besides some columns for the student itself > > (name, birthday, sex etc.). > > > > On the other hand one fills the courses table with one column per > > student_id besides some columns which describe the course itself (name, > > teacher, begin, end, year, location etc.). > > > > So far, so good. > > > > How do I access these tables efficiently? > > > > A common case would be to show all courses per student. > > > > To do so, one has to access the student-table and get all the student's > > courses-columns. > > Let's say their names are prefixed ids. One has to remove the prefix and > > then one accesses the courses-table to get all the courses and their > > metadata (name, teacher, location etc.). > > > > How do I do this kind of operation efficiently? > > The naive and brute force approach seems to be using a Get-object per > > course and fetch the neccessary data. > > Another approach seems to be using the HTable-class and unleash the > > power of "multigets" by using the batch()-method. > > > > All of the information above is theoretically, since I did not used it > > in code (I currently learn more about the fundamentals of HBase). > > > > That's why I give the question to you: How do you do this kind of > > operation by using HBase? > > > > Kind regards, > > Em > > > -- ∞ Shashwat Shriparv +
shashwat shriparv 2012-05-29, 10:49
-
Re: HBase (BigTable) many to many with students and coursesEm 2012-05-29, 14:28
Hi,
thanks for your help. Yes, I know these slides. However I can not find an answer to how to access such schemas efficiently. In case of the given schema for students and courses as in those slides, they say that each column contains the student's id / course's id. However, when you want to build a GUI, you want to get all the courses for a given student and display their names. You *have* the column-names which represent the ids of the courses, however to get the human readable name of a course, you have to access the course-table. I understand the schema, agree with it, but my question was how to access this data efficiently within an application / how to implement the needed behaviour efficiently. Thanks! :) Em Am 29.05.2012 12:49, schrieb shashwat shriparv: > Check out this link may be it will help you somewhat: > > http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies > > On Tue, May 29, 2012 at 4:09 PM, Michel Segel <[EMAIL PROTECTED]>wrote: > >> Depends... >> Try looking at a hierarchical model rather than a relational model... >> >> One thing to remember is that joins are expensive in HBase. >> >> >> >> Sent from a remote device. Please excuse any typos... >> >> Mike Segel >> >> On May 28, 2012, at 12:50 PM, Em <[EMAIL PROTECTED]> wrote: >> >>> Hello list, >>> >>> I have some time now to try out HBase and want to use it for a private >>> project. >>> >>> Questions like "How to I transfer one-to-many or many-to-many relations >>> from my RDBMS's schema to HBase?" seem to be common. >>> >>> I hope we can throw all the best practices that are out there in this >>> thread. >>> >>> As the wiki states: >>> One should create two tables. >>> One for students, another for courses. >>> >>> Within the students' table, one should add one column per selected >>> course with the course_id besides some columns for the student itself >>> (name, birthday, sex etc.). >>> >>> On the other hand one fills the courses table with one column per >>> student_id besides some columns which describe the course itself (name, >>> teacher, begin, end, year, location etc.). >>> >>> So far, so good. >>> >>> How do I access these tables efficiently? >>> >>> A common case would be to show all courses per student. >>> >>> To do so, one has to access the student-table and get all the student's >>> courses-columns. >>> Let's say their names are prefixed ids. One has to remove the prefix and >>> then one accesses the courses-table to get all the courses and their >>> metadata (name, teacher, location etc.). >>> >>> How do I do this kind of operation efficiently? >>> The naive and brute force approach seems to be using a Get-object per >>> course and fetch the neccessary data. >>> Another approach seems to be using the HTable-class and unleash the >>> power of "multigets" by using the batch()-method. >>> >>> All of the information above is theoretically, since I did not used it >>> in code (I currently learn more about the fundamentals of HBase). >>> >>> That's why I give the question to you: How do you do this kind of >>> operation by using HBase? >>> >>> Kind regards, >>> Em >>> >> > > > +
Em 2012-05-29, 14:28
-
Re: HBase (BigTable) many to many with students and coursesIan Varley 2012-05-29, 15:08
Em,
What you're describing is a classic relational database nested loop or hash join; the only difference is that relational databases have this feature built in, and can do it very efficiently because they typically run on a single machine, not a distributed cluster. By moving to HBase, you're explicitly making a tradeoff that's worse for this kind of usage, in exchange for having horizontally scalable data storage (i.e. you can scale to TB or PB of data). But the reality is that this makes what you're describing a lot harder to do. A real answer to this question would involve talking a lot about JOIN theory in relational databases: when do optimizers choose nested loop joins vs. hash joins or merge joins? How do you know which side of a join to drive from (HBase doesn't keep stats, nor does it have an optimizer for that matter). There's not really a general "what's the right way to do this", divorced from those kinds of questions. That said, I can see at least a couple ways to make this particular operation (get all courses for one student) efficient in HBase: 1. You could denormalize the additional information (e.g. course name) into the students table. Then, you're simply reading the student row, and all the info you need is there. That places an extra burden of write time and disk space, and does make you do a lot more work when a course name changes. 2. You could do what you're talking about in your HBase access code: find the list of course IDs you need for the student, and do a multi get on the course table. Fundamentally, this won't be much more efficient to do in batch mode, because the courses are likely to be evenly spread out over the region servers (orthogonal to the students). You're essentially doing a hash join, except that it's a lot less pleasant than on a relational DB b/c you've got network round trips for each GET. The disk blocks from the course table (I'm assuming it's the smaller side) will likely be cached so at least that part will be fast--you'll be answering those questions from memory, not via disk IO. 3. You could also let a higher client layer worry about this. For example, your data layer query just returns a student with a list of their course IDs, and then another process in your client code looks up each course by ID to get the name. You can then put an external caching layer (like memcached) in the middle and make things a lot faster (though that does put the burden on you to have the code path for changing course info also flush the relevant cache entries). In your example, it's unlikely any institution would have more than a few thousand courses, so they'd probably all stay in memory and be served instantaneously. This might seem laborious, and to a degree it is. But note that it's difficult to see the utility of HBase with toy examples like this; if you're really storing courses and students, don't use HBase (unless you've got billions of students and courses, which seems unlikely). The extra thought you have to put in to making schemas work for you in HBase is only worth it when it gives you the ability to scale to gigantic data sets where other solutions wouldn't. Ian On May 29, 2012, at 9:28 AM, Em wrote: > Hi, > > thanks for your help. > Yes, I know these slides. > However I can not find an answer to how to access such schemas efficiently. > In case of the given schema for students and courses as in those slides, > they say that each column contains the student's id / course's id. > However, when you want to build a GUI, you want to get all the courses > for a given student and display their names. > You *have* the column-names which represent the ids of the courses, > however to get the human readable name of a course, you have to access > the course-table. > > I understand the schema, agree with it, but my question was how to > access this data efficiently within an application / how to implement > the needed behaviour efficiently. > > Thanks! :) > Em > > Am 29.05.2012 12:49, schrieb shashwat shriparv: +
Ian Varley 2012-05-29, 15:08
-
Re: HBase (BigTable) many to many with students and coursesEm 2012-05-29, 15:54
Ian,
thanks for your detailed response! Let me give you feedback to each point: > 1. You could denormalize the additional information (e.g. course > name) into the students table. Then, you're simply reading the > student row, and all the info you need is there. That places an extra > burden of write time and disk space, and does make you do a lot more > work when a course name changes. That's exactly what I thought about and that's why I avoid it. The students and courses example is an example you find at several points on the web, when describing the differences and translations of relations from an RDBMS into a Key-Value-store. In fact, everything you model with a Key-Value-storage like HBase, Cassandra etc. can be modeled as an RDMBS-scheme. Since a lot of people, like me, are coming from that edge, we must re-learn several basic things. It starts with understanding that you model a K-V-storage the way you want to access the data, not as the data relates to eachother (in general terms) and ends with translating the connections of data into a K-V-schema as good as possible. > 2. You could do what you're talking about in your HBase access code: > find the list of course IDs you need for the student, and do a multi > get on the course table. Fundamentally, this won't be much more > efficient to do in batch mode, because the courses are likely to be > evenly spread out over the region servers (orthogonal to the > students). You're essentially doing a hash join, except that it's a > lot less pleasant than on a relational DB b/c you've got network > round trips for each GET. The disk blocks from the course table (I'm > assuming it's the smaller side) will likely be cached so at least > that part will be fast--you'll be answering those questions from > memory, not via disk IO. Whow, what? I thought a Multiget would reduce network-roundtrips as it only accesses each region *one* time, fetching all the queried keys and values from there. If your data is randomly distributed, this could result in the same costs as with doing several Gets in a loop, but should work better if several Keys are part of the same region. Am I right or did I missunderstood the concept??? > 3. You could also let a higher client layer worry about this. For > example, your data layer query just returns a student with a list of > their course IDs, and then another process in your client code looks > up each course by ID to get the name. You can then put an external > caching layer (like memcached) in the middle and make things a lot > faster (though that does put the burden on you to have the code path > for changing course info also flush the relevant cache entries). In > your example, it's unlikely any institution would have more than a > few thousand courses, so they'd probably all stay in memory and be > served instantaneously. Hm, in what way does this give me an advantage over using HBase - assuming that the number of courses is small enough to fit in RAM - ? I know that Memcached is optimized for this purpose and might have much faster response times - no doubts. However, from a conceptual point of view: Why does Memcached handles the K-V-distribution more efficiently than a HBase with warmed caches? Hopefully this question isn't that hard :). > This might seem laborious, and to a degree it is. But note that it's > difficult to see the utility of HBase with toy examples like this; if > you're really storing courses and students, don't use HBase (unless > you've got billions of students and courses, which seems unlikely). > The extra thought you have to put in to making schemas work for you > in HBase is only worth it when it gives you the ability to scale to > gigantic data sets where other solutions wouldn't. Well, the background is a private project. I know that it's a lot easier to do what I want in a RDBMS and there is no real need for using a highly scalable beast like HBase. However, I want to learn something new and since I do not break someone's business by trying out new technology privately, I want to go with HStack. Without ever doing it, you never get a real feeling of when to use the right tool. Using a good tool for the wrong problem can be an interesting experience, since you learn some of the do's and don'ts of the software you use. Since I am a reader of the MEAP-edition of HBase in Action, I am aware of the TwitBase-example application presented in that book. I am very interested in seeing the author presenting a solution for efficiently accessing the Tweets of the persons I follow. This is an n:m-relation. You got n users with m tweets and each user is seeing his own tweets as well as the tweets of followed persons in descending order by timestamp. This must be done with a join within an RDMBs (and maybe in HBase also), since I can not think of another scalable way of doing so. However, if you do this by a Join, this means that a person with 40.000 followers needs a batch-request consisting of 40.000 GET-objects. That's huge and I bet that this is everything but not fast nor scalable. It sounds like broken by design when designing for Big Data. Therefore I am interested in general best practices for such problems. Maybe this is a better example for showing the possibilities of HBase than a students and courses example. Thanks for sharing your insights! Em Am 29.05.2012 17:08, schrieb Ian Varley: +
Em 2012-05-29, 15:54
-
Re: HBase (BigTable) many to many with students and coursesN Keywal 2012-05-29, 16:19
Hi,
For the multiget, if it's small enough, it will be: - parallelized on all region servers concerned. i.e. you will be as fast as the slowest region server. - there will be one query per region server (i.e. gets are grouped by region server). If there are too many gets, it will be split in small subsets and the strategy above will be used for each subset, doing one subset after another (and blocking between them). so Large set --> Small set will be ok from this point of view. Large --> Large won't. N. On Tue, May 29, 2012 at 5:54 PM, Em <[EMAIL PROTECTED]> wrote: > Ian, > > thanks for your detailed response! > > Let me give you feedback to each point: >> 1. You could denormalize the additional information (e.g. course >> name) into the students table. Then, you're simply reading the >> student row, and all the info you need is there. That places an extra >> burden of write time and disk space, and does make you do a lot more >> work when a course name changes. > That's exactly what I thought about and that's why I avoid it. The > students and courses example is an example you find at several points on > the web, when describing the differences and translations of relations > from an RDBMS into a Key-Value-store. > In fact, everything you model with a Key-Value-storage like HBase, > Cassandra etc. can be modeled as an RDMBS-scheme. > Since a lot of people, like me, are coming from that edge, we must > re-learn several basic things. > It starts with understanding that you model a K-V-storage the way you > want to access the data, not as the data relates to eachother (in > general terms) and ends with translating the connections of data into a > K-V-schema as good as possible. > > >> 2. You could do what you're talking about in your HBase access code: >> find the list of course IDs you need for the student, and do a multi >> get on the course table. Fundamentally, this won't be much more >> efficient to do in batch mode, because the courses are likely to be >> evenly spread out over the region servers (orthogonal to the >> students). You're essentially doing a hash join, except that it's a >> lot less pleasant than on a relational DB b/c you've got network >> round trips for each GET. The disk blocks from the course table (I'm >> assuming it's the smaller side) will likely be cached so at least >> that part will be fast--you'll be answering those questions from >> memory, not via disk IO. > > Whow, what? > I thought a Multiget would reduce network-roundtrips as it only accesses > each region *one* time, fetching all the queried keys and values from > there. If your data is randomly distributed, this could result in the > same costs as with doing several Gets in a loop, but should work better > if several Keys are part of the same region. > Am I right or did I missunderstood the concept??? > >> 3. You could also let a higher client layer worry about this. For >> example, your data layer query just returns a student with a list of >> their course IDs, and then another process in your client code looks >> up each course by ID to get the name. You can then put an external >> caching layer (like memcached) in the middle and make things a lot >> faster (though that does put the burden on you to have the code path >> for changing course info also flush the relevant cache entries). In >> your example, it's unlikely any institution would have more than a >> few thousand courses, so they'd probably all stay in memory and be >> served instantaneously. > Hm, in what way does this give me an advantage over using HBase - > assuming that the number of courses is small enough to fit in RAM - ? > I know that Memcached is optimized for this purpose and might have much > faster response times - no doubts. > However, from a conceptual point of view: Why does Memcached handles the > K-V-distribution more efficiently than a HBase with warmed caches? > Hopefully this question isn't that hard :). > >> This might seem laborious, and to a degree it is. But note that it's > +
N Keywal 2012-05-29, 16:19
-
Re: HBase (BigTable) many to many with students and coursesIan Varley 2012-05-29, 17:49
A few more responses:
On May 29, 2012, at 10:54 AM, Em wrote: > In fact, everything you model with a Key-Value-storage like HBase, > Cassandra etc. can be modeled as an RDMBS-scheme. > Since a lot of people, like me, are coming from that edge, we must > re-learn several basic things. > It starts with understanding that you model a K-V-storage the way you > want to access the data, not as the data relates to eachother (in > general terms) and ends with translating the connections of data into a > K-V-schema as good as possible. Yes, that's a good way of putting it. I did a talk at HBaseCon this year that deals with some of these questions. The video isn't up yet, but the slides are here: http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012 >> 3. You could also let a higher client layer worry about this. For >> example, your data layer query just returns a student with a list of >> their course IDs, and then another process in your client code looks >> up each course by ID to get the name. You can then put an external >> caching layer (like memcached) in the middle and make things a lot >> faster (though that does put the burden on you to have the code path >> for changing course info also flush the relevant cache entries). > Hm, in what way does this give me an advantage over using HBase - > assuming that the number of courses is small enough to fit in RAM - ? > I know that Memcached is optimized for this purpose and might have much > faster response times - no doubts. > However, from a conceptual point of view: Why does Memcached handles the > K-V-distribution more efficiently than a HBase with warmed caches? > Hopefully this question isn't that hard :). The only architectural advantage I can think of is that reads in HBase still have to check the memstore and all relevant file blocks. So even if that's warm in the HBase cache, it's not quite as lightweight. That said, I guess I was also sort of assuming that table you're looking up into would be the smaller one. The details of which is better here would depend on many things; YMMV. I'd personally go with #2 as a default and them optimize based on real workloads, rather than over engineering it. > Without ever doing it, you never get a real feeling of when to use the > right tool. > Using a good tool for the wrong problem can be an interesting > experience, since you learn some of the do's and don'ts of the software > you use. Very good points. Definitely not trying to discourage learning! :) I just always feel compelled to make that caveat early on, while people are making technology evaluations. You can learn to crochet with a rail gun, but I wouldn't recommend it, unless what you really want to learn is about rail guns, not crocheting. :) Sounds like you really want to learn about rail guns. > Since I am a reader of the MEAP-edition of HBase in Action, I am aware > of the TwitBase-example application presented in that book. > I am very interested in seeing the author presenting a solution for > efficiently accessing the Tweets of the persons I follow. > This is an n:m-relation. > You got n users with m tweets and each user is seeing his own tweets as > well as the tweets of followed persons in descending order by timestamp. > This must be done with a join within an RDMBs (and maybe in HBase also), > since I can not think of another scalable way of doing so. Don't forget about denormalization! You can put copies of the tweets, or at least copies the unique ids of the tweets, into each follower's stream. Yes, that means when Wil Wheaton tweets for the 1000th time about comic con, you get a million copies of the tweet (or ID). But you're trading time & space at write time for extremely fast speeds at write time. Whether this makes sense depends on a zillion other factors, there's no hard & fast rule. > However, if you do this by a Join, this means that a person with 40.000 > followers needs a batch-request consisting of 40.000 GET-objects. That's > huge and I bet that this is everything but not fast nor scalable. It I guess you could say it's "broken by design" (though I'd argue for "unavailable by design" ;). Full join use cases (like you'd do for, say, analytics) don't work well with big data, taking a naive approach. But at least for the twitter app, that's not actually what you're doing. Instead, you've typically got a pattern like *pagination*: you'd never be requesting 40K objects, at most you'd be requesting 20 or 40 objects (a page worth), along some dimension like time. You're optimizing for getting those really fast, with the knowledge that the stream itself is so big, you'd never offer the user a way to act on it in any kind of bulk way. If you want to do processing like that, you do it asynchronously (e.g. via map/reduce). That's actually a really interesting way to see the difference between relational DBs and big data non-relational ones: relational databases promise you easy whole-set operations, and big data nosql databases don't, because they assume the "whole set" will be too big for that. Let's say your product manager for this twitter-like thing said "You know what would be awesome? A widget that shows the average lengths of all tweets in the system, in real time!". And if this product manager was really slick, they might even say "It's easy, look, I even wrote the SQL for it: 'SELECT avg(length(body)) FROM tweets'. I'm a genius." In a SQL database, this query is going to do a full table scan to get this average, every time you ask for it. It would be in HBase, too; the difference is just that HBase makes you be more explicit about it: no simple declarative "SELECT avg ..." syntax, you have to whip out your iterators and see that you're scanning a billion rows to get your average. Would it be nice for HBase to ALSO offer declarative and simple ways to do things like joins, averages, etc? Maybe; but as soon as you dip a toe into this, you kind of have to jump in with both feet. Would you +
Ian Varley 2012-05-29, 17:49
-
Re: HBase (BigTable) many to many with students and coursesEm 2012-05-29, 18:24
Hi Ian,
great to hear your thoughts! Before I am going to give more feedback to your whole post, I want to take your own example and try to get an image of a hbase-approach to do that. > But you're trading time & space at write time for extremely fast > speeds at write time. You ment "extremely fast speeds at read time", don't you? Imagine Wil Wheaton has 1.000.000 followers. One of his tweets triggers 1.000.000 writes (by your model), so that every follower got access to his latest tweets. That needs a high-throughput-model. So far so good - HBase can achieve that. However this means that Sheldon has to do at least two requests to fetch his latest tweets. First: Get the latest columns (aka tweets) of his row and second do a multiget to fetch their content. Okay, that's more than one request but you get what I mean. Am I right? Although I got this little overhead here, does read perform at low latency in a large scale environment? I think that a RDMBS has to do something equal, doesn't it? Maybe it makes sense to have a desing of a key so that all the relevant tweets for a user are placed at the same region? Btw.: Maybe one better does a tall-schema, since row-locking can impact performance a lot for people who follow a lot of other persons but that's not the topic here. Regards, Em Am 29.05.2012 19:49, schrieb Ian Varley: > A few more responses: > > On May 29, 2012, at 10:54 AM, Em wrote: > >> In fact, everything you model with a Key-Value-storage like HBase, >> Cassandra etc. can be modeled as an RDMBS-scheme. >> Since a lot of people, like me, are coming from that edge, we must >> re-learn several basic things. >> It starts with understanding that you model a K-V-storage the way you >> want to access the data, not as the data relates to eachother (in >> general terms) and ends with translating the connections of data into a >> K-V-schema as good as possible. > > Yes, that's a good way of putting it. I did a talk at HBaseCon this year that deals with some of these questions. The video isn't up yet, but the slides are here: > > http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012 > >>> 3. You could also let a higher client layer worry about this. For >>> example, your data layer query just returns a student with a list of >>> their course IDs, and then another process in your client code looks >>> up each course by ID to get the name. You can then put an external >>> caching layer (like memcached) in the middle and make things a lot >>> faster (though that does put the burden on you to have the code path >>> for changing course info also flush the relevant cache entries). >> Hm, in what way does this give me an advantage over using HBase - >> assuming that the number of courses is small enough to fit in RAM - ? >> I know that Memcached is optimized for this purpose and might have much >> faster response times - no doubts. >> However, from a conceptual point of view: Why does Memcached handles the >> K-V-distribution more efficiently than a HBase with warmed caches? >> Hopefully this question isn't that hard :). > > The only architectural advantage I can think of is that reads in HBase still have to check the memstore and all relevant file blocks. So even if that's warm in the HBase cache, it's not quite as lightweight. That said, I guess I was also sort of assuming that table you're looking up into would be the smaller one. The details of which is better here would depend on many things; YMMV. I'd personally go with #2 as a default and them optimize based on real workloads, rather than over engineering it. > >> Without ever doing it, you never get a real feeling of when to use the >> right tool. >> Using a good tool for the wrong problem can be an interesting >> experience, since you learn some of the do's and don'ts of the software >> you use. > > Very good points. Definitely not trying to discourage learning! :) I just always feel compelled to make that caveat early on, while people are making technology evaluations. You can learn to crochet with a rail gun, but I wouldn't recommend it, unless what you really want to learn is about rail guns, not crocheting. :) Sounds like you really want to learn about rail guns. +
Em 2012-05-29, 18:24
-
Re: HBase (BigTable) many to many with students and coursesIan Varley 2012-05-29, 19:26
On May 29, 2012, at 1:24 PM, Em wrote: >> But you're trading time & space at write time for extremely fast >> speeds at write time. > You ment "extremely fast speeds at read time", don't you? Ha, yes, thanks. That's what I meant. > However this means that Sheldon has to do at least two requests to fetch > his latest tweets. > First: Get the latest columns (aka tweets) of his row and second do a > multiget to fetch their content. Okay, that's more than one request but > you get what I mean. Am I right? Yup, unless you denormalize the tweet bodies as well--then you just read the current user's record and you have everything you need (with the downside of massive data duplication). > Although I got this little overhead here, does read perform at low > latency in a large scale environment? I think that a RDMBS has to do > something equal, doesn't it? Yes, but the relational DB has it all on one node, whereas in a distributed database, it's as many RPC calls as you have nodes, or something on the order of that (see Nicolas's explanation, which is better than mine). The key difference is that that read access time for HBase is still constant when you have PB of data (whereas with a PB of data, the relational DB has long since fallen over). The underlying reason for that is that HBase basically took the top level of the seek operation (which is a fast memory access into the top node of a B+ tree, if you're talking about an RDBMS on one machine) and made it a cross machine lookup ("which server would have this data?"). So it's fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall data size. If you remember your algorithms class, constant factors fall out when N gets larger, and you only care about the big O. :) > Maybe it makes sense to have a desing of a key so that all the relevant > tweets for a user are placed at the same region? Again, that works in a denormalized sense, where you lead each row key with the current user (the tweet recipient) and copy the tweets there. If you are saying that you'd somehow find a magic formula so that people who follow each other happen to be on the same region server, you can forget it. Facebook and Twitter have tried partitioning schemes like that for their social graph data, and found there's no magic boundaries, even countries. (I know, citation needed, and I can't find it just now, but I'm sure I read that somewhere. :) But it makes sense; I can follow anybody, anybody can follow me, so on average my followers and followees would be randomly sprayed across all region servers. I either denormalize stuff (so my row contains everything I need) or I commit to doing GETs to all other region servers when I need that data. And if I really need it to be low latency, it's not a winning bet to have every read spray across a lot of servers like that. (That's just my opinion, perhaps I'm wrong about that.) > Maybe one better does a tall-schema, since row-locking can impact > performance a lot for people who follow a lot of other persons but > that's not the topic here. Yes, very true. We also haven't broached the subject of how long writes would take if you denormalize, and whether they should be asynchronous (for everyone, or just for the Wil Wheatons). If you put all of my incoming tweets into a single row for me, that's potentially lock-wait city. Twitter has a follow limit, maybe that's why. :) Ian +
Ian Varley 2012-05-29, 19:26
-
Re: HBase (BigTable) many to many with students and coursesEm 2012-05-29, 20:25
Hi Ian,
answers between the lines: Am 29.05.2012 21:26, schrieb Ian Varley: >> However this means that Sheldon has to do at least two requests to fetch >> his latest tweets. >> First: Get the latest columns (aka tweets) of his row and second do a >> multiget to fetch their content. Okay, that's more than one request but >> you get what I mean. Am I right? > > Yup, unless you denormalize the tweet bodies as well--then you just read the current user's record and you have everything you need (with the downside of massive data duplication). Well, I think this would be bad practice for editable stuff like tweets. They can be deleted, updated etc.. Furthermore at some point the data duplication will come at its expense - big data or not. Do you aggree? >> Although I got this little overhead here, does read perform at low >> latency in a large scale environment? I think that a RDMBS has to do >> something equal, doesn't it? > > Yes, but the relational DB has it all on one node, whereas in a distributed database, it's as many RPC calls as you have nodes, or something on the order of that (see Nicolas's explanation, which is better than mine). Nicolas..? :) > The key difference is that that read access time for HBase is still constant when you have PB of data (whereas with a PB of data, the relational DB has long since fallen over). The underlying reason for that is that HBase basically took the top level of the seek operation (which is a fast memory access into the top node of a B+ tree, if you're talking about an RDBMS on one machine) and made it a cross machine lookup ("which server would have this data?"). So it's fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall data size. If you remember your algorithms class, constant factors fall out when N gets larger, and you only care about the big O. :) Yes :) I am interested in the underlying algorithms of HBase's hash-algorithms. I think it's an interesting approach. >> Maybe it makes sense to have a desing of a key so that all the relevant >> tweets for a user are placed at the same region? > > Again, that works in a denormalized sense, where you lead each row key with the current user (the tweet recipient) and copy the tweets there. If you are saying that you'd somehow find a magic formula so that people who follow each other happen to be on the same region server, you can forget it. Facebook and Twitter have tried partitioning schemes like that for their social graph data, and found there's no magic boundaries, even countries. (I know, citation needed, and I can't find it just now, but I'm sure I read that somewhere. :) But it makes sense; I can follow anybody, anybody can follow me, so on average my followers and followees would be randomly sprayed across all region servers. I either denormalize stuff (so my row contains everything I need) or I commit to doing GETs to all other region servers when I need that data. And if I really need it to be low latency, it's not a winning bet to have every read spray across a lot of servers like that. (That's just my opinion, perha p s I'm wrong about that.) I think we missunderstood eachother. If we go down the line of data duplication, why not generate keys in a way that all of the user's tweet-stream's tweets end up at the same region server? This way, doing a multiget you only call one or two region servers for all your tweets. >> Maybe one better does a tall-schema, since row-locking can impact >> performance a lot for people who follow a lot of other persons but >> that's not the topic here. > > Yes, very true. We also haven't broached the subject of how long writes would take if you denormalize, and whether they should be asynchronous (for everyone, or just for the Wil Wheatons). If you put all of my incoming tweets into a single row for me, that's potentially lock-wait city. Twitter has a follow limit, maybe that's why. :) Oh, do they? However, as far as I know they are using MySQL for their tweets. So, to generalize our results: There are two majore approaches to design data streams which are coming from n users and were streamed by m users (with m beeing egual or larger than n). One approach is to create an index of the interesting data per user where each column represents the key to the information of interest (wide-schema) or each row which is associated by a key-prefix with the user represents a pointer to the data (tall-schema). If you want to access that data one has to design a HBase-side join. On the other hand the second approach is to make usage of massive data duplication. That means writing the same stuff to every user so that every user is able to access the data immediatly without the need of multigets (given that one uses a wide-schema-design). This saves requests and latency at the cost of writes, througput and network traffic. Are there other approaches for designing data-streams in HBase? Do you aggree with that? HBase seems to be kind of over-engineering, if you do not need that scale. Regards, Em +
Em 2012-05-29, 20:25
-
Re: HBase (BigTable) many to many with students and coursesIan Varley 2012-05-29, 23:30
On May 29, 2012, at 3:25 PM, Em wrote: Yup, unless you denormalize the tweet bodies as well--then you just read the current user's record and you have everything you need (with the downside of massive data duplication). Well, I think this would be bad practice for editable stuff like tweets. They can be deleted, updated etc.. Furthermore at some point the data duplication will come at its expense - big data or not. Do you aggree? You're probably right, yes. For the real Twitter, denormalizing every tweet to every follower would likely be ruinously expensive. For something like a message-based app (like, say, chat), it's much more plausible because there's a fixed and limited number of participants who'd see any given message. The important point here, though, is the "YMMV" part ("Your Mileage May Vary", for those not familiar with the idiom). This kind of stuff is exceedingly difficult to give easy, general solutions to because every case is different. How many people does the average user follow? How big are the message bodies? What's the ratio of CPUs to disk spindles? Etc. Yes, but the relational DB has it all on one node, whereas in a distributed database, it's as many RPC calls as you have nodes, or something on the order of that (see Nicolas's explanation, which is better than mine). Nicolas..? :) Ah, sorry, I mean "N Keywal" who replied upthread. His name is Nicolas. :) His response was about under what situations the GET requests would be batched or sent separately. The key difference is that that read access time for HBase is still constant when you have PB of data (whereas with a PB of data, the relational DB has long since fallen over). The underlying reason for that is that HBase basically took the top level of the seek operation (which is a fast memory access into the top node of a B+ tree, if you're talking about an RDBMS on one machine) and made it a cross machine lookup ("which server would have this data?"). So it's fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall data size. If you remember your algorithms class, constant factors fall out when N gets larger, and you only care about the big O. :) Yes :) I am interested in the underlying algorithms of HBase's hash-algorithms. I think it's an interesting approach. It has no hash-join algorithms built in, if that's what you mean. But you can read the client access code path, and see the anatomy of a scan--how it finds out which regions to hit, creates requests to all of them, combines the results, etc. If you mean the overall approach of HBase, as contrasted with the B+ trees used by relational databases: this is based on the ideas of "Log Structured Merge Trees"; there original paper on it is here: http://scholar.google.com/scholar?cluster=5832040552580693098&hl=en&as_sdt=0,5&as_vis=1 If we go down the line of data duplication, why not generate keys in a way that all of the user's tweet-stream's tweets end up at the same region server? This way, doing a multiget you only call one or two region servers for all your tweets. What would be an example of this? Say we have tweets from 4 users. The normalized "tweet" table might have data like: TweetID Time User Body --------- ---- ---- ----------------- 123 T1 A Hello world 234 T2 B OMG BIEBER! 345 T3 C RT @B OMG BIEBER! 456 T4 A I hate twitter 567 T5 B ZOMG MORE BIEBER! etc. If we denorm just the IDs by follower (let's say everyone follows everyone here, and you don't see your own tweets), we'd get: Follower Time Followee TweetID -------- ---- --------- ----------- A T2 B 234 A T3 C 345 A T5 B 567 B T1 A 123 B T3 C 345 B T4 A 456 C T1 A 123 C T2 B 234 C T4 A 456 C T5 B 567 Here, if I know who I am (the follower) I can easily scan a time range of all the tweets I should see, and it'll almost always be in one region (with the rare exception where it happens to fall over the region server break, which would be handled transparently by the client of course). So if I'm user C, and I scan from T1 to T5, I'd get (A,123), (B,234), (A,456), and (B,567) back. Of course, I then want to do a GET for the body of each tweet. You're asking, why not organize the tweet table in such a way as to make it more likely that these 4 tweets will be co-located? The answer to that in the generic case is that you can't; the IDs of the tweets are completely orthogonal to the followers & followees, so you can't make any expectations about that kind of colocation. It looks easy with 3 users, but imagine there are hundreds of millions of users; what are the chances I follow people whose tweets happen to be co-located? In this case, we could do a little better by making the tweet ID into a UUID with a time component, such that tweets around the same time would be grouped together in the table. The problem with this, of course, is that now all insert activity is going to "hot spot" one area of the table (i.e. one region server). So what you really want is sort of the opposite; you want all the incoming traffic to be nicely distributed so the write load is spread out. Thus, you actually probably don't want the tweets you'd see in a given page to be co-located on the same server, unless it's specifically YOUR copy of the tweets (which gets back to the denormalization question above). As you can see, there's no "easy" answer here either. This is why you get paid the big bucks. :) Yes, very true. We also haven't broached the subject of how long writes would take if you denormalize, and whether they should be asynchronous (for everyone, or just for the Wil Wheatons). If you put all of my incoming tweets into a single row for me, that's potentially lock-wait city. Twitter has a follow limit, maybe that's why. :) Oh, do they? However, as f +
Ian Varley 2012-05-29, 23:30
|