|
Jean-Marc Spaggiari
2012-07-02, 19:04
Amandeep Khurana
2012-07-02, 19:21
Jean-Marc Spaggiari
2012-07-02, 19:53
Amandeep Khurana
2012-07-02, 20:08
Jean-Marc Spaggiari
2012-07-02, 23:48
Amandeep Khurana
2012-07-02, 23:52
Jean-Marc Spaggiari
2012-07-03, 01:56
Michael Segel
2012-07-03, 11:57
Jean-Marc Spaggiari
2012-07-03, 12:31
Michael Segel
2012-07-03, 14:03
|
-
Advices for HTable schema
Jean-Marc Spaggiari 2012-07-02, 19:04
Hi,
I have a question regarding the best way to design a table.

Let's imagine I want to store all the people in the world in a database. Everyone has a name, a last name, a phone number, and a lot of flags (sex, age, etc.).

Now, people can have one address, but they can also have 2, or 3, or even more... But they will never have thousands of addresses. Let's say that, usually, they have between 1 and 10.

My table is designed like this:

create 'person', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS => 1, COMPRESSION => 'gz'}

The 'a' CF will contain all the information except the addresses. The 'b' CF will contain only the addresses.

I have a few options to store the addresses. I can:
- Store in CF 'a' a flag telling how many addresses there are, and store "add1" to "addx" in the 'b' CF, with each cell containing an address.
- Store the addresses in CF 'b', using a hash as the column identifier.
- Store the addresses in CF 'b' as the column identifier and simply put '1' in the cell, or a hash.

The first option gives me very quick information about the number of addresses, but if I need to add one address, I have to update both CFs. Same if I have to remove one.

The second option will allow me to add any address without even checking whether it's already there. I can remove one very quickly and add one very quickly. If I want to know the number of addresses, I have to retrieve all the columns in the CF and count them. However, I'm storing almost the same information twice: once with the address, once with the hash (CRC32).

The third option has all the advantages of the second one, and it's not storing the information twice. However, that might result in VERY long column names, and I'm not sure that's good. For example, if I just want to know how many addresses a person has, I will still need to download all of them to the client side to count them.

I'm not able to decide which solution I should use. All of them have some pros and cons, and I'm not advanced enough in HBase to foresee the issues I will have later with one format or the other.

If I look at the online documentation ( http://hbase.apache.org/book.html#keysize ), it seems the third option is not a good one. So I might have to choose between the first two.

Does anyone have any advice or recommendation regarding which of the two formats I should use? Or maybe there are some other options I have not figured out yet?

Thanks

JM
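For readers comparing the three layouts, here is a minimal sketch that models one person's address cells as a plain Python dict of qualifier to value. It does not talk to HBase; the function names and sample addresses are made up for illustration, and the counter is kept next to the addresses rather than in CF 'a' to keep the model small.

```python
import zlib

def layout_counter(addresses):
    """Option 1: a counter plus add1..addN qualifiers, address in the cell."""
    cells = {"count": str(len(addresses))}
    for i, addr in enumerate(addresses, start=1):
        cells[f"add{i}"] = addr
    return cells

def layout_hash(addresses):
    """Option 2: CRC32 of the address as qualifier, address in the cell."""
    return {format(zlib.crc32(a.encode("utf-8")), "08x"): a for a in addresses}

def layout_qualifier(addresses):
    """Option 3: the address itself as qualifier, dummy '1' in the cell."""
    return {a: "1" for a in addresses}

# Hypothetical sample data.
addresses = ["1 Main St, Springfield", "22 Rue de la Paix, Paris"]
for build in (layout_counter, layout_hash, layout_qualifier):
    print(build.__name__, build(addresses))
```

In this model, writing the same address twice under options 2 and 3 lands on the same qualifier, which is why no existence check is needed there, while option 1 needs the counter to pick the next qualifier.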
-
Re: Advices for HTable schema
Amandeep Khurana 2012-07-02, 19:21
Jean-Marc,
These are great questions! Find my answers (and some questions for you) inline.

-ak

On Monday, July 2, 2012 at 12:04 PM, Jean-Marc Spaggiari wrote:
> Hi,
>
> I have a question regarding the best way to design a table.
>
> Let's imagine I want to store all the people in the world on a database.
>
> Everyone has a name, last name, phone number, lot of flags (sex, age, etc.).
>
> Now, people can have one address, but they can also have 2, or 3, or
> even more... But they will never have thousands of addresses. Let's
> say, usually, they have between 1 and 10.

The point to think about here is: what will your read access pattern be? Will you always want the latest address? Or will you want all addresses every time? You also need to define the maximum number of addresses to be stored.

> My table is designed like that.
>
> create 'person', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS => 1, COMPRESSION => 'gz'}

You could easily bump up the versions to a number that limits the max number of addresses you will store.

Having two separate column families is not the way to solve this problem, in my opinion. The reason is that column families let you isolate data with different access patterns. If that's what you want here, multiple families make sense. But again, this goes back to defining your read patterns. Are you going to access all the data together, or are the addresses going to be accessed independently of the rest of the information?

> The 'a' CF will contain all the information except the address.
> The 'b' CF will contain only the address.
>
> I have few options to store the addresses.
> I can:
> - Store in CF 'a' a flag to tell how many addresses there is and store
> "add1" to "addx" in the 'b' CF with each cell containing the address.

This becomes a case where you'll need to build transaction-like logic in your client code. When you want to store an additional address, you'll need to do the following:
1. Read the counter from 'a'. Let's say that is n.
2. Store the next address with CQ add[n+1].
3. Store n+1 as the counter.

That complicates the client code and is undesirable. Moreover, you are accessing both column families on any access to the address info. It is probably better to store the counter in 'b' instead of 'a' in this approach, but you still have the complication of the transaction-like logic.

> - Store in CF 'b' the addresses using an hash as the column identifier.

The hash doesn't buy you anything. How do you ensure that you are reading the latest address? Again, this goes back to defining the read patterns.

> - Store in CF 'b' the addresses as the column identifier and simply
> put '1' in the cell, or a hash.

Same problem as the last approach.

> The first option give me very quick information about the number of
> addresses, but if I need to add one address, I have to update the 2
> CF. Same if I have to remove one.
> The second option will allow me to add any address even without
> checking if it's already there. I can remove one very quickly and add
> one very quickly. If I want to know the number of addresses, I have to
> retrieve all the columns in the CF and count them. However, I'm
> storing almost the same information twice. One time with the address,
> one time with the hash (CRC32).
> The 3rd option has all the advantages of the second one but also, it's
> not storing the information twice. However, that might result in VERY
> long column names. And I'm not sure it's good. Like, if I just want to
> know how many address this person has, I will still need to download
> them totally on the client side to count them.

Long column qualifiers are perfectly fine and take the same amount of disk space as storing the data in the cells. I don't believe that should be a concern.

> I'm not able to find which solution I should use. All of them have
> some pros and cons. And I'm not advanced enough in HBase to foresee
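The three-step counter update described in this reply can be sketched as follows. This is a toy model, not HBase client code: the row is a plain dict, and `read_counter` and `write_address` are hypothetical helpers. It shows what happens when two clients run step 1 before either reaches steps 2 and 3.

```python
def read_counter(row):
    """Step 1: read the current address count (0 if the row is new)."""
    return int(row.get("count", "0"))

def write_address(row, n, address):
    """Steps 2 and 3: store the address as add[n+1], then store n+1 as the counter."""
    row[f"add{n + 1}"] = address
    row["count"] = str(n + 1)

row = {}

# Two clients both read the counter before either one writes.
n_client1 = read_counter(row)
n_client2 = read_counter(row)

write_address(row, n_client1, "1 Main St")
write_address(row, n_client2, "22 Rue de la Paix")

# Both clients computed the same qualifier, so the second write
# overwrote the first address and the counter still says 1.
print(row)
```

Without some form of atomic check or locking around the three steps, one of the two addresses is silently lost, which is the transaction-like complexity the reply warns about.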
-
Re: Advices for HTable schema
Jean-Marc Spaggiari 2012-07-02, 19:53
Hi Amandeep,
Thanks for your prompt reply.

I forgot to add that all the addresses are valid at the same time. There is no order among the addresses. They are all active addresses at the same time. If one is not valid any more, it's removed. If there is a new one, it's added to the list, not replacing any other. So it's not "the last address": I have to consider all the addresses when I process them.

Regarding storing the address count in the first CF, don't ask me why I put it in 'a'. I have no clue why I did not think about adding it to 'b' directly. I agree that it's useless to have it in 'a'.

The idea of the hash as a column name was just to have something to put there. It's like the '1' in the third solution. A random number would do the same thing.

I'm accessing the data in 2 ways:
1) I access the person information to update it, or retrieve all of it to display it.
2) I access only the addresses to compute some statistics about them, which means I usually read ALL the addresses for one person and not just one address at a time.

So basically, the 3 options are almost the same thing. If I store the number of addresses, I will have more work when I have to add/remove one entry, the same amount of work when I want to parse the entries, and less work when I want to count the entries.

Difficult choice. I don't find any schema better than the others because all of them have pros and cons. For now, my preferred one is #1 because it sounds more "natural" to store the number of columns, then parse them by name, etc., but I think I need to think about it a little bit more before making any decision...

JM
-
Re: Advices for HTable schema
Amandeep Khurana 2012-07-02, 20:08
Responses inline
On Monday, July 2, 2012 at 12:53 PM, Jean-Marc Spaggiari wrote:
> Hi Amandeep,
>
> Thanks for your prompt reply.
>
> I forgot to add that all the addresses are valid at the same time.
> There is no order among the addresses. They are all active addresses at
> the same time. If one is not valid any more, it's removed. If there is
> a new one, it's added to the list, not replacing any other. So it's
> not "the last address", but I have to consider all the addresses when
> I will process them.
>
> Regarding adding the address count in the first CF, don't ask me why I
> have put it in 'a'. I have no clue why I did not think about adding
> it to 'b' directly. I agree that it's useless to have it in 'a'.
>
> The idea of the hash as a column name was just to have something to
> put there. It's like the '1' in the second solution. A random number
> will do the same thing.
>
> I'm accessing the data in 2 ways.
> 1) I access the person information to update them or retrieve all of
> them to display them
> 2) I access only the address to compute some statistics about it.
> Which means usually I read ALL the addresses for one person and not just
> one address at a time.

So, does that mean that the addresses are accessed independently of the other information and you always access all the addresses together? Or does it mean that the addresses are accessed along with the other information for display or retrieval, and are also accessed separately for the stats calculation?

You could consider the following ideas:

1. Store everything in 'a' and let all addresses go into the column 'a:address'. Increase the versions to N, where N is the max number of addresses you want to store for any user.

OR

2. Store addresses in an entirely different table with the rowkey being user+address. The column qualifier and cell value could be just a simple 1 for the sake of having something there. When you want to get all addresses for a user, you just scan from start key 'user' to end key 'user+1'.

I'm not a fan of the first schema option that you outlined earlier because of the complexity involved in the client code. That approach works with relational databases, where you have the ability to do transactions. In the HBase world, not so much.

> So basically, the 3 options are almost the same thing. If I
> store the number of addresses, I will have more work when I have to
> add/remove one entry, same amount of work when I want to parse the
> entries, and less work when I want to count the entries.
>
> Difficult choice. I don't find any schema better than the other one
> because all of them have pros and cons. For now, my preferred one is #1
> because it sounds more "natural" to store the number of columns, then
> parse them by name, etc. but I think I need to think about it a little
> bit more before taking any decision...
>
> JM
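The second idea (rowkey = user+address, then a range scan per user) can be sketched with a sorted list standing in for the table's sorted row keys. This is a simulation only, with no HBase calls; the `\x00` separator between user and address is an assumption added here so that 'user1' does not also match 'user10', and all names and sample rows are hypothetical.

```python
import bisect

SEP = b"\x00"  # assumed separator between the user id and the address

def row_key(user, address):
    """Build the composite key user + SEP + address as bytes."""
    return user.encode("utf-8") + SEP + address.encode("utf-8")

def scan_user(sorted_keys, user):
    """Return all addresses for one user via a [start, stop) range scan."""
    start = user.encode("utf-8") + SEP
    # The stop key is the start key with its last byte incremented ("user+1").
    stop = user.encode("utf-8") + bytes([SEP[0] + 1])
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return [k[len(start):].decode("utf-8") for k in sorted_keys[lo:hi]]

# Row keys are kept in byte order, as in a sorted key-value store.
keys = sorted([
    row_key("user1", "1 Main St"),
    row_key("user1", "22 Rue de la Paix"),
    row_key("user10", "5 High St"),
    row_key("user2", "9 Elm Rd"),
])

print(scan_user(keys, "user1"))
```

Adding or removing an address is then a single-row put or delete, and counting addresses is the length of the scan result; no counter has to be kept in sync.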
-
Re: Advices for HTable schema
Jean-Marc Spaggiari 2012-07-02, 23:48
Addresses will mainly be accessed independently, and only sometimes with the other data.

I'm not sure I prefer the "versions" option either. So if I go with a 2nd table, does it mean it's better to have more rows than more columns?

Here are the 2 options now. Both with a new table.

1) I store the key "personID" and a:a1 to a:an for the addresses.
2) I store the key "personID" + "address".

In both I will have the same amount of data. In #1 the total size will be smaller since the key will be stored only once. In #1 I will have more columns, whereas in #2 I will have more rows.

Is one better than the other? Also, if I go with option 1, why is it better to have a 2nd table instead of a 2nd column family?

JM
-
Re: Advices for HTable schema
Amandeep Khurana 2012-07-02, 23:52
Inline
On Monday, July 2, 2012 at 4:48 PM, Jean-Marc Spaggiari wrote:
> Addresses will mainly be accessed independently, and only sometimes
> with the other data.
>
> I'm not sure either to prefer the "versions" option. So if I go with a
> 2nd table, does it mean it's better to have more rows than more
> columns?
>
> Here are the 2 options now. Both with a new table.
>
> 1) I store the key "personID" and a:a1 to a:an for the addresses.
> 2) I store the key "personID" + "address"
>
> In both I will have the same amount of data. In #1 total size will be
> smaller since the key will be stored only once.

The size will be the same. The underlying HFile will store 1 row per cell, and the number of cells in both cases is the same.

However, the first approach, with multiple columns for addresses, needs you to keep track of the number and makes updates, deletes, and additions complicated, as I highlighted earlier. The second option, putting both things in the key, makes life much easier.

If the data is primarily being accessed independently, I'd go with option 2.

> In #1 I will have more
> columns where in #2 I will have more rows.
>
> Is there one better than the other one? Also, if I go with option 1,
> why is it better to have a 2nd table instead of a 2nd column family?
>
> JM
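To make the "size will be the same" point concrete, here is a rough sketch that estimates the bytes used per cell in both layouts. It assumes the classic uncompressed KeyValue layout (4-byte key length, 4-byte value length, then a key made of 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp and 1-byte type, then the value) and ignores compression, block encoding and index overhead. The sample person id, family name and addresses are hypothetical.

```python
def keyvalue_size(row, family, qualifier, value):
    """Approximate size in bytes of one uncompressed cell, row key included."""
    key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + len(value)

person = b"person42"
addresses = [b"1 Main St, Springfield", b"22 Rue de la Paix, Paris"]

# Option 1: one row per person, qualifiers a1..an, address in the cell value.
wide = sum(
    keyvalue_size(person, b"a", b"a%d" % i, addr)
    for i, addr in enumerate(addresses, start=1)
)

# Option 2: one row per person+address, dummy qualifier and value of '1'.
tall = sum(keyvalue_size(person + addr, b"a", b"1", b"1") for addr in addresses)

print(wide, tall)
```

Because every cell carries its full row key, the person id is repeated once per address in both layouts; the address bytes simply move from the value into the row key.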
-
Re: Advices for HTable schema
Jean-Marc Spaggiari 2012-07-03, 01:56
2012/7/2, Amandeep Khurana <[EMAIL PROTECTED]>:
>> Here are the 2 options now. Both with a new table.
>>
>> 1) I store the key "personID" and a:a1 to a:an for the addresses.
>> 2) I store the key "personID" + "address"
>>
>> In both I will have the same amount of data. In #1 total size will be
>> smaller since the key will be stored only once.
>
> The size will be the same. The underlying HFile will store 1 row per cell
> and the number of cells in both cases is the same.
>
> However, the first approach with multiple columns for addresses needs you to
> keep track of the number and makes updates, deletes, additions complicated
> as I highlighted earlier. The second option with putting both things in the
> key makes life much easier.
>
> If the data is primarily being accessed independently, I'd go with option 2.

Oh! I see! My misunderstanding comes from my lack of HBase knowledge/reflexes. I forgot it was storing the data that way. So I think I will most probably give this 2nd option a try! Thanks for sharing your ideas throughout the day.

JM
-
Re: Advices for HTable schema
Michael Segel 2012-07-03, 11:57
Hi,
You're overthinking this.

Take a step back and remember that you can store anything you want as a byte stream in a column. Literally.

So you have a record that could be a text blob. Store it in one column. Use JSON to define its structure and fields.

The only thing that makes it difficult is that you will need to pull out everything just to insert or update something. So then maybe segment your data into logical blocks. Like a column that stores the physical attributes of the person. Another column that stores the list of addresses for the person. Another column that stores the list of aliases used by the person.

Don't think in relational terms. HBase isn't relational, and ER is not the best way to model in a NoSQL database. Think IMS/COBOL (mainframe) or Dick Pick's Revelation's OS.

The only relationships in HBase are weak relationships between tables. Column families currently have some nasty side effects, so you may want to consider how you apply them.

Think in terms of records.

Look at storing data using Avro.
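As an illustration of the "logical blocks" idea above, here is a minimal sketch that stores each block of a person record as one JSON-encoded byte string, the way it could be written to one column each. It uses the standard-library `json` module rather than Avro, the dict stands in for a row's cells, and the block names, helper functions and sample record are assumptions for the example.

```python
import json

def to_cells(person):
    """Encode each logical block as one JSON byte string (one cell per block)."""
    return {
        block: json.dumps(fields, sort_keys=True).encode("utf-8")
        for block, fields in person.items()
    }

def add_address(cells, address):
    """Read-modify-write of the whole 'addresses' cell to add one address."""
    addresses = json.loads(cells["addresses"].decode("utf-8"))
    if address not in addresses:
        addresses.append(address)
    cells["addresses"] = json.dumps(addresses, sort_keys=True).encode("utf-8")

# Hypothetical record split into three logical blocks.
person = {
    "attributes": {"name": "Jean", "last_name": "Dupont", "sex": "M"},
    "addresses": ["1 Main St, Springfield"],
    "aliases": ["JD"],
}

cells = to_cells(person)
add_address(cells, "22 Rue de la Paix, Paris")
print(json.loads(cells["addresses"]))
```

Note the trade-off mentioned in the message: adding one address means pulling out and rewriting the whole addresses cell, but the other blocks are left untouched.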
-
Re: Advices for HTable schema
Jean-Marc Spaggiari 2012-07-03, 12:31
Hi Michael,
I'm trying to dive deeply into HBase and forget all my RDBMS knowledge, but sometimes it's difficult not to compare, and I don't yet have all the right thinking mechanisms. The more Amandeep replied yesterday, the clearer it became, but it seems I still have a LOT to learn.

I will never update one single value from the data I have. I will update all the columns for one row, or none. When I need to read them, I usually need to read all of them, or almost all. Not just one. I moved to a multiple-column architecture because I wrote the application with MySQL first, but the more I read, the more I see that it's not the right way.

I can have 2 tables:
- One with a key made from the person ID, and only one single CF and one C, with everything in a single cell stored as JSON output serialized using Avro, like you are suggesting.
- A second table with rows like PERSONID_PERSONADDRESS, with a dummy CF and C just to keep one cell.

In the end, that will meet all my needs, but it will require a bit more thinking. And it's so far from the initial design! But I think that's definitely a good solution.

Thanks!

JM
-
Re: Advices for HTable schema
Michael Segel 2012-07-03, 14:03
Comparisons are fine.
Try not to think of this in terms of rows and columns, but in terms of records.

Think of each record as being atomic. Create a list of all of the components that make up that record. Then combine like components into structures. Like the street address. Add in a couple of fields to indicate when the person lived there. If there is no end date, it must be a current address. You could put them in an array; however, arrays imply a finite size. An ordered set or list would be more appropriate.

Each of these structures then becomes a column.
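The address structure described above, with start and end dates where a missing end date means the address is current, could look like the following sketch. The `Address` class, its field names and the sample history are hypothetical and only illustrate the record-oriented modelling; how the list is serialized into a column is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class Address:
    street: str
    city: str
    start: date
    end: Optional[date] = None  # no end date means the address is current

    @property
    def is_current(self) -> bool:
        return self.end is None

def current_addresses(addresses: List[Address]) -> List[Address]:
    """Keep only the addresses that have no end date."""
    return [a for a in addresses if a.is_current]

# An ordered list of addresses for one person (hypothetical data).
history = [
    Address("1 Main St", "Springfield", date(2005, 3, 1), date(2010, 6, 30)),
    Address("22 Rue de la Paix", "Paris", date(2010, 7, 1)),
]

print([a.street for a in current_addresses(history)])
```

A list is used rather than a fixed-size array, in line with the remark that arrays imply a finite size.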