Simon Felix, 2011-07-13, 10:02
Flavio Junqueira, 2011-07-13, 12:01
Simon Felix, 2011-07-13, 12:15
Ted Dunning, 2011-07-13, 16:52
Simon Felix, 2011-07-13, 18:13
Scott Fines, 2011-07-13, 18:25
Yang, 2011-07-13, 17:17
Yang, 2011-07-13, 17:21
Ted Dunning, 2011-07-13, 17:31
Yang, 2011-07-13, 17:37
Ted Dunning, 2011-07-13, 17:55
Simon Felix, 2011-07-13, 18:01
Yang, 2011-07-13, 18:03
-
Shared block storage via ZooKeeper
Simon Felix, 2011-07-13, 10:02

Hello everyone

What is the best way to build a distributed, shared storage system on top of ZooKeeper? I'm talking about block storage in the terabyte range (i.e. store billions of 4k blocks). Consistency and availability are important, as is throughput (both read and write). I need at least 50 MB/s with 3 nodes with two regular SATA drives each for my application.

Some options I came up with:
1. Use ZooKeeper directly as a data store (not recommended according to the docs, and it really leads to abysmally bad performance; I tested that)
2. Use Cassandra as data store
3. Use BookKeeper as write-ahead log and implement my own underlying store
4. Use ZooKeeper to create my own (probably buggy...) data store

What would you recommend? Are there other options?

Cheers,
Simon
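To put the stated requirements into numbers, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes "4k" means 4096 bytes, takes "billions" as exactly one billion blocks, reads "MB" as 10**6 bytes, and assumes every write is replicated to all three nodes. None of these choices is confirmed by the post.

```python
# Back-of-the-envelope sizing for the requirements in the post.
# Every constant below is an assumption made for illustration.
BLOCK_SIZE = 4096                 # bytes per block ("4k")
NUM_BLOCKS = 1_000_000_000        # "billions" taken as one billion
TARGET_THROUGHPUT = 50 * 10**6    # 50 MB/s, in bytes per second
NODES = 3
DRIVES_PER_NODE = 2
REPLICAS = 3                      # assumed: each write lands on every node

total_bytes = BLOCK_SIZE * NUM_BLOCKS
blocks_per_second = TARGET_THROUGHPUT / BLOCK_SIZE
total_drives = NODES * DRIVES_PER_NODE
# Each logical write is stored REPLICAS times, spread over all drives.
write_bytes_per_drive = TARGET_THROUGHPUT * REPLICAS / total_drives

print(f"raw capacity per copy: {total_bytes / 10**12:.3f} TB")
print(f"block operations per second: {blocks_per_second:.0f}")
print(f"write load per drive: {write_bytes_per_drive / 10**6:.1f} MB/s")
```

Under these assumptions the cluster has to sustain roughly twelve thousand 4096-byte operations per second. That is far more than a handful of spinning SATA drives typically deliver as random I/O, so any workable design would probably have to turn the writes into mostly sequential, batched I/O.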
-
Re: Shared block storage via ZooKeeper
Flavio Junqueira, 2011-07-13, 12:01

Hi Simon,

It is not entirely clear to me what you need ZooKeeper for in this case. Are blocks replicated, and do you need to guarantee that the updates are consistent across replicas?

On your observations, I'm quite sure people will have an opinion, so here are my thoughts, which might not be representative of the whole community:

1- You're right, we do not recommend using ZooKeeper directly as the data store. ZooKeeper servers keep their state in memory.
2- Cassandra already provides replication. Are you trying to strengthen the guarantees of Cassandra? I don't get it...
3- Sounds right that you could use BK as a journal, but it is not clear which element is writing to the journal. Are you assuming a metadata manager such as the namenode of HDFS?
4- I'm not sure what this option means. Are you proposing ZooKeeper to manage the metadata of the file system? If so, I don't find it entirely unrealistic, since metadata updates are supposed to be small and the performance of ZooKeeper should be good enough for your case, but it might be awkward to have your block storage clients talking directly to ZooKeeper. Changes to metadata management would imply in this case rolling out a new version of the client application instead of just having the changes implemented on the service side.

-Flavio

flavio junqueira
research scientist
[EMAIL PROTECTED]
direct +34 93-183-8828
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301
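Points 1 and 4 above can be contrasted with a small calculation: the raw block data is far too large for a service that keeps its state in memory, whereas placement metadata could be small if it is tracked per extent rather than per block. The extent size and the per-record size below are purely hypothetical values chosen for illustration; nothing in the thread specifies them.

```python
# Hypothetical comparison: raw data volume versus extent-level metadata.
# BLOCKS_PER_EXTENT and BYTES_PER_EXTENT_RECORD are illustrative assumptions.
NUM_BLOCKS = 1_000_000_000
BLOCK_SIZE = 4096
BLOCKS_PER_EXTENT = 16_384        # 64 MiB of data per extent (assumed)
BYTES_PER_EXTENT_RECORD = 128     # replica list, version, etc. (assumed)


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


num_extents = ceil_div(NUM_BLOCKS, BLOCKS_PER_EXTENT)
metadata_bytes = num_extents * BYTES_PER_EXTENT_RECORD
data_bytes = NUM_BLOCKS * BLOCK_SIZE

print(f"extents: {num_extents}")
print(f"metadata: {metadata_bytes / 10**6:.1f} MB")
print(f"data: {data_bytes / 10**12:.3f} TB")
```

With these made-up parameters the metadata is a few megabytes, while the data itself is several terabytes. That is the kind of gap that makes a coordination service plausible for metadata and unsuitable for the blocks themselves.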
-
RE: Shared block storage via ZooKeeper
Simon Felix, 2011-07-13, 12:15

Thanks for the reply. I'll try to clarify my question a bit. I want to simulate a single, fault-tolerant shared block storage device. This means everything should be replicated and consistent. All that system manages is (for example) one billion blocks, each containing exactly 4096 bytes. I do not need any metadata per block or locking. There will be multiple nodes, all reading and writing the data concurrently. If two nodes A and B write to the same block concurrently, I expect that all nodes have either version A or version B of the block afterwards.

I'm not sure which of the options is the easiest to implement and which will give me the highest performance.

#2: Cassandra: Would you store the data in multiple rows? Columns? How much data per column? I should probably ask the Cassandra people about this...

#3: BookKeeper: Every node is writing to the data. I'd use BookKeeper as write-ahead log. Was BookKeeper built for that kind of workload?

Has anyone else done something similar? I couldn't find anything in the archives...

Simon
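The "either version A or version B everywhere" requirement can be illustrated with a toy model. The sketch below assumes that all writes pass through one totally ordered log and that every replica applies the log in order; the `Replica` class, the `write` helper and the plain Python list standing in for the log are hypothetical names for illustration, not part of ZooKeeper, BookKeeper or Cassandra.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096


@dataclass
class Replica:
    """Toy replica that applies a shared, ordered log of block writes."""
    blocks: dict = field(default_factory=dict)  # block number -> bytes
    applied: int = 0                            # log entries applied so far

    def catch_up(self, log):
        # Apply every entry not yet seen, strictly in log order.
        for block_no, data in log[self.applied:]:
            self.blocks[block_no] = data
        self.applied = len(log)


log = []  # stand-in for a totally ordered, replicated log


def write(block_no, data):
    assert len(data) == BLOCK_SIZE
    log.append((block_no, data))


# Nodes A and B write block 7 "concurrently"; the log picks some order.
version_a = b"A" * BLOCK_SIZE
version_b = b"B" * BLOCK_SIZE
write(7, version_a)
write(7, version_b)

replicas = [Replica() for _ in range(3)]
for replica in replicas:
    replica.catch_up(log)

# Every replica ends up with the same version of block 7.
winners = {replica.blocks[7] for replica in replicas}
print(len(winners))
```

Because every replica sees the two writes in the same order, they cannot disagree about the final content of the block once they have caught up. The hard part in a real system is providing that single order with acceptable throughput.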
-
Re: Shared block storage via ZooKeeper
Ted Dunning, 2011-07-13, 16:52

Simon,

What you are describing is (roughly) a general read-write distributed and replicated file system. This is a hard problem if you want high performance, absolute consistency and significant amounts of failure tolerance. Building such a system from scratch is a difficult proposition.

Frankly, it also sounds just like the filesystem component of MapR (conflict alert, I work for MapR Technologies). You may have additional constraints on what you are looking for, but to meet the requirements that you have already stated, you should take a look at our offering. I can imagine scenarios where this wouldn't be satisfactory, particularly if this is a homework assignment, but if you are simply trying to solve a real engineering problem, it should do very well. I don't want to hijack this list with non-ZooKeeper discussion, so feel free to contact me directly for more pointers.

Ohh... I should mention MapR uses ZooKeeper prominently and is glad to do so. The strictness and durability of ZK are ideal as the last-resort determinant of coordination. In many areas of our system, the ZK trade-offs are not appropriate, especially where speed is critical, but that isn't what ZK was designed to do. Using ZK appropriately gives extremely good results.
-
RE: Shared block storage via ZooKeeper
Simon Felix, 2011-07-13, 18:13

Thanks for the suggestion, but I guess I cannot use MapR for my purpose. I’m working on a non-commercial hobby project that one day I might make commercial. I believe what I want to use/build is simpler than a distributed file system because I don’t have to care about:

- Metadata
- Locking
- Hierarchies
- Access rights
- Lookups

So if anyone knows of a free, appropriately licensed alternative, I’d be happy to use that.
-
Re: Shared block storage via ZooKeeper
Scott Fines, 2011-07-13, 18:25

Cassandra and/or HBase would work pretty well for this, it sounds like, though I'm not sure that HBase satisfies your hardware requirements. Project Voldemort might also be a good option, though you'd suffer if you tried to get groups of blocks at the same time.

If I were writing it, and there was some information regarding a good grouping policy, I would probably use Cassandra and store each block in a single column. Of course, you could also store each block in a row with a single column, which would also work, depending on your access patterns.

If you used this, you would probably only use ZooKeeper for:

1. Transactional support (or for row locking, at least)
2. Cassandra node discovery (for automated discovery of scaled-out machines)
3. Failure detection(?)

Since Cassandra doesn't necessarily have strong consistency guarantees, ZooKeeper could also be used as an ordering provider to create happens-before relationships. It all just depends on what you're actually trying to do, I think.

Scott Fines
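The two Cassandra layouts mentioned above (several blocks grouped into one row with one column per block, versus one row per block with a single column) come down to how a block number maps to a row key and a column name. The following is a minimal sketch of that mapping only. The key formats, the `BLOCKS_PER_ROW` value and both function names are hypothetical, and no Cassandra client API is used.

```python
from typing import Tuple

BLOCKS_PER_ROW = 256  # hypothetical grouping policy: 1 MiB of blocks per row


def grouped_layout(block_no: int) -> Tuple[str, str]:
    """Group consecutive blocks into one row; each block is one column."""
    row = block_no // BLOCKS_PER_ROW
    column = block_no % BLOCKS_PER_ROW
    return f"row-{row:08d}", f"col-{column:03d}"


def row_per_block_layout(block_no: int) -> Tuple[str, str]:
    """One row per block, with a single column holding the data."""
    return f"block-{block_no:010d}", "data"


print(grouped_layout(1000))
print(row_per_block_layout(1000))
```

With the grouped layout, reading a run of neighbouring blocks touches a single row, which is where a good grouping policy would pay off. The row-per-block layout spreads blocks more evenly across the cluster but gives no locality for sequential reads.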
-
Re: Shared block storage via ZooKeeper
Yang, 2011-07-13, 17:17

actually I was just thinking about this and tried to ask exactly the same question.

now zk is used to store small pieces of data such as shared config, and used for locking/coordination, but since it has a replicated data store, it would be nice to use it to store large volumes of data directly.

in fact, from the "Paxos Made Live" paper:
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/paxos_made_live.pdf
page 3:

"We devoted effort to designing clean interfaces separating the Paxos framework, the database, and Chubby. We did this partly for clarity while developing this system, but also with the intention of reusing the replicated log layer in other applications. We anticipate future systems at Google that seek fault-tolerance through replication. We believe that a fault-tolerant log is a powerful primitive on which to build such systems."

essentially, in the google paxos implementation, application code can simply grab the latest committed log record and use it for whatever it wants for the application. if ZooKeeper abstracts out the messaging protocol and provides the committed transaction "stream" as the interface to applications, potentially we could use it for many applications, including data storage. note that this is completely outside of the current ZK data model (znodes etc.); all we use from ZK is the underlying committed transaction stream. probably this part of ZK can be provided as a library.

yang
-
Re: Shared block storage via ZooKeeper
Yang, 2011-07-13, 17:21

ah.... never mind, it is BookKeeper that I want...
-
Re: Shared block storage via ZooKeeper
Ted Dunning, 2011-07-13, 17:31

See BookKeeper.

The analogy is this:

ZK => Chubby
BookKeeper => distributed log
Application => Application.
Re: Shared block storage via ZooKepper
Yang 2011-07-13, 17:37
Assuming you use option 3 (BookKeeper), the following is probably far too simplified, but this is the idea:

All writers write to a BookKeeper ledger, and each of your actual datastore nodes keeps reading the ledger. Each record is the serialized form of a DB write op. When the ledger reader reads a record, it deserializes it and applies it to its local datastore, for example MySQL, BDB, or something like the LSM tree used by Cassandra (memtable + SSTable).

Reads go directly to the datastore nodes themselves.

Would this work? That does not sound like a lot of work.
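The proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Ledger` is a hypothetical in-memory stand-in for a replicated append-only log (it is not the BookKeeper client API), `encode_write` and `StoreNode` are made-up names, and a Python dict stands in for whatever local store (MySQL, BDB, an LSM tree) a node would really use.

```python
import json


class Ledger:
    """In-memory stand-in for a replicated append-only log (hypothetical, not the BookKeeper API)."""

    def __init__(self):
        self._entries = []

    def append(self, record: bytes) -> int:
        """Append one serialized record and return its entry id."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read_from(self, start: int, limit=None):
        """Return records with entry id >= start (at most `limit` of them), in log order."""
        end = len(self._entries) if limit is None else min(len(self._entries), start + limit)
        return self._entries[start:end]


def encode_write(block_id: int, data: bytes) -> bytes:
    """Serialize a block write op into a log record (JSON chosen only for readability)."""
    return json.dumps({"block": block_id, "data": data.hex()}).encode("utf-8")


class StoreNode:
    """A datastore node that builds its state by replaying the ledger in order."""

    def __init__(self, ledger: Ledger):
        self.ledger = ledger
        self.blocks = {}      # local store; a dict stands in for MySQL/BDB/an LSM tree
        self.next_entry = 0   # id of the first ledger entry not yet applied

    def catch_up(self, limit=None) -> int:
        """Deserialize and apply unread records in log order; return how many were applied."""
        records = self.ledger.read_from(self.next_entry, limit)
        for record in records:
            op = json.loads(record.decode("utf-8"))
            self.blocks[op["block"]] = bytes.fromhex(op["data"])
        self.next_entry += len(records)
        return len(records)

    def read(self, block_id: int):
        """Serve a read from local state only; the result may lag behind the ledger."""
        return self.blocks.get(block_id)


# Two nodes replaying the same ledger end up with the same block contents.
ledger = Ledger()
node_a, node_b = StoreNode(ledger), StoreNode(ledger)
ledger.append(encode_write(7, b"old"))
ledger.append(encode_write(7, b"new"))
node_a.catch_up()
node_b.catch_up()
```

Because every node applies the same records in the same order, nodes that have replayed the same prefix of the ledger hold identical state. What the sketch leaves out (ledger rollover, checkpointing, batching, and what a node does while it is behind) is where most of the real work would be.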
Re: Shared block storage via ZooKepper
Ted Dunning 2011-07-13, 17:55
This would (roughly) work. It will not give very high performance and you will have consistency problems.
RE: Shared block storage via ZooKepper
Simon Felix 2011-07-13, 18:01
Could you explain why that is? What level of performance do you expect? Why would there be consistency problems?
Re: Shared block storage via ZooKepper
Yang 2011-07-13, 18:03
The "high performance" part aside (I would guess that it should match the performance of BookKeeper, which is ~20 kops/sec), why would there be consistency problems? I assume that BK uses the same protocol as described in ZAB.

If you mean that a storage node could be lagging in applying the latest ledger entry, so that the storage node's state could be stale, then yes, but at least that gives us a kind of eventual consistency model.
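The staleness described here can be made concrete with a small sketch. Everything in it is an assumption for illustration: a plain Python list stands in for the ledger (no BookKeeper API is used), and `Replica`, `apply_next`, and `lag` are hypothetical names. It shows a lagging node serving an old value and then converging once it has applied the remaining entries.

```python
class Replica:
    """Storage node that applies entries from a shared, ordered log (a plain list here)."""

    def __init__(self, log):
        self.log = log
        self.state = {}
        self.applied = 0  # number of log entries applied so far

    def apply_next(self, count=1):
        """Apply up to `count` further log entries, in log order."""
        for block_id, data in self.log[self.applied:self.applied + count]:
            self.state[block_id] = data
            self.applied += 1

    def lag(self):
        """Number of log entries this replica has not applied yet."""
        return len(self.log) - self.applied

    def read(self, block_id):
        """Read from local state only, regardless of how far behind the replica is."""
        return self.state.get(block_id)


log = []                        # stand-in for the ledger
fast, slow = Replica(log), Replica(log)

log.append((42, b"v1"))
fast.apply_next()
slow.apply_next()

log.append((42, b"v2"))
fast.apply_next()               # only the fast replica has applied v2

stale = slow.read(42)           # b"v1": stale read from the lagging replica
fresh = fast.read(42)           # b"v2"

slow.apply_next(slow.lag())     # the slow replica catches up
converged = slow.read(42) == fast.read(42)
```

Between the second append and the catch-up, two clients reading block 42 from different nodes see different contents. Whether that window is acceptable depends on the application; a node could use something like `lag()` to refuse or delay reads while it is behind, at the cost of availability.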