-
sanity checking application WALogs make sense
Sukant Hajra 2012-09-15, 05:44
Hi guys,
We've been slowly inching towards using iterators more effectively. The typical use case of indexed docs fits one of our needs, and we wrote a prototype for it.
We've recently realized that iterators are not just read-only, and that we can get more data-local functionality by taking advantage of their ability to mutate data as well. We've only begun to think about how this may assist us. A /lot/ of our critical data accesses are slightly complex, but local to one row. We have billions of entities in our system, so a simple bijection of entities to rows works out really well for us with respect to iterators.
Up to this point, we've had a planned architecture that uses Kestrel for the WALog and a messaging system like Akka for pipelining work. Akka would help us manage flowing work from the user to the log and from the log to orchestrations of Accumulo intra-row reads and writes. The log just helps us get some faster response time without sacrificing too much reliability.
Recently someone asked why we'd use our own WALog when Accumulo has one natively in HDFS. My response has been that Accumulo's WALog works at a lower level of granularity, that of individual mutations. We want reliable orchestrations of mutations. Our orchestrations are idempotent, but we want something along the lines of at-least-once delivery for the entire orchestration. If an iterator goes down mid-processing, I fear Accumulo's native WALog is insufficient to claim we have a reliable enough system.
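To make the at-least-once idea above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: AppLog is a hypothetical in-memory stand-in for an application-level log (it is not the Kestrel or Accumulo API), a row is modelled as a plain dict, and an orchestration is just a list of (column, value) writes. The point it shows is that a log entry is acknowledged only after the whole orchestration succeeds, so a crash partway through leads to a full replay, which is safe because each write is idempotent.

```python
from collections import deque


class AppLog:
    """Hypothetical in-memory stand-in for an application-level write-ahead log."""

    def __init__(self):
        self._pending = deque()

    def append(self, orchestration):
        self._pending.append(orchestration)

    def peek(self):
        # Return the oldest unacknowledged entry without removing it.
        return self._pending[0] if self._pending else None

    def ack(self):
        # Remove the oldest entry; call only after it was fully applied.
        self._pending.popleft()


def apply_orchestration(row, orchestration, fail_after=None):
    """Apply each (column, value) write; optionally crash partway through."""
    for i, (column, value) in enumerate(orchestration):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated crash mid-orchestration")
        row[column] = value  # idempotent: re-setting the same value is harmless


def drain(log, row, crash_first_attempt=False):
    """Replay entries until the log is empty; ack only after full success."""
    attempts = 0
    crash = crash_first_attempt
    while (entry := log.peek()) is not None:
        attempts += 1
        try:
            apply_orchestration(row, entry, fail_after=1 if crash else None)
        except RuntimeError:
            crash = False  # the retry runs without the simulated failure
            continue
        log.ack()
    return attempts


log = AppLog()
log.append([("name", "alice"), ("email", "alice@example.org")])
row = {}
attempts = drain(log, row, crash_first_attempt=True)
```

In this run the first attempt writes one column and then fails, the entry stays in the log, and the second attempt rewrites both columns before the entry is acknowledged.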
I could definitely go through source code to validate this opinion, but I thought I'd bounce this reasoning off the list first.
Also, I'm sure we're not the only people using Accumulo in this way. Please feel free to advise us if anyone's got other ideas for an architecture or feels we're thinking about the problem backwards.
Thanks for your input, Sukant
-
Re: sanity checking application WALogs make sense
William Slacum 2012-09-15, 13:46
I'm a bit confused as to what you mean by "if an iterator goes down mid-processing." If it goes down at all, then whatever scope it's running in, whether minor compaction, major compaction, or scan, will most likely go down as well (unless your iterator eats an exception and ignores errors). A WALog shouldn't be deleted if whatever you were trying to do failed.
-
Re: sanity checking application WALogs make sense
Sukant Hajra 2012-09-15, 18:14
Excerpts from William Slacum's message of 2012-09-15 08:46:17 -0500:
>
> I'm a bit confused as to what you mean "if an iterator goes down
> mid-processing." If it goes down at all, then whatever scope it's running in-
> minor compaction, major compaction and scan- will most likely go down as well
> (unless your iterator eats an exception and ignores errors). A WALog
> shouldn't be deleted if whatever you were trying to do failed.
I believe I've answered my own question after thinking about iterators more and looking at the code for some of the implementations.
I was thinking about iterators "writing" changes to Accumulo using something like a BatchWriter. Now I'm coming to the conclusion that even if that were possible, it is not how iterators were designed, and very likely bad for data integrity. I don't feel that iterators should have any side-effects beyond scanning data through the source provided by the init() method. In this way, I'm beginning to think about iterators more purely functionally. Does that sound right? Or have people come up with iterator implementations with more side-effects?
For instance, in one of my algorithms, authors might write conflicting data to a row that needs to be resolved. I feel I could install iterators at scan, minor compaction, and major compaction to perform this resolution (which happens to be a very simple idempotent operation).
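A rough sketch of why idempotence matters here, in Python rather than against the Accumulo iterator API. The resolution rule is an assumption made up for illustration (latest timestamp wins per column, ties broken by value); the actual rule is not stated in this thread. Entries are modelled as hypothetical (column, timestamp, value) tuples. The sketch checks that resolving already-resolved data changes nothing, and that resolving part of the data early (as a compaction might) and the rest later (as a scan might) gives the same answer as resolving everything at once.

```python
def resolve(entries):
    """Keep one (column, timestamp, value) per column: latest timestamp wins."""
    winners = {}
    for column, timestamp, value in entries:
        current = winners.get(column)
        # Compare (timestamp, value) pairs so ties on timestamp are deterministic.
        if current is None or (timestamp, value) > current:
            winners[column] = (timestamp, value)
    return sorted((c, ts, v) for c, (ts, v) in winners.items())


raw = [("title", 5, "Draft B"), ("title", 3, "Draft A"), ("body", 4, "text")]

once = resolve(raw)
twice = resolve(once)  # applying the resolution again is a no-op

# Resolve a subset first, then resolve the result together with the rest.
partial = resolve(resolve(raw[:2]) + raw[2:])
```

Because the function only reads its input and returns a result, it has no side effects of its own, which matches the purely functional view of iterators described above.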
Sorry if none of this sounds like a concrete question. Some of what I'm looking for is conversation and validation in light of some limited local Accumulo expertise on my team.
Has anyone thought about building up a small IRC community, say on #accumulo on Freenode? There's a nice #hbase channel there, but at this point, I think I'm past the point of asking Bigtable-general questions.
-Sukant
-
Re: sanity checking application WALogs make sense
Billie Rinaldi 2012-09-17, 19:01
On Sat, Sep 15, 2012 at 11:14 AM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> Excerpts from William Slacum's message of 2012-09-15 08:46:17 -0500:
> >
> > I'm a bit confused as to what you mean "if an iterator goes down
> > mid-processing." If it goes down at all, then whatever scope it's
> > running in- minor compaction, major compaction and scan- will most
> > likely go down as well (unless your iterator eats an exception and
> > ignores errors). A WALog shouldn't be deleted if whatever you were
> > trying to do failed.
>
> I believe I've answered my own question after thinking about iterators
> more and looking at the code for some of the implementations.
>
> I was thinking about iterators "writing" changes to Accumulo using
> something like a BatchWriter. Now I'm coming to the conclusion that even
> if that were possible, it is not how iterators were designed, and very
> likely bad for data integrity. I don't feel that iterators should have
> any side-effects beyond scanning data through the source provided by the
> init() method. In this way, I'm beginning to think about iterators more
> purely functionally. Does that sound right? Or have people come up with
> iterator implementations with more side-effects?

Your conclusion is correct: we did not really intend for iterators to read or write outside of a single tablet.

> For instance, in one of my algorithms, authors might write conflicting
> data to a row that needs to be resolved. I feel I could install
> iterators at scan, minor compaction, and major compaction to perform
> this resolution (which happens to be a very simple idempotent operation).
>
> Sorry if none of this sounds like a concrete question. Some of what I'm
> looking for is conversation and validation in light of some limited local
> Accumulo expertise on my team.
>
> Has anyone thought about building up a small IRC community, say on
> #accumulo on Freenode? There's a nice #hbase channel there, but at this
> point, I think I'm past the point of asking Bigtable-general questions.

We have recently started using #accumulo on freenode. Feel free to join us there!
Billie