Accumulo >> mail # user >> Iterators - updating other rows


Peter Tillotson 2013-07-15, 10:38
Keith Turner 2013-07-15, 12:49
Re: Iterators - updating other rows
Reading the paper and looking at your implementation, this is certainly in the ballpark of what I am striving for. The way I think of it, each "spreadsheet cell" should look after itself; some of the older literature calls this a data flow architecture.

My current implementation uses iterators, and my data is split over several column qualifiers, which I know will be processed in order. Actions on the current column depend on the state of previous columns. What I'm trying to avoid are disk seeks: if I can fold updates in during compaction, I can reduce wasted operations.
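
A minimal sketch of that order-dependent fold, written in plain Python rather than against the Accumulo iterator API. The tuple layout (row, family, qualifier, value), the name `fold_columns`, and the rule that a "000" care-of column rewrites the values of later columns in the same group are all assumptions made for illustration; the real iterator logic may differ.

```python
from itertools import groupby

# Assumption: the care-of qualifier sorts before every other qualifier in a group.
CARE_OF = "000"

def fold_columns(entries):
    """Walk entries in sorted key order, carrying state across qualifiers.

    Each entry is a hypothetical (row, family, qualifier, value) tuple.
    If a care-of column is seen first in a (row, family) group, later
    columns in that group are rewritten to point at the redirect target.
    """
    ordered = sorted(entries)  # stand-in for Accumulo's sorted key order
    for (row, family), group in groupby(ordered, key=lambda e: (e[0], e[1])):
        redirect = None  # state carried over from earlier qualifiers
        for _, _, qualifier, value in group:
            if qualifier == CARE_OF:
                redirect = value
                yield (row, family, qualifier, value)
            elif redirect is not None:
                yield (row, family, qualifier, redirect)
            else:
                yield (row, family, qualifier, value)

entries = [
    ("nm", "key1", "type", "valueId1"),
    ("nm", "key1", "000", "newValueId2"),
    ("nm", "key2", "type", "valueId1"),
]
result = list(fold_columns(entries))
```

The fold only touches entries in the group it is already reading, so it needs no extra seek and no write to another row or table.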

I've effectively got context for the Observer. 

Tnx for the deadlock scenarios - I'm pretty certain it is Situation 1.

Peter
________________________________
 From: Keith Turner <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Peter Tillotson <[EMAIL PROTECTED]>
Sent: Monday, 15 July 2013, 13:49
Subject: Re: Iterators - updating other rows
 

On Mon, Jul 15, 2013 at 6:38 AM, Peter Tillotson <[EMAIL PROTECTED]> wrote:

I've got two tables of dependent data, which I was hoping to update efficiently during compaction. This leads to the following requirements:
>  - Changes to other rows
>  - Changes in other tables
>
>
>I've fought with iterators and embedding writers, but have had to fall back to map reduce jobs to complete the update. 
>
>
>Is there a recommended approach to this?

Writing to Accumulo from an iterator can lead to deadlock.  I can think of at least the following two situations, but there are probably more.

Situation 1 

 1. Memory is full on tablet server 1 and writes are held
 2. Tablet X is on Tserver 1 and is scheduled for compaction to free memory
 3. Tablet X tries to write to Tablet server 1, but the writes block because memory is full (deadlock)
 4. No other tablet on Tserver 1 can be written to because memory is full and cannot be flushed, so the problem snowballs

Situation 2

 1. Tserver 2 is hosting Tablet Y & Z
 2. Tablet Y & Z have data in memory
 3. Tserver 2 dies
 4. Tserver 3 loads Tablet Y, recovers its data, and tries to compact
 5. Tablet Y tries to write to Tablet Z during compaction
 6. Tserver 4 loads Tablet Z, recovers its data, and tries to compact
 7. Tablet Z tries to write to Tablet Y during compaction
 8. Tablets Y & Z are not loaded yet, but each is trying to write to the other (deadlock)
 9. Tablet servers 3 and 4 cannot load any more tablets, because their load threads are both stuck, so the problem snowballs
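
Both situations can be read as cycles in a wait-for graph. The following is a minimal sketch in plain Python, not Accumulo code; the node labels and the `find_cycle` helper are hypothetical names chosen for illustration, and the edges are one reading of the steps above.

```python
def find_cycle(waits_for):
    """Return a list of nodes forming a cycle in a wait-for graph, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Found a back edge: slice out the cycle and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None

# Situation 1: the compaction that would free memory is itself waiting on memory.
situation_1 = {
    "compaction of X": ["write from X's iterator"],
    "write from X's iterator": ["free memory on tserver 1"],
    "free memory on tserver 1": ["compaction of X"],
}

# Situation 2: each tablet load waits on a write to the other, unloaded tablet.
situation_2 = {
    "load of Y": ["write to Z"],
    "write to Z": ["load of Z"],
    "load of Z": ["write to Y"],
    "write to Y": ["load of Y"],
}

# Same memory pressure, but the iterator does not write: no cycle.
no_iterator_writes = {
    "compaction of X": [],
    "free memory on tserver 1": ["compaction of X"],
}

cycle_1 = find_cycle(situation_1)
cycle_2 = find_cycle(situation_2)
cycle_none = find_cycle(no_iterator_writes)
```

In this model the cycle disappears as soon as the compaction no longer depends on a write, which is why moving the writes out of the iterator avoids both deadlocks.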

I am currently working on an implementation of Percolator[1].  It is not something you can use now, but I am curious whether you could use Percolator to solve your problem.  I am very interested in feedback on this project while it's in its formative stages.  I hope to have it finished w/ Accumulo 1.6.0.

[1]: https://github.com/keith-turner/Accismus
>
>A bit more detail about the algorithm.
>
>
>I've two tables with different sort orders, and I use ngram row ids to group elements and split them over multiple tablets, so:
>
>
>Table1
>nm: key1: 000: newValueId2
>nm: key2: type: valueId1
>nm: key3: type: valueId1
>
>
>Table2
>ab: valueId1: 001: blob
>ab: valueId1:key2: nm
>..
>..
>    
>Multiple keys point to the same value in the other table, but both keys and values are liable to change. What I was trying to do was use special columns (column qualifier 000 above), which I call care-of columns, to do redirects as data changes in real time. With iterators this would become eventually consistent and be very efficient, whereas a MapReduce approach requires multiple scans of each large table. I like the approach because the ngram splits and groups the data, and the two different sort orders give me different, useful query characteristics.
>
>
>For some reason the embedded writers were blocking - I may retry with a larger cluster. I fought with it for a few days then resorted to MapReduce jobs until I get a chance to look at the Accumulo code more closely. 
>
>
>Would it be easy to add a special iterator that accepts (Text, Mutation) pairs, much as the AccumuloOutputFormat does?