Accumulo >> mail # user >> Iterators - updating other rows


Re: Iterators - updating other rows
On Mon, Jul 15, 2013 at 6:38 AM, Peter Tillotson <[EMAIL PROTECTED]> wrote:

> I've got two tables of dependent data, which I was hoping to update
> efficiently during compaction. This leads to the following requirements:
>   - Changes to other rows
>   - Changes in other tables
>
> I've fought with iterators and embedding writers, but have had to fall
> back to map reduce jobs to complete the update.
>
> Is there a recommended approach to this?
>

Writing to Accumulo from an iterator can lead to deadlock.  I can think of
at least the following two situations, but there are probably more.

Situation 1

 1. Memory is full on Tserver 1 and writes are held
 2. Tablet X is on Tserver 1 and is scheduled for compaction to free memory
 3. Tablet X tries to write to Tserver 1, but the writes block because
    memory is full (deadlock)
 4. No other tablet on Tserver 1 can be written to, because memory is full
    and cannot be flushed, so the problem snowballs

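Situation 1 can be reproduced in miniature with only the JDK. The sketch below is not Accumulo code: the class HeldWritesDemo, its methods, and the capacity of 2 are all made up for illustration. A semaphore stands in for tablet server memory, and a timeout stands in for what would be an indefinite block in a real server.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of Situation 1. None of these names are Accumulo APIs.
class HeldWritesDemo {
    private final int capacity;
    // Each permit stands for one unit of free in-memory write buffer.
    private final Semaphore memory;

    HeldWritesDemo(int capacity) {
        this.capacity = capacity;
        this.memory = new Semaphore(capacity);
    }

    // A write takes one unit of memory, or gives up after the timeout.
    boolean write(long timeoutMillis) {
        try {
            return memory.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // A plain compaction: flushes everything held in memory.
    void compact() {
        memory.release(capacity - memory.availablePermits());
    }

    // A compaction whose iterator writes back to the same server.
    // It can only free memory after its own write has gone through.
    boolean compactWithEmbeddedWrite(long timeoutMillis) {
        if (!write(timeoutMillis)) {
            return false; // a real server would stay blocked here
        }
        compact();
        return true;
    }

    public static void main(String[] args) {
        HeldWritesDemo tserver = new HeldWritesDemo(2);
        tserver.write(10);
        tserver.write(10); // memory is now full, further writes are held

        System.out.println("compaction with embedded write finished: "
                + tserver.compactWithEmbeddedWrite(100));

        tserver.compact();
        System.out.println("write after plain compaction succeeded: "
                + tserver.write(10));
    }
}
```

The compaction that writes never completes, because the only thing that could free memory is the compaction itself. The plain compaction has no such dependency.
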
Situation 2

 1. Tserver 2 is hosting Tablet Y & Z
 2. Tablet Y & Z have data in memory
 3. Tserver 2 dies
 4. Tserver 3 loads Tablet Y, recovers its data, and tries to compact
 5. Tablet Y tries to write to Tablet Z during compaction
 6. Tserver 4 loads Tablet Z, recovers its data, and tries to compact
 7. Tablet Z tries to write to Tablet Y during compaction
 8. Tablets Y & Z are not loaded yet, but each is trying to write to the
    other (deadlock)
 9. Tservers 3 and 4 cannot load any more tablets, because their load
    threads are both stuck, so the problem snowballs
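
Situation 2 is a circular wait, which can be pictured as a wait-for graph. The sketch below is only an illustration: WaitForGraph and its methods are hypothetical names, not part of Accumulo, and it assumes each tablet waits on at most one other tablet.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy wait-for graph for Situation 2. Not Accumulo code.
class WaitForGraph {
    // waiter -> the tablet that must be online before the waiter can finish loading
    private final Map<String, String> waitsOn = new HashMap<>();

    void addWait(String waiter, String target) {
        waitsOn.put(waiter, target);
    }

    // Follows the chain of waits from 'start'. Seeing a tablet twice means a cycle.
    boolean isDeadlocked(String start) {
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null) {
            if (!seen.add(current)) {
                return true;
            }
            current = waitsOn.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        WaitForGraph graph = new WaitForGraph();

        graph.addWait("Y", "Z"); // step 5: Y's compaction writes to Z
        System.out.println("after step 5, deadlocked: " + graph.isDeadlocked("Y"));

        graph.addWait("Z", "Y"); // step 7: Z's compaction writes to Y
        System.out.println("after step 7, deadlocked: " + graph.isDeadlocked("Y"));
    }
}
```

After step 5 alone the chain ends at Z, so there is no cycle yet. Step 7 closes the loop, and neither load can finish.
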

I am currently working on an implementation of Percolator[1].  It is not
something you can use now, but I am curious whether Percolator could
solve your problem.  I am very interested in feedback on this project while
it is in its formative stages.  I hope to have it finished with Accumulo 1.6.0.

[1]: https://github.com/keith-turner/Accismus

> A bit more detail about the algorithm.
>
> I've two tables with different sort orders, and I use ngram row ids to
> group elements and split them over multiple tablets, so:
>
> Table1
> nm: key1: 000: newValueId2
> nm: key2: type: valueId1
> nm: key3: type: valueId1
>
> Table2
> ab: valueId1: 001: blob
> ab: valueId1: key2: nm
> ..
> ..
>
> Multiple keys point to the same value in the other table, but both keys and
> values are liable to change. What I was trying to do was use special
> columns (column qualifier 000 above), which I call care-of, to do redirects
> as data changes in real time. With iterators this would become eventually
> consistent and be very efficient, but a MapReduce approach requires
> multiple scans of each large table. I like the approach because the
> ngram splits and groups the data, and the two different sort orders give me
> different, useful query characteristics.
>
> For some reason the embedded writers were blocking - I may retry with a
> larger cluster. I fought with it for a few days then resorted to MapReduce
> jobs until I get a chance to look at the Accumulo code more closely.
>
> Would it be easy to add a special iterator that accepts (Text, Mutation)
> pairs, much as the AccumuloOutputFormat does?
>
> Many thanks in advance
>
> Peter.
>