Accumulo, mail # user - schema examples


Re: schema examples
Josh Elser 2013-12-29, 23:51
Arshak,

Yes and no. Accumulo Combiners help a bit here.

For servicing inserts and deletes (treating an update as the combination
of the two), both models work, although a serialized list is a little
trickier to manage (as most optimizations end up being).

You will most likely want to have a Combiner set on your inverted index
for the purposes of aggregating multiple inserts together into a single
Key-Value. This happens naturally at scan time for you (by virtue of the
combiner) and then gets persisted to disk in a merged form during a major
compaction. The same logic can be applied to deletions. Keeping a sorted
list of IDs in your serialized structure makes this algorithm pretty
easy. One caveat to note is that Accumulo won't always compact *every*
file in a tablet, so deletions may need to be persisted in that
serialized structure to ensure that the deletion persists (we can go
more into that later as I assume that isn't clear).
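
A minimal sketch of that merge logic, in plain Python rather than against
the Accumulo Combiner API. The Posting structure, the combine function and
the full_compaction flag are hypothetical names for illustration; it also
assumes values arrive oldest first and that no single value both inserts
and deletes the same ID.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Posting:
    """Hypothetical serialized index value: sorted live IDs plus sorted deleted IDs."""
    ids: Tuple[str, ...] = ()
    deleted: Tuple[str, ...] = ()


def combine(values_oldest_first: Iterable[Posting],
            full_compaction: bool = False) -> Posting:
    """Merge partial values for one index key into a single value.

    Deleted IDs are kept as tombstones unless every file is known to be
    part of the merge (full_compaction=True), because an older insert may
    still live in a file that was left out of this compaction.
    """
    live, dead = set(), set()
    for value in values_oldest_first:
        # A newer delete removes an older insert, and a newer insert
        # cancels an older delete.
        live = (live - set(value.deleted)) | set(value.ids)
        dead = (dead - set(value.ids)) | set(value.deleted)
    if full_compaction:
        # Nothing older can exist, so the tombstones can be dropped.
        dead = set()
    return Posting(tuple(sorted(live)), tuple(sorted(dead)))


older = Posting(ids=("doc1", "doc3"))
newer = Posting(ids=("doc2",), deleted=("doc3",))

# Compacting only the newer file keeps the tombstone for doc3 ...
partial = combine([newer])
# ... so a later merge with the older file still hides doc3.
merged = combine([older, partial])
```

Here merged holds doc1 and doc2 as live IDs and still carries doc3 as a
deleted ID; passing full_compaction=True to the final merge would drop it.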

Speaking loosely about D4M, as I haven't seen how its code uses
Accumulo: both models should be able to ensure referential integrity, so
both should be capable of servicing the same use cases. While keeping a
serialized list is a bit more work in your code, it should yield
performance gains.

On 12/29/2013 5:45 PM, Arshak Navruzyan wrote:
> Josh, I am still a little stuck on the idea of how this would work in a
> transactional app? (aka mixed workload of reads and writes).
>
> I definitely see the power of using a serialized structure in order to
> minimize the number of records but what will happen when rows get
> deleted out of the main table (or mutated)?   In the bloated model I
> could see some referential integrity code zapping the index entries as
> well.  In the serialized structure design it seems pretty complex to go
> and update every array that referenced that row.
>
> Is it fair to say that the D4M approach is a little better suited for
> transactional apps and the wikisearch approach is better for
> read-optimized index apps?
>
>
> On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
>
>     Some context here in regards to the wikisearch:
>
>     The point of the protocol buffers here (or any serialized structure
>     in the Value) is to reduce the ingest pressure and increase query
>     performance on the inverted index (or transpose table, if I follow
>     the d4m phrasing).
>
>     This works well because most languages (especially English) follow a
>     Zipfian distribution: some terms appear very frequently while some
>     occur very infrequently. For common terms, we don't want to bloat
>     our index, nor spend time creating those index records (e.g. "the").
>     For uncommon terms, we still want direct access to these infrequent
>     words (e.g. "supercalifragilisticexpialidocious").
>
>     The ingest effect is also rather interesting when dealing with
>     Accumulo as you're not just writing more data, but typically writing
>     data to most (if not all) tservers. Even the tokenization of a
>     single document is likely to create inserts to a majority of the
>     tablets for your inverted index. When dealing with high ingest rates
>     (live *or* bulk -- you still have to send data to these servers),
>     minimizing the number of records becomes important, as the record
>     count may be a bottleneck in your pipeline.
>
>     The query implications are pretty straightforward: common terms
>     don't bloat the index in size nor affect uncommon term lookups and
>     those uncommon term lookups remain specific to documents rather than
>     a range (shard) of documents.
>
>
>     On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
>
>         Sorry I mixed things up.  It was in the wikisearch example:
>
>         http://accumulo.apache.org/example/wikisearch.html
>
>         "If the cardinality is small enough, it will track the set of