Accumulo >> mail # user >> schema examples


Re: schema examples
Arshak,

Yes and no. Accumulo Combiners help a bit here.

For servicing inserts and deletes (treating an update as a delete
followed by an insert), both models work, although a serialized list is
a little trickier to manage (as most optimizations end up being).

You will most likely want to set a Combiner on your inverted index
to aggregate multiple inserts into a single Key-Value. This happens
naturally at scan time for you (by virtue of the combiner) and is then
persisted to disk in merged form during a major compaction. The same
logic can be applied to deletions. Keeping a sorted list of IDs in your
serialized structure makes this algorithm pretty easy. One caveat to
note is that Accumulo won't always compact *every* file in a tablet, so
deletion markers may need to be kept in that serialized structure to
ensure that the deletion persists (we can go more into that later as I
assume that isn't clear).
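
To make the idea concrete, here is a minimal sketch of the merge such a
Combiner would perform. The class and field names are illustrative, not
Accumulo's API or the wikisearch code: each partial Value is assumed to
carry a sorted set of inserted doc IDs plus deletion markers, and the
reduce step unions the inserts, then drops anything marked deleted,
keeping the markers themselves so a deletion survives a compaction that
didn't include every file.

```java
import java.util.Iterator;
import java.util.TreeSet;

// Sketch of the combine step for one inverted-index entry.
// Each partial value carries doc IDs inserted and doc IDs deleted
// since the last merge; combining unions the inserts and then drops
// any ID that some partial marked as deleted.
public class DocListCombiner {

    // One decoded Value: inserted IDs and deletion markers (illustrative).
    public static class DocList {
        final TreeSet<Long> inserted = new TreeSet<>();
        final TreeSet<Long> deleted = new TreeSet<>();
        DocList insert(long id) { inserted.add(id); return this; }
        DocList delete(long id) { deleted.add(id); return this; }
    }

    // Analogue of a Combiner's reduce(): fold partial lists into one.
    public static DocList reduce(Iterator<DocList> partials) {
        DocList out = new DocList();
        while (partials.hasNext()) {
            DocList p = partials.next();
            out.inserted.addAll(p.inserted);
            out.deleted.addAll(p.deleted);
        }
        // A deletion wins over any insert of the same ID; the marker
        // itself is retained so it persists past partial compactions.
        out.inserted.removeAll(out.deleted);
        return out;
    }
}
```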

Speaking loosely for D4M, as I haven't seen how its code uses Accumulo:
both models should ensure referential integrity, so both should be
capable of servicing the same use cases. While keeping a serialized
list is a bit more work in your code, you should see performance gains
from this approach.
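
A toy comparison of the two layouts under discussion (the row formats
here are made up for illustration; they are not D4M's or the
wikisearch's actual schema): the "bloated" model writes one index
record per (term, document) pair, while the serialized model writes a
single record per term whose value holds the whole document list.

```java
import java.util.ArrayList;
import java.util.List;

// Toy comparison of the two inverted-index layouts for one term.
public class IndexLayouts {

    // Bloated model: one index record per (term, docId) pair.
    public static List<String> bloated(String term, List<String> docIds) {
        List<String> rows = new ArrayList<>();
        for (String id : docIds)
            rows.add(term + " -> " + id);
        return rows;
    }

    // Serialized model: one record whose value is the doc list;
    // a combiner merges concurrent writes server-side.
    public static List<String> serialized(String term, List<String> docIds) {
        return List.of(term + " -> " + String.join(",", docIds));
    }
}
```

For a term appearing in N documents, the bloated model costs N mutations
against the index while the serialized model costs one, which is where
the ingest savings come from.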

On 12/29/2013 5:45 PM, Arshak Navruzyan wrote:
> Josh, I am still a little stuck on how this would work in a
> transactional app (i.e., a mixed workload of reads and writes).
>
> I definitely see the power of using a serialized structure in order to
> minimize the number of records but what will happen when rows get
> deleted out of the main table (or mutated)?   In the bloated model I
> could see some referential integrity code zapping the index entries as
> well.  In the serialized structure design it seems pretty complex to go
> and update every array that referenced that row.
>
> Is it fair to say that the D4M approach is a little better suited for
> transactional apps and the wikisearch approach is better for
> read-optimized index apps?
>
>
> On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
>
>     Some context here in regards to the wikisearch:
>
>     The point of the protocol buffers here (or any serialized structure
>     in the Value) is to reduce the ingest pressure and increase query
>     performance on the inverted index (or transpose table, if I follow
>     the d4m phrasing).
>
>     This works well because most languages (especially English) follow a
>     Zipfian distribution: some terms appear very frequently while some
>     occur very infrequently. For common terms, we don't want to bloat
>     our index, nor spend time creating those index records (e.g. "the").
>     For uncommon terms, we still want direct access to these infrequent
>     words (e.g. "supercalifragilisticexpialidocious").
>
>     The ingest effect is also rather interesting when dealing with
>     Accumulo, as you're not just writing more data, but typically writing
>     data to most (if not all) tservers. Even the tokenization of a
>     single document is likely to create inserts to a majority of the
>     tablets for your inverted index. When dealing with high ingest rates
>     (live *or* bulk -- you still have to send data to these servers),
>     minimizing the number of records becomes important, as it may be a
>     bottleneck in your pipeline.
>
>     The query implications are pretty straightforward: common terms
>     don't bloat the index in size nor affect uncommon term lookups and
>     those uncommon term lookups remain specific to documents rather than
>     a range (shard) of documents.
>
>
>     On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
>
>         Sorry I mixed things up.  It was in the wikisearch example:
>
>         http://accumulo.apache.org/example/wikisearch.html
>
>         "If the cardinality is small enough, it will track the set of