Accumulo >> mail # user >> schema examples


Re: schema examples
I would be reluctant to make generalizations.

On Sun, Dec 29, 2013 at 05:45:28PM -0500, Arshak Navruzyan wrote:
>    Josh, I am still a little stuck on the idea of how this would work in a
>    transactional app? (aka mixed workload of reads and writes).
>    I definitely see the power of using a serialized structure in order to
>    minimize the number of records but what will happen when rows get deleted
>    out of the main table (or mutated)?  In the bloated model I could see
>    some referential integrity code zapping the index entries as well.  In the
>    serialized structure design it seems pretty complex to go and update every
>    array that referenced that row.
>    Is it fair to say that the D4M approach is a little better suited for
>    transactional apps and the wikisearch approach is better for
>    read-optimized index apps?
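[Editor's note: the read-modify-write cost the question alludes to can be sketched as follows. This is a hypothetical illustration only -- plain Python containers stand in for the index tables, and all names are invented.]

```python
def delete_doc_exploded(index, doc_id, terms):
    """Exploded index: each (term, doc_id) pair is its own record,
    so a delete is a direct key removal -- no reads needed."""
    for term in terms:
        index.discard((term, doc_id))

def delete_doc_packed(index, doc_id, terms):
    """Packed index: every term's serialized doc list must be read,
    rewritten without doc_id, and written back (read-modify-write)."""
    for term in terms:
        remaining = [d for d in index[term] if d != doc_id]
        if remaining:
            index[term] = remaining
        else:
            del index[term]

# Toy state: doc1 contains "the" and "fox"; doc2 contains "the".
exploded = {("the", "doc1"), ("fox", "doc1"), ("the", "doc2")}
packed = {"the": ["doc1", "doc2"], "fox": ["doc1"]}

delete_doc_exploded(exploded, "doc1", ["the", "fox"])
delete_doc_packed(packed, "doc1", ["the", "fox"])

print(exploded)  # {('the', 'doc2')}
print(packed)    # {'the': ['doc2']}
```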
>
>    On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[1][EMAIL PROTECTED]>
>    wrote:
>
>      Some context here in regards to the wikisearch:
>
>      The point of the protocol buffers here (or any serialized structure in
>      the Value) is to reduce the ingest pressure and increase query
>      performance on the inverted index (or transpose table, if I follow the
>      d4m phrasing).
>
>      This works well because most languages (especially English) follow a
>      Zipfian distribution: some terms appear very frequently while some occur
>      very infrequently. For common terms, we don't want to bloat our index,
>      nor spend time creating those index records (e.g. "the"). For uncommon
>      terms, we still want direct access to these infrequent words (e.g.
>      "supercalifragilisticexpialidocious").
>
>      The ingest effect is also rather interesting when dealing with Accumulo,
>      as you're not just writing more data, but typically writing data to most
>      (if not all) tservers. Even the tokenization of a single document is
>      likely to create inserts to a majority of the tablets for your inverted
>      index. When dealing with high ingest rates (live *or* bulk -- you still
>      have to send data to these servers), minimizing the number of records
>      becomes important, as record creation may be a bottleneck in your
>      pipeline.
>
>      The query implications are pretty straightforward: common terms don't
>      bloat the index in size nor affect uncommon term lookups and those
>      uncommon term lookups remain specific to documents rather than a range
>      (shard) of documents.
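[Editor's note: to make the record-count savings concrete, here is a small sketch -- plain Python dicts stand in for Accumulo tables, and the corpus and names are invented -- contrasting one index entry per (term, document) pair with a single packed Value per term.]

```python
from collections import defaultdict

# Toy corpus; in the wikisearch these would be tokenized documents.
docs = {
    "doc1": ["the", "quick", "fox"],
    "doc2": ["the", "lazy", "dog"],
    "doc3": ["the", "fox", "jumps"],
}

# "Bloated" layout: one index record per (term, doc id) pair.
exploded = {(term, doc_id) for doc_id, terms in docs.items() for term in terms}

# Serialized layout: one record per term; the Value holds all doc ids
# (a stand-in for a protocol-buffer-encoded list).
packed = defaultdict(list)
for doc_id, terms in sorted(docs.items()):
    for term in terms:
        packed[term].append(doc_id)

print(len(exploded))  # 9 records to write
print(len(packed))    # 6 records -- "the" collapses from 3 entries to 1
```

Common terms shrink the most: the more Zipfian the term distribution, the bigger the gap between the two counts.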
>
>      On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
>
>        Sorry I mixed things up.  It was in the wikisearch example:
>
>        [2]http://accumulo.apache.org/example/wikisearch.html
>
>        "If the cardinality is small enough, it will track the set of
>        documents by term directly."
>
>        On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL
>        <[3][EMAIL PROTECTED] <mailto:[4][EMAIL PROTECTED]>> wrote:
>
>        Hi Arshak,
>            See interspersed below.
>        Regards.  -Jeremy
>
>        � � On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan
>        <[5][EMAIL PROTECTED]
>        � � <mailto:[6][EMAIL PROTECTED]>> wrote:
>
>          Jeremy,
>
>          Thanks for the detailed explanation.  Just a couple of final
>          questions:
>
>          1.  What's your advice on the transpose table as far as whether to
>          repeat the indexed term (one per matching row id) or try to store
>          all matching row ids from tedge in a single row in tedgetranspose
>          (using protobuf for example).  What's the performance implication
>          of each approach?  In the paper you mentioned that if it's a few
>          values they should just be stored together.  Was there a cut-off
>          point in your testing?
>
>        Can you clarify?  I am not sure what you're asking.
>
>          2.  You mentioned that the degrees should be calculated