schema examples (Accumulo user mailing list)


Arshak Navruzyan 2013-12-26, 20:10
Jeremy Kepner 2013-12-27, 01:33
Jeremy Kepner 2013-12-27, 01:31
Arshak Navruzyan 2013-12-28, 01:01
Kepner, Jeremy - 0553 - MITLL 2013-12-28, 18:36
Arshak Navruzyan 2013-12-29, 16:34
Kepner, Jeremy - 0553 - MITLL 2013-12-29, 16:42
Arshak Navruzyan 2013-12-29, 16:57
Kepner, Jeremy - 0553 - MITLL 2013-12-29, 17:12
Arshak Navruzyan 2013-12-29, 20:10
Josh Elser 2013-12-29, 20:27
Arshak Navruzyan 2013-12-29, 22:45
Josh Elser 2013-12-29, 23:51
Re: schema examples
I would be reluctant to make generalizations.

On Sun, Dec 29, 2013 at 05:45:28PM -0500, Arshak Navruzyan wrote:
>    Josh, I am still a little stuck on how this would work in a
>    transactional app (i.e., a mixed workload of reads and writes).
>    I definitely see the power of using a serialized structure to
>    minimize the number of records, but what happens when rows get deleted
>    from the main table (or mutated)? In the bloated model I could see
>    some referential integrity code zapping the index entries as well. In the
>    serialized structure design it seems pretty complex to go and update every
>    array that referenced that row.
>    Is it fair to say that the D4M approach is a little better suited for
>    transactional apps and the wikisearch approach is better for
>    read-optimized index apps?
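
To make the tradeoff concrete: in the one-entry-per-row-id layout, zapping
the index entries for a deleted row is just a delete marker per term, with no
read-modify-write. A minimal sketch (Java, Accumulo client API; the table and
column names here are invented for illustration):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;

    // Remove a deleted document from an index that stores one entry per
    // (term, docId): write a delete marker per term, no reads required.
    void unindex(Connector conn, String docId, Iterable<String> terms)
        throws Exception {
      BatchWriter bw =
          conn.createBatchWriter("tedgetranspose", new BatchWriterConfig());
      for (String term : terms) {
        Mutation m = new Mutation(term);  // row = indexed term
        m.putDelete("doc", docId);        // one cell per matching doc
        bw.addMutation(m);
      }
      bw.close();
    }

In the serialized-structure layout the same cleanup turns into a read of each
term's packed value, removal of docId from it, and a write back, which is the
complexity you're describing.
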
>
>    On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[EMAIL PROTECTED]>
>    wrote:
>
>      Some context here regarding the wikisearch:
>
>      The point of the protocol buffers here (or any serialized structure in
>      the Value) is to reduce the ingest pressure and increase query
>      performance on the inverted index (or transpose table, if I follow the
>      D4M phrasing).
>
>      This works well because most languages (especially English) follow a
>      Zipfian distribution: some terms appear very frequently while others
>      occur very infrequently. For common terms, we don't want to bloat our
>      index, nor spend time creating those index records (e.g. "the"). For
>      uncommon terms, we still want direct access to those infrequent words
>      (e.g. "supercalifragilisticexpialidocious").
>
>      The ingest effect is also rather interesting when dealing with Accumulo,
>      as you're not just writing more data, but typically writing data to most
>      (if not all) tservers. Even the tokenization of a single document is
>      likely to create inserts to a majority of the tablets for your inverted
>      index. When dealing with high ingest rates (live *or* bulk -- you still
>      have to send data to these servers), minimizing the number of records
>      becomes important, as the record count may be a bottleneck in your
>      pipeline.
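
To see that fan-out in code: indexing a single document makes one mutation
per distinct term, and because the index rows are terms, those mutations sort
across essentially the whole table (sketch only; the column layout is
invented):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    // One mutation per distinct term: a single document's index writes
    // land on most tablets, since tablets split on row (term) ranges.
    void indexDocument(BatchWriter bw, String docId, String text)
        throws MutationsRejectedException {
      Set<String> terms =
          new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
      for (String term : terms) {
        if (term.isEmpty()) continue;
        Mutation m = new Mutation(term);  // row = term
        m.put("doc", docId, new Value(new byte[0]));
        bw.addMutation(m);
      }
    }
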
>
>      The query implications are pretty straightforward: common terms don't
>      bloat the index in size nor affect uncommon term lookups, and those
>      uncommon term lookups remain specific to documents rather than to a
>      range (shard) of documents.
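
In code, an uncommon-term lookup stays a single-row scan (same invented table
name as in the sketches above):

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    // Exact-row scan on the term: returns the matching doc ids directly,
    // never touching a whole shard range of documents.
    void lookup(Connector conn) throws Exception {
      Scanner s = conn.createScanner("tedgetranspose", Authorizations.EMPTY);
      s.setRange(Range.exact("supercalifragilisticexpialidocious"));
      for (Map.Entry<Key,Value> e : s) {
        String docId = e.getKey().getColumnQualifier().toString();
        // fetch docId from the main table as needed
      }
      s.close();
    }
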
>
>      On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
>
>        Sorry I mixed things up. It was in the wikisearch example:
>
>        http://accumulo.apache.org/example/wikisearch.html
>
>        "If the cardinality is small enough, it will track the set of
>        documents by term directly."
>
>        On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL
>        <[EMAIL PROTECTED]> wrote:
>
>        Hi Arshak,
>           See interspersed below.
>        Regards. -Jeremy
>
>        On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan
>        <[EMAIL PROTECTED]> wrote:
>
>          Jeremy,
>
>          Thanks for the detailed explanation. Just a couple of final
>          questions:
>
>          1. What's your advice on the transpose table: is it better to
>          repeat the indexed term (one entry per matching row id) or to
>          store all matching row ids from tedge in a single row in
>          tedgetranspose (using protobuf, for example)? What's the
>          performance implication of each approach? In the paper you
>          mentioned that if it's a few values they should just be stored
>          together. Was there a cut-off point in your testing?
>
>        Can you clarify? I am not sure what you're asking.
>
>          2. You mentioned that the degrees should be calculated
Dylan Hutchison 2013-12-28, 05:53
Josh Elser 2013-12-28, 15:52
Arshak Navruzyan 2013-12-28, 18:25