Accumulo >> mail # user >> scanner question in regards to columns loaded


Re: scanner question in regards to columns loaded
One small addendum to Christopher's explanation:

If you are using the IsolatedScanner[1], then the entire row will be
buffered on the client side. If you have configured a table to use the
WholeRowIterator[2] in order to gain isolation guarantees while using,
e.g., a BatchScanner for performance reasons, then that buffering
instead happens on the tablet servers.

Note that neither of these things are configured for use by default.
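The client-side buffering described above can be pictured without a live cluster. This is a stdlib-only sketch, not the real Accumulo API (the class and entry representation here are illustrative): it walks a sorted entry stream and holds exactly one row's entries in memory before releasing them, which is roughly what IsolatedScanner's client-side row buffer does.

```java
import java.util.*;

public class RowBufferSketch {
    // Groups a sorted entry stream one row at a time, mimicking the
    // client-side row buffering an IsolatedScanner performs. Each entry
    // is a String[] of {row, column, value}.
    public static List<List<String[]>> bufferByRow(List<String[]> sortedEntries) {
        List<List<String[]>> rows = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        String currentRow = null;
        for (String[] e : sortedEntries) {
            if (currentRow != null && !currentRow.equals(e[0])) {
                rows.add(current);          // row complete: release the buffer
                current = new ArrayList<>();
            }
            currentRow = e[0];
            current.add(e);                 // the whole row is held client-side
        }
        if (!current.isEmpty()) rows.add(current);
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[]{"apple",  "docs:d1", "1"},
            new String[]{"apple",  "docs:d2", "1"},
            new String[]{"banana", "docs:d3", "1"});
        List<List<String[]>> rows = bufferByRow(entries);
        System.out.println(rows.size() + " rows; first has "
                           + rows.get(0).size() + " entries");
    }
}
```

The memory cost is proportional to the largest row, which is why the WholeRowIterator variant pushes the same buffering onto the tablet servers instead.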

[1]:

http://accumulo.apache.org/1.5/accumulo_user_manual.html#_isolated_scanner
http://accumulo.apache.org/1.5/examples/isolation.html
http://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/client/IsolatedScanner.html

[2]:
http://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/iterators/user/WholeRowIterator.html

-Sean

On Fri, Jan 24, 2014 at 10:34 PM, Christopher <[EMAIL PROTECTED]> wrote:

> It's not quite clear what you mean by "load", but I think you mean
> "iterate over"?
>
> A simplified explanation is this:
>
> When you scan an Accumulo table, you are streaming each entry
> (Key/Value pair), one at a time, through your client code. They are
> only held in memory if you do that yourself in your client code. A row
> in Accumulo is the set of entries that share a particular value of the
> Row portion of the Key. They are logically grouped, but are not
> grouped in memory unless you do that.
>
> One additional note regarding your index schema of a row being a
> search term and columns being documents: you will likely have issues
> with this strategy as the number of documents for high-frequency
> terms grows, because tablets do not split in the middle of a row. With
> your schema, a row could become too large to manage on a single tablet
> server. A slight variation, like concatenating the search term with a
> document identifier in the row (term=doc1, term=doc2, ...), would
> allow the high-frequency terms to split across multiple tablets if
> they get too large. There are better strategies, but that's one simple
> option.
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
> > If I have a row whose key is a particular term, with a set of columns
> > storing the documents that the term appears in, does loading the row
> > also load the contents of all of the columns?  Is there a way to page
> > over the columns such that only N columns are in memory at any point?
> > In this particular case the documents are all in a particular column
> > family (say docs) and the column qualifier is created dynamically;
> > for argument's sake we can say they are UUIDs.
>
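To make the streaming point concrete: since entries arrive one at a time, paging over a row's columns is just a matter of how many entries the client chooses to retain. The sketch below is stdlib-only Java with illustrative names, not the Accumulo API (in real client code the iterator would come from a Scanner configured with something like Range.exact(term) and fetchColumnFamily(new Text("docs"))). It processes a row's column qualifiers in pages of at most pageSize, never holding more than one page in memory.

```java
import java.util.*;

public class ColumnPager {
    // Streams the sorted entries of one row and retains at most pageSize
    // column qualifiers at a time, clearing each page after processing it.
    // Returns the number of pages processed.
    public static int pageOverColumns(Iterator<Map.Entry<String, String>> entries,
                                      int pageSize) {
        List<String> page = new ArrayList<>(pageSize);
        int pages = 0;
        while (entries.hasNext()) {
            page.add(entries.next().getKey());  // only the current page is in memory
            if (page.size() == pageSize || !entries.hasNext()) {
                pages++;                        // process the full (or final) page here
                page.clear();                   // drop it before reading more entries
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Simulated row: docs:<uuid> qualifiers, sorted as a tablet would return them.
        SortedMap<String, String> docs = new TreeMap<>();
        for (int i = 0; i < 7; i++) docs.put("docs:uuid-" + i, "");
        System.out.println(pageOverColumns(docs.entrySet().iterator(), 3) + " pages");
    }
}
```

Seven columns with a page size of three yields three pages (3 + 3 + 1), and at no point are more than three qualifiers buffered.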
