Re: BatchScanning with a very large collection of ranges
John Stoneham 2013-01-25, 14:28
You also have some other options. One would be to use an IteratorChain to
string together the results of several BatchScanners, which you could kick
off in parallel to batch up your reads.
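
The IteratorChain here is presumably Apache Commons Collections'
org.apache.commons.collections.iterators.IteratorChain. Assuming so, a
minimal sketch of the chaining approach (table name, slice size, and thread
count are placeholders, and "connector" is an already-built Connector; this
is an illustration, not code from the thread):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.commons.collections.iterators.IteratorChain;

// Chain several BatchScanners, each covering one slice of the ranges.
// Creating each iterator should start that scanner's background fetch, so
// later slices are already in flight while the first is being consumed.
@SuppressWarnings("unchecked")
public static Iterator<Map.Entry<Key, Value>> chainedScan(Connector connector,
    String table, List<Range> ranges, int sliceSize, int threads)
    throws TableNotFoundException {
  IteratorChain chain = new IteratorChain();
  for (int i = 0; i < ranges.size(); i += sliceSize) {
    List<Range> slice =
        new ArrayList<Range>(ranges.subList(i, Math.min(i + sliceSize, ranges.size())));
    BatchScanner bs = connector.createBatchScanner(table, new Authorizations(), threads);
    bs.setRanges(slice);
    chain.addIterator(bs.iterator());
  }
  // Note: a production version would close each BatchScanner once its slice
  // is exhausted; scanners hold client threads until closed.
  return chain;
}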

Or, writing this in a sequence model: use the
Iterator<Map.Entry<Key,Value>> from the indexScanner to feed an
Iterator<Map.Entry<Key,Value>> of your own creation that produces document
key/values. As you request document key/values with next(), it prefetches a
number of index key/values, runs a batch scan over them, and queues the
results for you. When it runs out of document results, it repeats. This
model has been successful for us when hitting a term index to pull millions
of source records without loading them all into client memory at the same
time.
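
A rough sketch of that streaming model (again an illustration, not John's
actual code; the doc-table name, batch size, and thread count are
placeholders). It keeps at most one batch of document results queued in
client memory at a time:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// Streams document entries while pulling docIDs from the index in
// fixed-size batches, so the full range list is never materialized.
public class DocStreamIterator implements Iterator<Map.Entry<Key, Value>> {

  private final Connector connector;
  private final String docTable;
  private final Iterator<Map.Entry<Key, Value>> indexIter;
  private final int batchSize;
  private final int threads;

  private BatchScanner scanner;                  // scanner for the current batch
  private Iterator<Map.Entry<Key, Value>> batch; // results of the current batch

  public DocStreamIterator(Connector connector, String docTable,
      Iterator<Map.Entry<Key, Value>> indexIter, int batchSize, int threads) {
    this.connector = connector;
    this.docTable = docTable;
    this.indexIter = indexIter;
    this.batchSize = batchSize;
    this.threads = threads;
  }

  public boolean hasNext() {
    // Refill from the index whenever the current batch is exhausted.
    while ((batch == null || !batch.hasNext()) && indexIter.hasNext()) {
      fetchNextBatch();
    }
    return batch != null && batch.hasNext();
  }

  public Map.Entry<Key, Value> next() {
    if (!hasNext())
      throw new NoSuchElementException();
    return batch.next();
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }

  // Prefetch up to batchSize docIDs from the index and batch-scan them.
  private void fetchNextBatch() {
    if (scanner != null)
      scanner.close(); // release the previous batch's threads
    List<Range> ranges = new ArrayList<Range>(batchSize);
    while (indexIter.hasNext() && ranges.size() < batchSize) {
      Text docID = indexIter.next().getKey().getColumnQualifier();
      ranges.add(new Range(docID));
    }
    try {
      scanner = connector.createBatchScanner(docTable, new Authorizations(), threads);
    } catch (TableNotFoundException e) {
      throw new RuntimeException(e);
    }
    scanner.setRanges(ranges);
    batch = scanner.iterator();
  }
}

(A real version would also close the final scanner when iteration ends.)
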
On Wed, Jan 23, 2013 at 1:51 PM, Keith Turner <[EMAIL PROTECTED]> wrote:

> How much data is coming back, and what's the data rate?  You can sum up
> the size of the keys and values in your loop.
>
> On Wed, Jan 23, 2013 at 1:24 PM, Slater, David M.
> <[EMAIL PROTECTED]> wrote:
> > First, thanks to everyone for their responses to my previous questions.
> > (Mike, I’ll definitely take a look at Brian’s materials for iterator
> > behavior.)
> >
> > Now I’m doing some sharded document querying (where the documents are
> > small but numerous), where I’m trying not just to get the list of
> > matching documents but also to return all of them (they are also stored
> > in Accumulo). However, I’m running into a bottleneck in the retrieval
> > process. It seems that the BatchScanner is quite slow at retrieving
> > information when there is a very large number of (small) ranges
> > (entries, i.e. docs), and increasing the thread count doesn’t seem to
> > help.
> >
> > Basically, I’m taking all of the docIDs that are returned from the
> > index process, making a new Range(docID), adding that to
> > Collection<Range> ranges, and then adding those ranges to the new
> > BatchScanner to return the information:
> >
> > …
> >
> > Collection<Range> docRanges = new LinkedList<Range>();
> >
> > // Go through index table here
> > for (Map.Entry<Key, Value> entry : indexScanner) {
> >     Text docID = entry.getKey().getColumnQualifier();
> >     docRanges.add(new Range(docID));
> > }
> >
> > int threadCount = 20;
> > String docTableName = "docTable";
> >
> > BatchScanner docScanner = connector.createBatchScanner(docTableName,
> >         new Authorizations(), threadCount);
> > docScanner.setRanges(docRanges); // large collection of ranges
> >
> > // retrieve doc data
> > for (Map.Entry<Key, Value> doc : docScanner) {
> >     ...
> > }
> >
> > …
> >
> > Is this a naïve way of doing this? Would trying to group documents into
> > larger ranges (when adjacent) be a more viable approach?
> >
> > Thanks,
> > David
>

--
John Stoneham
[EMAIL PROTECTED]