Re: BatchScanning with a very large collection of ranges
Keith Turner 2013-01-23, 18:51
How much data is coming back, and what's the data rate? You can sum up
the size of the keys and values in your loop.
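
For example (a rough sketch, not from the original message, assuming the
docScanner loop shown below; Key.getSize() and Value.getSize() report the
serialized sizes in bytes, and the variable names are illustrative):

    long bytesReturned = 0;
    long entriesReturned = 0;
    long startMs = System.currentTimeMillis();
    for (Map.Entry<Key, Value> doc : docScanner) {
        // sum key + value sizes to see how much data actually comes back
        bytesReturned += doc.getKey().getSize() + doc.getValue().getSize();
        entriesReturned++;
    }
    long elapsedMs = System.currentTimeMillis() - startMs;
    // data rate ~= bytesReturned / (elapsedMs / 1000.0) bytes per second
    System.out.printf("%d entries, %d bytes in %d ms%n",
            entriesReturned, bytesReturned, elapsedMs);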

On Wed, Jan 23, 2013 at 1:24 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:
> First, thanks to everyone for their responses to my previous questions.
> (Mike, I’ll definitely take a look at Brian’s materials for iterator
> behavior.)
>
>
>
> Now I’m doing some sharded document querying (the documents are small but
> numerous), and I’m trying to retrieve not just the list of matching
> documents but the documents themselves (they are also stored in Accumulo).
> However, I’m running into a bottleneck in the retrieval process. The
> BatchScanner seems quite slow at retrieving data when given a very large
> number of (small) ranges (entries, i.e. docs), and increasing the thread
> count doesn’t seem to help.
>
>
>
> Basically, I’m taking all of the docIDs returned from the index lookup,
> constructing a new Range(docID) for each, adding those to a
> Collection<Range>, and then setting that collection of ranges on a new
> BatchScanner to retrieve the document data:
>
>
>
> …
>
> Collection<Range> docRanges = new LinkedList<Range>();
> for (Map.Entry<Key, Value> entry : indexScanner) { // go through index table here
>     Text docID = entry.getKey().getColumnQualifier();
>     docRanges.add(new Range(docID));
> }
>
> int threadCount = 20;
> String docTableName = "docTable";
> BatchScanner docScanner = connector.createBatchScanner(docTableName,
>         new Authorizations(), threadCount);
> docScanner.setRanges(docRanges); // large collection of ranges
>
> for (Map.Entry<Key, Value> doc : docScanner) { // retrieve doc data
>     ...
> }
>
> …
>
>
>
> Is this a naïve way of doing this? Would trying to group documents into
> larger ranges (when adjacent) be a more viable approach?
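>
> A rough sketch of that grouping (assuming the docIDs are iterated in sorted
> order, and with isAdjacent() as a placeholder for whatever "adjacent" means
> for these IDs) might look like:
>
> List<Range> merged = new ArrayList<Range>();
> Text first = null, last = null;
> for (Text docID : sortedDocIDs) {
>     if (first == null) {
>         first = docID;
>     } else if (!isAdjacent(last, docID)) { // end of a run of adjacent docIDs
>         merged.add(new Range(first, last)); // start and end rows inclusive
>         first = docID;
>     }
>     last = docID;
> }
> if (first != null) {
>     merged.add(new Range(first, last)); // close the final run
> }
>
> (Range.mergeOverlapping() can also collapse overlapping ranges, though
> adjacent-but-distinct docIDs would still need an explicit merge like the
> above.)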
>
>
>
> Thanks,
>
> David