The simple intuition on batch scanners is that they provide parallelism by
having multiple fetch threads to contact multiple servers at once. It's
pretty common to structure your key so that records are likely to be
scattered around your cluster, either by using a hash or random number
inside your row id. It's important to note that batch scanners don't
guarantee that the records returned will be in order.
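To make the "hash inside your row id" idea concrete, here is a minimal, self-contained sketch; the MD5-prefix scheme and the names are my own illustration, not a prescribed Accumulo convention:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedRowId {
    // Prepend the first byte of an MD5 digest, rendered as two hex
    // characters, so that sequential logical ids scatter across the
    // keyspace (and therefore, very likely, across tablets).
    public static String hashedRowId(String logicalId) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(logicalId.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x_%s", digest[0], logicalId);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        // Sequential ids get different prefixes, so they sort far apart.
        System.out.println(hashedRowId("order-0001"));
        System.out.println(hashedRowId("order-0002"));
    }
}
```

The trade-off, of course, is that a hashed prefix destroys the ability to do meaningful single-range scans over consecutive logical ids.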
The actual parallelism achieved is controlled by three factors: the
distribution of tablets, the number of CPUs on the machines to which the
tablets of interest are assigned, and the distribution of your keyspace.
As an example, if you have 10 tablets and 5 rows that contain interesting
data, but all 5 rows are on tablets hosted on the same server and you only
have 4 processors, the actual parallelism will be 4. So you need to use
keyspace and cluster design strategies that will distribute key-value pairs
among tablets and processors.
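The arithmetic in the example above can be written down directly; this is just a back-of-the-envelope sketch with variable names of my own choosing, not anything Accumulo computes for you:

```java
public class ParallelismEstimate {
    // Effective parallelism is bounded both by the number of distinct
    // ranges of interest and by the total CPUs on the servers that
    // actually host the tablets containing those ranges.
    static int effectiveParallelism(int interestingRanges,
                                    int serversHostingRanges,
                                    int cpusPerServer) {
        return Math.min(interestingRanges, serversHostingRanges * cpusPerServer);
    }

    public static void main(String[] args) {
        // 5 interesting rows, all on tablets hosted by 1 server with 4 CPUs:
        System.out.println(effectiveParallelism(5, 1, 4)); // prints 4
    }
}
```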
At one point, I believe the tablet itself was also a limiting factor to
parallelism, i.e. if you have 5 rows in the same tablet but have 16
processors, then the parallelism is still 1. I'm not sure if this is still
the case.
On Sun, Aug 12, 2012 at 5:49 PM, Steven Troxell <[EMAIL PROTECTED]>wrote:
> Hi All,
> I was wondering if someone would be willing to help evaluate my reasoning
> on the use of Scanner vs. BatchScanner, and see if I'm making the proper
> choice.
> The background is I am attempting to benchmark an RDF application using
> Accumulo by evaluating the impact of scaling on performance (measured by
> query return time).
> The scan patterns currently use the Scanner class and fetch a single row
> of data. The table design/implementation is such that there is never a
> need to simultaneously scan multiple non-adjacent rows. One query from
> the GUI should effectively result in a one-time, single-range scan. The
> size of the data returned varies widely, from as few as 10 results to
> millions. The return order is not significant.
> Reading the API suggests: "If you want to lookup a few ranges and
> expect those ranges to contain a lot of data, then use the Scanner
> instead", and that the use of BatchScanner should be reserved for cases
> where you want to use multiple ranges simultaneously. It additionally
> feels weird to be using BatchScanner on a "collection" of 1 range.
> That said, my performance so far shows scaling is not adding much:
> performance peaks at 6 machines, with drops in performance beyond that.
> This contradicts the roughly linear improvement I should be seeing. To my
> understanding, a BatchScanner scans the tablet servers in parallel, while
> a Scanner does not. Would it be reasonable to expect that using a
> BatchScanner would let me see the effects of scaling closer to what they
> should be?
> My logic here is that I have X rows spread out across 10 machines. Right
> now, whether I'm using 1 machine or 10 machines, it is iteratively
> scanning all rows. If I batch-scanned, would the time to result be
> reduced to roughly that of a lookup on 1 machine, instead of the average
> case of 5 machines, or the worst case of 10 (assuming uniform data
> distribution and various other assumptions)?
> I'm mainly just questioning this because the API info on
> Scanner/BatchScanner suggests Scanner is the desired choice for my
> application, but I don't see how switching to BatchScanner, even if I'm
> perhaps not utilizing it as intended, wouldn't improve scaling potential.
> Thanks in advance for thoughts/insight.