Re: BatchScanner vs. Scanner logic
John Vines 2012-08-12, 22:32
On Sun, Aug 12, 2012 at 5:49 PM, Steven Troxell <[EMAIL PROTECTED]> wrote:
> Hi All,
> I was wondering if someone would be willing to help evaluate my reasoning
> on the use of Scanner vs. BatchScanner, and see if I'm making the proper
> The background is I am attempting to benchmark an RDF application using
> Accumulo by evaluating the impact of scaling on performance (measured by
> query return time).
> The scan patterns currently use the Scanner class, and each gets a single
> row of data. The table design/implementation is such that there is never a
> need to simultaneously scan multiple non-adjacent rows. One query from
> the GUI should effectively result in a one-time, single-range scan. The
> size of the data returned varies widely, from as few as 10 results to,
> say, millions. The return order is not significant.
> Reading the API suggests: "If you want to lookup a few ranges and
> expect those ranges to contain a lot of data, then use the Scanner
> instead", and that the use of BatchScanner should be reserved for cases of
> simultaneously wanting to use multiple ranges. It additionally feels weird
> to be using a BatchScanner on a "collection" of 1 range.
There is a lot of variety in a range. You can have a range which consists
of a single row, and therefore a single server, or you can have a range
which spans a large amount of data, up to the entire table. In the latter
case, while it may only be 1 range, it hits a lot of data. If the way your
data is laid out in Accumulo guarantees that a query hits 1 rowID, then
using a Scanner vs. a BatchScanner for that 1 range will make no difference.
> That said, my results so far show that scaling is not adding much:
> performance peaks at 6 machines and drops beyond that number. This
> contradicts the theoretical linear improvement I should be seeing. To my
> understanding, a BatchScanner scans the tservers in parallel and a
> Scanner does not. Would it be reasonable to expect that using a
> BatchScanner would let me see the effects of scaling closer to what they
> should be?
If you only pull back a single row, going from a Scanner to a BatchScanner
will make no difference. If you are iteratively scanning multiple ranges,
however, you could see performance improvements by doing it all in a
single BatchScanner, assuming that you're not doing scans dependent on the
As for the theoretical linear improvement, the underlying assumption there
is that you are in some way fully utilizing the resources before scaling up
to a larger cluster. If you're simply getting a single row, then whether
you're on 2 or 200 machines the performance should be the same, not 100
times better. Scaling out the architecture allows you to do more with it,
not necessarily do the same thing faster (although that can be a noticeable
effect if you're swamping the systems). But depending on how you're using
your data, scaling out too much could also hurt performance (this is the
case where you are doing intersections, like the DocumentPartitioned
stuff in the wiki example).
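To make the single-row point concrete, here is a minimal sketch in plain
Java. It uses no Accumulo classes: the layout() helper, the row naming, and
the one-tablet-per-server assignment are all hypothetical, chosen only to
imitate a sorted table cut at split points. It counts how many servers a
set of row lookups would have to contact.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TabletLookupSketch {
    // Hypothetical layout: one tablet per server. Each key is the last row
    // of a tablet (its split point); the value is the hosting server's id.
    static NavigableMap<String, Integer> layout(int numServers) {
        NavigableMap<String, Integer> tablets = new TreeMap<>();
        for (int i = 0; i < numServers; i++) {
            tablets.put(String.format("row%05d", (i + 1) * 100), i);
        }
        return tablets;
    }

    // A row lives in the tablet with the smallest split point >= the row.
    // Rows past the last split point fall into the last tablet.
    static int serverFor(NavigableMap<String, Integer> tablets, String row) {
        Map.Entry<String, Integer> e = tablets.ceilingEntry(row);
        return e != null ? e.getValue() : tablets.lastEntry().getValue();
    }

    // Distinct servers that must be contacted to read the given rows.
    static Set<Integer> serversContacted(NavigableMap<String, Integer> tablets,
                                         Collection<String> rows) {
        Set<Integer> servers = new TreeSet<>();
        for (String row : rows) {
            servers.add(serverFor(tablets, row));
        }
        return servers;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 200}) {
            Set<Integer> hit =
                serversContacted(layout(n), Arrays.asList("row00042"));
            System.out.println(n + " servers, single row -> contacts " + hit.size());
        }
    }
}
```

Under these assumptions a single-row lookup contacts one server no matter
how large the cluster is, so adding machines cannot shorten it. Only a
lookup set that touches several tablets has any work to spread out.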
> My logic here is that I have X rows spread out across 10 machines. Right
> now, whether I'm using 1 machine or 10 machines, it is iteratively
> scanning all rows. If I used a BatchScanner, would I be guaranteed to cut
> the time to result to that of a lookup on 1 machine, instead of the
> average case of 5 machines, or worst case of 10 (assuming uniform data
> distribution and various other assumptions)?
From everything I gathered about your setup, grabbing a single rowID using
a Scanner would show no performance difference compared to a BatchScanner.
If your rows are not necessarily keyed by rowID, there is the potential to
get results back faster, because if your row spans multiple tablets, they
will all return sooner than if you had scanned them successively, as the
Scanner does. And if you're grabbing X rows, you can grab all X
simultaneously instead of iteratively, as you would with Scanners (and
without having to do your own threading on the client).
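The difference between the two access patterns for X rows can be sketched
in plain Java. This is only an illustration, not Accumulo code: the lookup
function stands in for one client-to-server round trip, the 50 ms sleep is
a made-up figure, and the class and method names are hypothetical.
sequential() imitates a loop that issues one Scanner lookup after another;
concurrent() imitates handing all the lookups over at once, which is
roughly the client-side threading a BatchScanner saves you from writing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ScanPatternSketch {
    // One lookup at a time: total time is roughly the sum of all lookups.
    static <R> List<R> sequential(List<String> rows, Function<String, R> lookup) {
        List<R> out = new ArrayList<>();
        for (String row : rows) {
            out.add(lookup.apply(row));
        }
        return out;
    }

    // All lookups submitted up front to a fixed pool of worker threads.
    // Results are collected in submission order here for easy comparison;
    // a real BatchScanner makes no ordering promise.
    static <R> List<R> concurrent(List<String> rows, Function<String, R> lookup,
                                  int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String row : rows) {
                Callable<R> task = () -> lookup.apply(row);
                futures.add(pool.submit(task));
            }
            List<R> out = new ArrayList<>();
            for (Future<R> f : futures) {
                try {
                    out.add(f.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                } catch (ExecutionException e) {
                    throw new RuntimeException(e.getCause());
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a server round trip; the 50 ms delay is hypothetical.
        Function<String, String> lookup = row -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "value-for-" + row;
        };
        List<String> rows = Arrays.asList("row1", "row2", "row3", "row4");

        long t0 = System.nanoTime();
        List<String> a = sequential(rows, lookup);
        long t1 = System.nanoTime();
        List<String> b = concurrent(rows, lookup, rows.size());
        long t2 = System.nanoTime();

        System.out.println("same results:  " + a.equals(b));
        System.out.println("sequential ms: " + (t1 - t0) / 1_000_000);
        System.out.println("concurrent ms: " + (t2 - t1) / 1_000_000);
    }
}
```

With a single row in the list, both methods do the same one lookup, which
matches the point above that switching scanner types changes nothing for a
single rowID. Any gain appears only when there are several independent
lookups to overlap.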