Thanks Keith, that was very helpful.
As for your comment "Multiple threads can scan a tablet concurrently", is there any way to force a BatchScanner to run at most one thread on a tablet, or to have it give the entire tablet range [a, c) to an iterator instead of breaking it up into [a, b) and [b, c) for different iterators on the same tablet?
If it is not designed to operate that way, are there methods in TabletServerBatchReader that would make sense to extend in order to add that functionality?
From: Keith Turner [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 15, 2013 3:24 PM
To: [EMAIL PROTECTED]
Subject: Re: Batchscanner and Tablet Memory
On Fri, Mar 15, 2013 at 3:08 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:
> Hi again,
> I am curious as to how Accumulo handles multiple threads in a
> Batchscanner, and what its ramifications are for memory use on a node.
> Let's say I start a Batchscanner with 20 threads, and scan across the
> entire range of rows in a table of 80 tablets, spread across 4 nodes.
> Will the Batchscanner try to spin off 20 threads if possible, or will
> it try to match it to the number of nodes? Should I try to match the
> number of threads with the number of cores that will be working on the data?
When the batch scanner has more threads than nodes, it will run
multiple scans on each node. It will only do this for nodes where it
has multiple tablets to scan. So in your example I think it may run
20/4=5 scans on each node. Each scan would access 80/20=4 tablets.
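
The arithmetic in the estimate above can be written out as a small sketch. It is only a back-of-the-envelope calculation under the assumption that tablets are spread evenly across nodes and threads divide evenly; the class and method names are hypothetical, and this is not how Accumulo itself schedules scans.

```java
public class ScanPlan {
    // Rough number of concurrent scans per node, assuming the query
    // threads are divided evenly across the nodes (an assumption).
    public static int scansPerNode(int queryThreads, int nodes) {
        return queryThreads / nodes;
    }

    // Rough number of tablets each scan covers, assuming the tablets
    // are divided evenly across the query threads (an assumption).
    public static int tabletsPerScan(int tablets, int queryThreads) {
        return tablets / queryThreads;
    }

    public static void main(String[] args) {
        int queryThreads = 20;
        int nodes = 4;
        int tablets = 80;
        System.out.println("scans per node:   " + scansPerNode(queryThreads, nodes));
        System.out.println("tablets per scan: " + tabletsPerScan(tablets, queryThreads));
    }
}
```

With the numbers from the question (20 threads, 4 nodes, 80 tablets) this gives 5 scans per node and 4 tablets per scan, matching the estimate above.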
> When a thread is spun off, my thinking is that the tablet that the
> thread is spun off on will move the entire tablet to memory, and then
> the tablet will be iterated through. Is this how it typically happens
> (or are there possibly multiple threads on the same tablet)? If so, do
> I have to worry about memory issues if, say, one of the nodes tries to
> move 10 tablets into memory, but doesn't have 20 GB of RAM left to store it?
Entire tablets are not loaded into memory when you scan a tablet.
Tablets are composed of RFiles, and RFiles are composed of blocks of key/value pairs. So only a few of these blocks from the RFiles are loaded at any given time. It is possible that these RFile blocks may be cached in the tablet server process, depending on your configuration.
Multiple threads can scan a tablet concurrently.
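
To illustrate why block-at-a-time reading keeps memory use small, here is a toy sketch. It does not use the real RFile code: `readBlock` is a hypothetical stand-in that fabricates keys for a block, and the block size and count are arbitrary assumptions. The point is only that a scan can walk every key while holding a single block in memory.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockScanSketch {
    // Hypothetical stand-in for reading one block of sorted keys from a file.
    static List<String> readBlock(int blockIndex, int blockSize) {
        List<String> block = new ArrayList<>();
        for (int i = 0; i < blockSize; i++) {
            block.add(String.format("row%05d", blockIndex * blockSize + i));
        }
        return block;
    }

    // Walks every key in the "file" while holding at most one block at a time.
    // Returns {keys seen, largest number of keys resident at once}.
    public static int[] scan(int numBlocks, int blockSize) {
        int seen = 0;
        int maxResident = 0;
        for (int b = 0; b < numBlocks; b++) {
            // The previous block is no longer referenced once this one is read.
            List<String> block = readBlock(b, blockSize);
            maxResident = Math.max(maxResident, block.size());
            for (String key : block) {
                seen++;
            }
        }
        return new int[] {seen, maxResident};
    }

    public static void main(String[] args) {
        int[] result = scan(1000, 100);
        System.out.println("keys scanned: " + result[0]);
        System.out.println("max keys held in memory at once: " + result[1]);
    }
}
```

In this sketch the memory footprint depends on the block size, not on the total size of the data scanned. Any block caching in the tablet server would be additional and is not modeled here.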
> Sorry for the vagueness of the questions, but I'm trying to understand
> how the general process works under the covers, in order to diagnose
> some performance issues I have been having.