RE: Batchscanner and Tablet Memory
Slater, David M. 2013-03-21, 22:01
Awesome, thank you!
From: Keith Turner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 21, 2013 4:39 PM
To: [EMAIL PROTECTED]
Subject: Re: Batchscanner and Tablet Memory
On Thu, Mar 21, 2013 at 4:02 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:
> Thanks Keith, that was very helpful.
> As for your comment "Multiple threads can scan a tablet concurrently", is there any way to force a BatchScanner to run at most one thread on a tablet, or to have it give the entire tablet range [a, c) to an iterator instead of breaking it up into [a, b) and [b, c) for different iterators on the same tablet?
A batch scanner will not use more than one thread to scan an
individual tablet. I was just responding to your question asking if
multiple threads can scan a tablet. If there are multiple scanners
or batch scanners, then you could have multiple threads scanning the
same tablet.
> If it is not designed to operate that way, are there methods in TabletServerBatchReader that would make sense to extend in order to add that functionality?
> Best regards,
> -----Original Message-----
> From: Keith Turner [mailto:[EMAIL PROTECTED]]
> Sent: Friday, March 15, 2013 3:24 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Batchscanner and Tablet Memory
> On Fri, Mar 15, 2013 at 3:08 PM, Slater, David M.
> <[EMAIL PROTECTED]> wrote:
>> Hi again,
>> I am curious as to how Accumulo handles multiple threads in a
>> Batchscanner, and what its ramifications are for memory use on a node.
>> Let's say I start a Batchscanner with 20 threads, and scan across the
>> entire range of rows in a table of 80 tablets, spread across 4 nodes.
>> Will the Batchscanner try to spin off 20 threads if possible, or will
>> it try to match it to the number of nodes? Should I try to match the
>> number of threads with the number of cores that will be working on the data?
> When the batch scanner has more threads than nodes, it will run
> multiple scans on each node. It will only do this for nodes where it
> has multiple tablets to scan. So in your example I think it may run
> 20/4=5 scans on each node. Each scan would access 80/20=4 tablets.
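
The arithmetic in the reply above can be written out as a small sketch. This is only an illustration of the estimate given in the thread, not Accumulo's actual scheduling logic; the function name `estimate_scan_layout` is hypothetical, and it assumes tablets are spread evenly across nodes and that every thread is used.

```python
from typing import Tuple


def estimate_scan_layout(threads: int, nodes: int, tablets: int) -> Tuple[int, int]:
    """Rough estimate following the thread's arithmetic (hypothetical helper).

    Assumes an even spread of tablets over nodes; the real batch scanner
    may behave differently.
    """
    # Threads are divided among the nodes, with at least one scan per node.
    scans_per_node = max(1, threads // nodes)
    total_scans = scans_per_node * nodes
    # Ceiling division: each scan covers its share of the tablets.
    tablets_per_scan = -(-tablets // total_scans)
    return scans_per_node, tablets_per_scan


# The example from the question: 20 threads, 4 nodes, 80 tablets.
scans_per_node, tablets_per_scan = estimate_scan_layout(threads=20, nodes=4, tablets=80)
print(scans_per_node, tablets_per_scan)
```

For the numbers in the question this reproduces the 5 scans per node and 4 tablets per scan mentioned in the reply.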
>> When a thread is spun off, my thinking is that the tablet that the
>> thread is spun off on will move the entire tablet to memory, and then
>> the tablet will be iterated through. Is this how it typically happens
>> (or are there possibly multiple threads on the same tablet)? If so, do
>> I have to worry about memory issues if, say, one of the nodes tries
>> to move 10 tablets into memory, but doesn't have 20 GB of RAM left to store it?
> Entire tablets are not loaded into memory when you scan a tablet.
> Tablets are composed of RFiles. RFiles are composed of blocks of key/value pairs, so only a few of these blocks are loaded from the RFiles at any given time. It is possible that these RFile blocks may be cached in the tablet server process, depending on your configuration.
> Multiple threads can scan a tablet concurrently.
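
The block-at-a-time reading described in the reply above can be pictured with a toy model. This is a sketch under stated assumptions, not Accumulo code: `BlockCache` and `scan_file` are hypothetical names, the "file" is just a Python list of blocks standing in for an on-disk RFile, and the small LRU cache only loosely stands in for whatever block caching the tablet server is configured to do.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny LRU cache standing in for a block cache (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()
        self.peak = 0  # largest number of blocks held at once

    def get(self, block_id, load):
        if block_id in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(block_id)
            return self._blocks[block_id]
        # Cache miss: load the block, then evict the oldest if over capacity.
        block = load(block_id)
        self._blocks[block_id] = block
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)
        self.peak = max(self.peak, len(self._blocks))
        return block


def scan_file(blocks, cache):
    """Yield key/value pairs one block at a time, never the whole file."""
    for block_id in range(len(blocks)):
        for key, value in cache.get(block_id, lambda i: blocks[i]):
            yield key, value


# A toy "file" of 5 blocks with 3 key/value pairs each.
toy_file = [[(f"row{b}{i}", b * 10 + i) for i in range(3)] for b in range(5)]
cache = BlockCache(capacity=2)
pairs = list(scan_file(toy_file, cache))
print(len(pairs), cache.peak)
```

The point of the sketch is that the scan visits every key/value pair while the cache never holds more than its configured number of blocks, which is why scanning ten tablets does not require enough RAM to hold ten tablets.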
>> Sorry for the vagueness of the questions, but I'm trying to
>> understand how the general process works under the covers, in order
>> to diagnose some performance issues I have been having.