Accumulo, mail # user - Batchscanner and Tablet Memory


Slater, David M. 2013-03-15, 19:08
Keith Turner 2013-03-15, 19:23
Slater, David M. 2013-03-21, 20:02
Keith Turner 2013-03-21, 20:38

RE: Batchscanner and Tablet Memory
Slater, David M. 2013-03-21, 22:01
Awesome, thank you!

-----Original Message-----
From: Keith Turner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 21, 2013 4:39 PM
To: [EMAIL PROTECTED]
Subject: Re: Batchscanner and Tablet Memory

On Thu, Mar 21, 2013 at 4:02 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:
> Thanks Keith, that was very helpful.
>
> As for your comment "Multiple threads can scan a tablet concurrently", is there any way to force a BatchScanner to run at most one thread on a tablet, or to have it give the entire tablet range [a, c) to an iterator instead of breaking it up into [a, b) and [b, c) for different iterators on the same tablet?

A batch scanner will not use more than one thread to scan an
individual tablet. I was just responding to your question asking if
multiple threads can scan a tablet. If multiple scanners and batch
scanners are running, then you could have multiple threads scanning a tablet.

>
> If it is not designed to operate that way, are there methods in TabletServerBatchReader that would make sense to extend in order to add that functionality?
>
> Best regards,
> David
>
> -----Original Message-----
> From: Keith Turner [mailto:[EMAIL PROTECTED]]
> Sent: Friday, March 15, 2013 3:24 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Batchscanner and Tablet Memory
>
> On Fri, Mar 15, 2013 at 3:08 PM, Slater, David M.
> <[EMAIL PROTECTED]> wrote:
>> Hi again,
>>
>>
>>
>> I am curious as to how Accumulo handles multiple threads in a
>> BatchScanner, and what the ramifications are for memory use on a node.
>>
>>
>>
>> Let's say I start a BatchScanner with 20 threads, and scan across the
>> entire range of rows in a table of 80 tablets, spread across 4 nodes.
>> Will the BatchScanner try to spin off 20 threads if possible, or will
>> it try to match it to the number of nodes? Should I try to match the
>> number of threads with the number of cores that will be working on the data?
>>
>>
>
> When the batch scanner has more threads than nodes, it will run
> multiple scans on each node.   It will only do this for nodes where it
> has multiple tablets to scan.   So in your example I think it may run
> 20/4=5 scans on each node.  Each scan would access 80/20=4 tablets.
>
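
The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in the reply, assuming tablets are spread evenly across nodes and the query threads divide evenly among them; the class and method names are made up for illustration and are not part of the Accumulo API.

```java
// Hypothetical helper, not an Accumulo class: it only reproduces the
// back-of-the-envelope numbers from the reply.
final class BatchScanEstimate {

    /** Concurrent scans per node, assuming threads are split evenly over the nodes. */
    static int scansPerNode(int queryThreads, int nodes) {
        return queryThreads / nodes;
    }

    /** Tablets handled by each scan, assuming tablets are split evenly over the scans. */
    static int tabletsPerScan(int tablets, int queryThreads) {
        return tablets / queryThreads;
    }

    public static void main(String[] args) {
        int queryThreads = 20;
        int nodes = 4;
        int tablets = 80;

        System.out.println("scans per node:   " + scansPerNode(queryThreads, nodes));    // 5
        System.out.println("tablets per scan: " + tabletsPerScan(tablets, queryThreads)); // 4
    }
}
```

With an uneven tablet distribution the real numbers will differ, since a node only gets multiple scans when it holds multiple tablets to scan.
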
>>
>> When a thread is spun off, my thinking is that the tablet that the
>> thread is spun off on will move the entire tablet to memory, and then
>> the tablet will be iterated through. Is this how it typically happens
>> (or are there possibly multiple threads on the same tablet)? If so, do
>> I have to worry about memory issues if, say, one of the nodes tries
>> to move 10 tablets into memory, but doesn't have 20 GB of RAM left to store them?
>
> Entire tablets are not loaded into memory when you scan a tablet.
> Tablets are composed of RFiles. RFiles are composed of blocks of
> key/value pairs, so only a few of these blocks are loaded at any given
> time. It is possible that these RFile blocks may be cached in the
> tablet server process, depending on your configuration.
>
> Multiple threads can scan a tablet concurrently.
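
To make the point about blocks concrete, here is a toy model of block-at-a-time scanning. It is a sketch only and does not reflect Accumulo's actual RFile code: `BlockFile`, `BlockIterator`, and `countEntries` are hypothetical names, and it assumes a single block is held at a time, whereas the reply says only that a few blocks are loaded at any given time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class BlockwiseScanSketch {

    /** Hypothetical stand-in for a file made of blocks of sorted key/value entries. */
    static final class BlockFile {
        private final List<List<String>> blocks;
        int blocksLoaded = 0; // number of block loads performed so far

        BlockFile(List<List<String>> blocks) { this.blocks = blocks; }

        int blockCount() { return blocks.size(); }

        /** Copies one block, simulating a read from disk into memory. */
        List<String> loadBlock(int index) {
            blocksLoaded++;
            return new ArrayList<>(blocks.get(index));
        }
    }

    /** Iterates over all entries while holding only the current block. */
    static final class BlockIterator implements Iterator<String> {
        private final BlockFile file;
        private int nextBlock = 0;
        private Iterator<String> current = Collections.emptyIterator();

        BlockIterator(BlockFile file) { this.file = file; }

        @Override
        public boolean hasNext() {
            // Load the next block only once the current one is exhausted.
            while (!current.hasNext() && nextBlock < file.blockCount()) {
                current = file.loadBlock(nextBlock++).iterator();
            }
            return current.hasNext();
        }

        @Override
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            return current.next();
        }
    }

    /** Scans the whole file and counts entries without materializing them all. */
    static int countEntries(BlockFile file) {
        int count = 0;
        for (Iterator<String> it = new BlockIterator(file); it.hasNext(); it.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        BlockFile file = new BlockFile(Arrays.asList(
                Arrays.asList("a=1", "b=2"),
                Arrays.asList("c=3"),
                Arrays.asList("d=4", "e=5", "f=6")));
        System.out.println("entries=" + countEntries(file)
                + ", block loads=" + file.blocksLoaded);
    }
}
```

In this model the memory held by a scan is bounded by the block size rather than the tablet size, which is the reason a scan over many large tablets does not need RAM for all of them at once.
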
>
>>
>>
>>
>> Sorry for the vagueness of the questions, but I'm trying to
>> understand how the general process works under the covers, in order
>> to diagnose some performance issues I have been having.
>>
>>
>>
>> Thanks,
>> David