Accumulo >> mail # user >> number of query threads for batch scanner


Re: number of query threads for batch scanner
On Fri, Sep 28, 2012 at 9:35 AM, ameet kini <[EMAIL PROTECTED]> wrote:
>
> Thanks Eric and Keith.
>
> Is there any reason why the number of concurrent scans on a given tablet
> server depends on the number of tablets and not the number of cores on that
> tablet server? I'm looking at TabletServerBatchReaderIterator.doLookups.

Not really. RFile has optimizations for seeking forward (ACCUMULO-473
has some numbers from an experiment I did), so the ranges against an
individual tablet are sorted and seeked in order. If you did break
up multiple ranges going to a single tablet, I think it would be best
to sort them and give threads sub-sequences of the sorted list to work
on. This avoids multiple threads reading from the same RFile block
and doing redundant work to decode it. Feel free to open a ticket to
explore this concept.
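
A minimal sketch of the "sort, then hand each thread a contiguous sub-sequence" idea, assuming each range is represented only by its start row as a String. The class and method names (SortedRangeChunks, partition) are hypothetical and are not part of Accumulo; nothing like this exists in TabletServerBatchReaderIterator as far as this thread says.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not Accumulo code: splits range start rows into
// contiguous runs of the sorted order, one run per worker thread.
final class SortedRangeChunks {

    // Returns at most numThreads chunks. Each chunk is a contiguous slice of
    // the sorted input, so two threads never interleave over the same region.
    static List<List<String>> partition(List<String> startRows, int numThreads) {
        if (numThreads < 1) {
            throw new IllegalArgumentException("numThreads must be >= 1");
        }
        List<String> sorted = new ArrayList<>(startRows);
        Collections.sort(sorted);

        int n = sorted.size();
        int k = Math.min(numThreads, n); // never create empty chunks
        List<List<String>> chunks = new ArrayList<>();
        int from = 0;
        for (int i = 0; i < k; i++) {
            // Spread the remainder over the first (n % k) chunks.
            int size = n / k + (i < n % k ? 1 : 0);
            chunks.add(new ArrayList<>(sorted.subList(from, from + size)));
            from += size;
        }
        return chunks;
    }
}
```

Each chunk could then be submitted as its own task. Because every thread seeks forward through its own slice, neighbouring ranges that fall in the same block would mostly be decoded by a single thread, apart from blocks that straddle a chunk boundary.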

>
> Take Keith's example:
>
>  * For 1000 ranges that map to 1 tablet, it will execute 1 concurrent scan.
>
> Say, I had 8 cores on that tablet server and my tablet is large enough to
> warrant 8 concurrent scans. Sure, I can go about and further split my
> tablet, and get 8 concurrent scans - I ended up doing that. But is there any
> reason why 8 concurrent scans can't go against a single tablet? Maybe it's
> difficult to estimate benefits of parallelism at that level, and it's best
> left to users to tune the number of tablets, and base the level of
> parallelism on the number of tablets?
>
> Btw, the shell utility "merge -s <size>" rocks :)
>
> Thanks,
> Ameet
>
>
> On Fri, Sep 28, 2012 at 8:04 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
>>
>> On Tue, Sep 25, 2012 at 3:17 PM, ameet kini <[EMAIL PROTECTED]> wrote:
>> > Thanks William.
>> >
>> > The issue here is that without knowing how the numQueryThreads
>> > translates to the number of concurrent scans, I cannot effectively
>> > tune that parameter to maximize resource usage on the tablet server.
>> > What I'm seeing is that even though there are four tablets on the
>> > tablet server, my number of concurrent scans never exceeds 3. This is
>> > despite setting numQueryThreads to a very high number and having 8
>> > cores on the tablet server. I suspect with 3 concurrent scans and no
>> > garbage collection happening at that moment, most of the cores are
>> > sitting idle.
>> >
>> > Ameet
>>
>> The amount of parallelism is determined by how your ranges map to
>> tablets. Below are some examples.
>>
>>  * For one range that maps to 10 tablets on 10 tablet servers, it will
>> execute 10 concurrent scans if numQueryThreads is >= 10.
>>  * For 1000 ranges that map to 10 tablets on 10 tablet servers, it
>> will execute 10 concurrent scans if numQueryThreads is >= 10.
>>  * For 1000 ranges that map to 10 tablets on 10 tablet servers, it
>> will execute 5 concurrent scans if numQueryThreads is 5.
>>  * For 1000 ranges that map to 1 tablet, it will execute 1 concurrent
>> scan.
>>
>> If you have more query threads than tablet servers, the client code
>> will try to execute concurrent scans on a single tablet server.
>>
>> You can look at TabletServerBatchReaderIterator.doLookups() for the
>> details.  In this method it creates QueryTask objects and places them
>> on a thread pool.  The size of the thread pool is the user specified
>> numQueryThreads.
>>
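
The four examples above are consistent with a simple rule of thumb: the number of ranges does not matter, and the concurrent scan count is capped by both the number of tablets touched and numQueryThreads. The sketch below encodes only that rule. It is a simplified model inferred from the examples, not the logic in doLookups(), and the class and method names are hypothetical. It is best read as an upper bound, since the observation earlier in the thread (4 tablets, never more than 3 concurrent scans) shows the real count can be lower.

```java
// Hypothetical model of the examples in this thread, not Accumulo code.
final class ScanParallelismModel {

    // Upper bound on concurrent scans: one per tablet touched, limited by
    // the size of the client-side thread pool (numQueryThreads).
    static int expectedConcurrentScans(int numTablets, int numQueryThreads) {
        if (numTablets < 0 || numQueryThreads < 1) {
            throw new IllegalArgumentException("need numTablets >= 0 and numQueryThreads >= 1");
        }
        return Math.min(numTablets, numQueryThreads);
    }

    public static void main(String[] args) {
        System.out.println(expectedConcurrentScans(10, 10)); // 1 range or 1000 ranges, 10 tablets: 10
        System.out.println(expectedConcurrentScans(10, 5));  // 1000 ranges, 10 tablets, 5 threads: 5
        System.out.println(expectedConcurrentScans(1, 100)); // 1000 ranges, 1 tablet: 1
    }
}
```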
>> >
>> > On Tue, Sep 25, 2012 at 3:08 PM, William Slacum <[EMAIL PROTECTED]> wrote:
>> >>
>> >> It should really be dependent upon the resources available to the
>> >> client. You can set an arbitrarily high number of threads, but you're
>> >> still bound by the number of parallel operations the CPU can make. I
>> >> would assume the sweet spot is somewhere around that number; try doing
>> >> a small benchmark with 2, 4, 8, 16, etc. threads and see where your
>> >> performance starts to level off.
>> >>
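
A rough sketch of the benchmark suggested above, assuming the query itself is supplied by the caller as a callback that takes the thread count. QueryThreadSweep and sweep are hypothetical names. In a real test the callback would create a BatchScanner with that numQueryThreads value, set the ranges, and drain the results; that part is left out here so the harness stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntConsumer;

// Hypothetical timing harness, not part of Accumulo.
final class QueryThreadSweep {

    // Runs runQuery once per thread count and records wall-clock time in
    // nanoseconds, keyed by thread count in the order given.
    static Map<Integer, Long> sweep(int[] threadCounts, IntConsumer runQuery) {
        Map<Integer, Long> elapsedNanos = new LinkedHashMap<>();
        for (int threads : threadCounts) {
            long start = System.nanoTime();
            runQuery.accept(threads); // caller runs the full query with this many threads
            elapsedNanos.put(threads, System.nanoTime() - start);
        }
        return elapsedNanos;
    }
}
```

Calling sweep(new int[] {2, 4, 8, 16}, n -> runMyQuery(n)), where runMyQuery is your own method, gives one timing per setting. A single run per setting is noisy, so repeating the sweep a few times and discarding the first pass (cold caches) would give a fairer picture of where throughput levels off.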
>> >>
>> >> On Tue, Sep 25, 2012 at 11:45 AM, ameet kini <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> Probably worth adding that the table mentioned below has a bunch of