Accumulo >> mail # user >> Accumulo Utilities


Re: Accumulo Utilities
On Thu, Mar 28, 2013 at 12:15 PM,  <[EMAIL PROTECTED]> wrote:
> Thanks! I like the idea of sending my own thread pool to the batch scanner, that would definitely be the better solution.

Would you like to open a ticket about this issue?

I just remembered, there is an issue w/ this approach to be aware of.
I have seen this when multiple threads share a batch scanner (more
on this below).  Consider the following situation.

 1. Thread A gives a lot of work to BatchScanner1 using ThreadPool1,
creating BatchScannerIterator1
 2. BatchScannerIterator1's internal queue fills up as a result of work
given by Thread A
 3. All threads in ThreadPool1 block trying to add to
BatchScannerIterator1's queue
 4. Thread B gives a lot of work to BatchScanner2 using ThreadPool1,
creating BatchScannerIterator2
 5. Thread B attempts to iterate over BatchScannerIterator2, but
blocks forever because no threads service it

This problem occurs because Thread A never reads from
BatchScannerIterator1, so the pool threads blocked on its full queue
are never freed to do Thread B's work.
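
The situation above can be reproduced without Accumulo at all.  What
follows is only a sketch using plain java.util.concurrent classes, not
the batch scanner code: the two bounded queues stand in for the
iterators' internal queues, the fixed pool stands in for ThreadPool1,
and the names SharedPoolStarvation, firstResultForB, fill and drain are
made up for illustration.  A one second timeout stands in for "blocks
forever" so the demo terminates.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedPoolStarvation {

    /**
     * Returns the first result consumer B sees, or null if B gets nothing
     * within the timeout because every pool thread is stuck on A's queue.
     */
    public static String firstResultForB(boolean consumerAReads) {
        final int poolSize = 2;
        final int itemsPerTask = 10;
        ExecutorService sharedPool = Executors.newFixedThreadPool(poolSize); // stands in for ThreadPool1
        BlockingQueue<String> queueA = new ArrayBlockingQueue<>(1); // stands in for iterator 1's queue
        BlockingQueue<String> queueB = new ArrayBlockingQueue<>(1); // stands in for iterator 2's queue
        try {
            // Steps 1-3: A's work takes every pool thread; each task blocks in put() once queueA is full.
            for (int t = 0; t < poolSize; t++) {
                sharedPool.execute(() -> fill(queueA, "a", itemsPerTask));
            }
            // Step 4: B's work waits in the pool's task queue behind A's tasks.
            sharedPool.execute(() -> fill(queueB, "b", 1));

            if (consumerAReads) {
                // The healthy case: someone drains A's queue, so pool threads get freed.
                Thread threadA = new Thread(() -> drain(queueA, poolSize * itemsPerTask));
                threadA.setDaemon(true);
                threadA.start();
            }
            // Step 5: B tries to read its results.
            return queueB.poll(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            sharedPool.shutdownNow(); // interrupts any task still blocked in put()
        }
    }

    private static void fill(BlockingQueue<String> queue, String prefix, int count) {
        try {
            for (int n = 0; n < count; n++) {
                queue.put(prefix + n); // blocks while the queue is full
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void drain(BlockingQueue<String> queue, int count) {
        try {
            for (int n = 0; n < count; n++) {
                queue.take();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling firstResultForB(false) returns null, since nothing ever runs
B's task.  Calling firstResultForB(true) returns "b0", because draining
A's queue lets the pool threads finish and pick up B's work.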

In the current code, multiple threads can use a BatchScanner.  You
just need to make configuring the BatchScanner and getting an iterator
an atomic operation.   When an iterator is created by a batch scanner,
it copies the config that exists at that point in time.  Changes to the
BatchScanner config after an iterator is created will not affect the
iterator.
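
As a rough sketch of what "atomic" could look like on the client side:
the SnapshotScanner class below is a hypothetical stand-in, not the
real BatchScanner, and only mimics the copy-config-on-iterator behavior
described above.  The configureAndScan helper is likewise made up; it
holds one lock across both calls so that another thread's config cannot
slip in between them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class AtomicScanSetup {

    /** Hypothetical stand-in for a batch scanner; not the Accumulo class. */
    public static class SnapshotScanner {
        private List<String> ranges = new ArrayList<>();

        public void setRanges(List<String> newRanges) {
            ranges = new ArrayList<>(newRanges);
        }

        public Iterator<String> iterator() {
            // Copy the config as it exists right now. A real scanner would return
            // entries; this stand-in just hands back the ranges it was created with.
            return new ArrayList<>(ranges).iterator();
        }
    }

    /** Configure the shared scanner and create its iterator under one lock. */
    public static Iterator<String> configureAndScan(SnapshotScanner shared, List<String> ranges) {
        synchronized (shared) {
            shared.setRanges(ranges);
            return shared.iterator();
        }
    }
}
```

Every thread sharing the scanner would have to go through the same
helper (or lock on the same object) for this to hold.  Once the
iterator exists, later setRanges calls by other threads do not change
what it returns.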

>
> Yeah I thought about creating a batch scanner with only one thread, but I was not sure if that is making a separate thread (outside of the current one) or using the current one. At the time I did not want a new thread to be created at all. Though, didn't realize the Scanner was also spinning up a thread at all, thought that was in process.

The batch scanner will create a new thread pool w/ one thread.

>
> To mitigate the separate RPC call per range, would it make more sense to do a "binRanges" based on the ranges at the tablets to reduce the number of ranges?

You probably do not want to combine ranges; that could bring back data
in the gaps between ranges.
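
A toy illustration of the gap problem, using integer row ids and
inclusive [start, end] pairs in place of real keys and Range objects
(RangeGapDemo, scan and combine are names invented for this sketch):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeGapDemo {

    /** Returns the rows that fall inside at least one inclusive [start, end] range. */
    public static List<Integer> scan(int[] rows, int[][] ranges) {
        List<Integer> out = new ArrayList<>();
        for (int row : rows) {
            for (int[] range : ranges) {
                if (row >= range[0] && row <= range[1]) {
                    out.add(row);
                    break; // do not add the same row twice
                }
            }
        }
        return out;
    }

    /** Collapses several ranges into the single range that spans all of them. */
    public static int[] combine(int[][] ranges) {
        int lo = ranges[0][0];
        int hi = ranges[0][1];
        for (int[] range : ranges) {
            lo = Math.min(lo, range[0]);
            hi = Math.max(hi, range[1]);
        }
        return new int[] {lo, hi};
    }
}
```

With rows 1 through 9 and the ranges [1,3] and [7,9], scanning the two
ranges separately returns 1, 2, 3, 7, 8, 9.  Scanning the combined
range [1,9] also returns 4, 5 and 6, which nobody asked for.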

>
> On Mar 28, 2013, at 11:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
>
>> I took a quick look at the code. Excluding the threading issue, a
>> major conceptual difference is that BatchScannerWithScanners seems to
>> do a RPC round trip for each range.   The TabletServerBatchReader
>> sends all of the ranges that a tablet server needs to lookup in one
>> RPC.
>>
>> Instead of creating a BatchScannerWithScanners, maybe you could create
>> a batch scanner with just one thread when resources are exceeded?
>> This will be similar to what you are doing now, just one thread will
>> be doing work fetching data.  The client thread would just be waiting
>> on this background thread.   Although this does allow the processing
>> of results to happen concurrently with fetching of data.  Using
>> BatchScannerWithScanners would not allow this.
>>
>> Something to be aware of, the regular scanner will spin up a read
>> ahead thread if you read a lot of data through it.  It does not do
>> this immediately, only after fetching a few batches of key value pairs
>> from the tablet server.  If this happens you could have one thread
>> fetching data while the client thread processes results.
>>
>> Do you think we should open a ticket about giving users control over
>> threads created by client code?    Maybe users could pass in their own
>> thread pool to a batch scanner?
>>
>>
>> Keith
>>
>> On Thu, Mar 28, 2013 at 11:00 AM,  <[EMAIL PROTECTED]> wrote:
>>> In some of my projects, we needed to control the number of threads spun up with the use of multiple batch scanners. We created a utility to control the number of threads, and if the max threads has been reached, return a batch scanner that is actually backed by Scanners. Wanted to get any feedback on the code. Seems like such a simple thing to do, I bet someone already has this. Thanks!
>>>
>>> https://github.com/calrissian/mango/tree/master/accumulo
>