You are absolutely right that the server-side space usage is a big concern.
One of the ways that we conserve space on the server is to do multiple
passes in which we scan a bunch of data and keep only part of it. This
gives us a trade-off between memory usage and CPU time. The space that we
use on the server gets amplified by a secondary lookup before data gets
back to the client. We've found that the optimal amount of memory to use on
the server is much larger than what can be processed before the scan buffer
fills, although it's still only on the order of a couple of megabytes.
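To make the multi-pass idea concrete, here is a minimal sketch in Python. It is not our actual iterator code. The names `multipass_sorted`, `scan`, and `budget` are made up for illustration, and it assumes the keys are distinct and that the scan can be restarted from the beginning on every pass.

```python
import heapq
from typing import Callable, Iterable, Iterator


def multipass_sorted(scan: Callable[[], Iterable[int]], budget: int) -> Iterator[int]:
    """Yield all keys from scan() in sorted order, holding at most `budget` keys.

    Hypothetical helper: each pass rescans everything but keeps only the
    `budget` smallest keys that have not been emitted yet. Assumes distinct keys.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    last = None  # largest key emitted so far
    while True:
        # Skip keys already emitted in earlier passes.
        unseen = (k for k in scan() if last is None or k > last)
        # nsmallest keeps a bounded heap, so memory stays proportional to budget.
        batch = heapq.nsmallest(budget, unseen)
        yield from batch
        if len(batch) < budget:
            return  # a short batch means the data is exhausted
        last = batch[-1]
```

With seven keys and a budget of 3 this takes three passes over the data. With a budget larger than the data it takes one. A smaller budget costs more rescans (CPU) in exchange for less retained state (memory).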
Thanks for the suggestions, everyone. Keep 'em coming! I'm going to try out
a couple of prototypes and see where they get me.
On Mon, Apr 15, 2013 at 6:37 PM, Dave Marion <[EMAIL PROTECTED]> wrote:
> ---> I have found that increasing the buffer size also increases the
> latency for getting the first results.
> We have found that to be true also; we do the opposite to get to the
> result faster. Of course, we are not performing a local sort first.
> ---> increasing the batch size too much puts significant memory
> requirements on the process running the batch scanner
> Pushing the problem from the client to the server increases the
> complexity. I would be concerned with multiple concurrent scans that are
> saving state. The server side state will compete for tserver application
> memory. I would assume that you would have to build some feature to limit
> the amount of memory that the state can consume.
> -----Original Message-----
> From: Adam Fuchs [mailto:[EMAIL PROTECTED]]
> Sent: Monday, April 15, 2013 6:19 PM
> To: [EMAIL PROTECTED]
> Subject: Re: multi-table isolated batch scanner
> In this case we're filling the buffer before we can amortize the search
> cost. We're using a document-partitioned table design and we have to do a
> local sort before we can get the first result.
> I have found that increasing the buffer size also increases the latency for
> getting the first results. This application is both latency and throughput
> sensitive. In addition, increasing the batch size too much puts significant
> memory requirements on the process running the batch scanner.
> On Mon, Apr 15, 2013 at 5:33 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
> > On Mon, Apr 15, 2013 at 5:06 PM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
> > > Chris,
> > >
> > > The desire for isolation stems from the desire to amortize some
> > > computation over a number of results. Say it takes 5 seconds to compute
> > > an intersection
> >
> > Would increasing the size of the key/value buffer help in your case?
> > The iterator stack is not torn down until that buffer fills up or the
> > end of tablet is reached. Are you concerned about the cost of
> > reconstructing the iterator stack across tablets?
> >
> > > of a couple of sets within the iterators, and then streaming back
> > > the results takes a minute or so. If I have to redo the 5 second
> > > computation many times, as in to support the reconstruction of the
> > > iterator tree, then that computation may start to dominate my query
> > > performance.
> > > Primarily, this means I need to be able to continue a scan without
> > > having to rebuild the iterators. Isolation in the scanner has that
> > > side effect. Proper isolation would be a "nice-to-have", but I can deal
> > > with not having it.
> > >
> > > Adam
> > >
> > >
> > >
> > > On Mon, Apr 15, 2013 at 4:13 PM, Christopher <[EMAIL PROTECTED]> wrote:
> > >
> > >> Adam-
> > >>
> > >> It seems like you're talking about two features at once:
> > >> 1) Multi-table batch scanner.
> > >> 2) Scan Isolation on batch scanners like we have on regular scanners.
> > >> Is that correct?
> > >>
> > >> I can see the utility of a multi-table batch scanner, but I haven't
> > >> seen a compelling need for implementing isolation on the
> > >> batch-scanners. Do you have a use case in mind for that?