-
Dealing with large data sets in client
Bryan Beaudreault 2012-03-27, 21:36
Hello,
I have timeseries data. Most rows have anywhere from 10 to a few thousand columns, but outliers can have a million or more. Each column's qualifier is an integer identifier, and its value is an integer counter.
On the client side, I want to scan from startDate to endDate, add up the total values for each identifier, sort the aggregated values, and return the top X (pagination). We do this using a map, since many identifiers appear in more than one row but not all do. This works fine for the majority of our users, but for the outliers we end up running out of memory.
Since we know the columns are sorted in each row, we could save memory by stepping through the columns of all returned rows together and keeping a list of the top X as we add them up. The problem is that the Scan API does not give us access to the data in this way: you must always get the next row, batch through the columns for that row, and then move on to the next row.
Has anyone dealt with this kind of use case, and is there any way we can implement the above read pattern with the current API, or otherwise step through the data? I imagine it isn't a great idea to create a ton of scans (1 for each row), which is the only way I can think to do the above with what we have.
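The lock-step read described above can be sketched in plain Java, independent of HBase. This is only an illustration under assumptions: each row is modelled as an iterator over cells already sorted by identifier (in HBase this would presumably be one scanner per row), and the names `LockStepTopX`, `Cell`, `Cursor`, `rowOf`, and `topX` are hypothetical. Memory use is proportional to the number of rows plus X, not to the number of distinct identifiers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class LockStepTopX {
    /** One (identifier, counter) pair; stands in for a single column of a row. */
    static final class Cell {
        final int id;
        final long value;
        Cell(int id, long value) { this.id = id; this.value = value; }
    }

    /** Position within one row; the row's cells must arrive sorted by id. */
    static final class Cursor {
        final Iterator<Cell> cells;
        Cell head;
        Cursor(Iterator<Cell> cells) { this.cells = cells; advance(); }
        void advance() { head = cells.hasNext() ? cells.next() : null; }
    }

    /** Hypothetical helper: builds a row from id, value, id, value, ... */
    static Iterator<Cell> rowOf(long... idValuePairs) {
        List<Cell> cells = new ArrayList<>();
        for (int i = 0; i < idValuePairs.length; i += 2) {
            cells.add(new Cell((int) idValuePairs[i], idValuePairs[i + 1]));
        }
        return cells.iterator();
    }

    /** Returns the x largest {id, total} pairs, largest total first. */
    static List<long[]> topX(List<Iterator<Cell>> rows, int x) {
        // Cursors ordered by the identifier each one currently points at.
        PriorityQueue<Cursor> cursors =
            new PriorityQueue<>(Comparator.comparingInt((Cursor c) -> c.head.id));
        for (Iterator<Cell> row : rows) {
            Cursor c = new Cursor(row);
            if (c.head != null) cursors.add(c);
        }
        // Bounded min-heap on total: the smallest of the current top x sits on top.
        PriorityQueue<long[]> best =
            new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[1]));
        while (!cursors.isEmpty()) {
            int id = cursors.peek().head.id;
            long total = 0;
            // Drain every row that currently points at this identifier.
            while (!cursors.isEmpty() && cursors.peek().head.id == id) {
                Cursor c = cursors.poll();
                total += c.head.value;
                c.advance();
                if (c.head != null) cursors.add(c);
            }
            best.add(new long[] {id, total});
            if (best.size() > x) best.poll(); // evict the smallest total
        }
        List<long[]> result = new ArrayList<>(best);
        result.sort((a, b) -> Long.compare(b[1], a[1]));
        return result;
    }

    public static void main(String[] args) {
        List<Iterator<Cell>> rows = new ArrayList<>();
        rows.add(rowOf(1, 5, 3, 2, 7, 10));
        rows.add(rowOf(1, 1, 2, 4, 7, 3));
        rows.add(rowOf(3, 9));
        for (long[] e : topX(rows, 2)) {
            System.out.println("id=" + e[0] + " total=" + e[1]);
        }
    }
}
```

Because every identifier is fully summed before the merge moves on to the next one, each total is final by the time it is offered to the top-X heap, so no map of all identifiers is needed.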
Thanks,
Bryan
-
Re: Dealing with large data sets in client
Stack 2012-03-28, 17:10
On Tue, Mar 27, 2012 at 2:36 PM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:
> I imagine it isn't a great idea to create a ton of scans
> (1 for each row), which is the only way I can think to do the above with
> what we have.
You want to step through some set of rows in lock-step? That is, get first N on row A, then first N on row B, etc., then when that is done, go back and step through next N on A, B, and so on?
(Pardon me if I'm being a bit thick -- it's early here)
I know of no way to do this other than as you suggest -- a scanner per row (not too bad given your rows are wide) or what about a scan to do first N, then a new scan to do next N... would that work?
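One way to read the "first N, then next N" idea is to window on the identifier rather than on a per-row column count, so that every pass sees the same slice of identifiers in all rows. The plain-Java sketch below simulates that. It rests on assumptions: rows are modelled as in-memory `NavigableMap`s, and `WindowedTopX`, `rowOf`, `nextStart`, and `topX` are hypothetical names. In HBase each window would presumably be its own Scan over the date range restricted to a qualifier range (for example with a `ColumnRangeFilter`); that mapping is not shown here, and it also assumes the qualifiers' byte encoding sorts in the same order as the integers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.PriorityQueue;
import java.util.TreeMap;

class WindowedTopX {
    /** Hypothetical helper: builds a row from id, value, id, value, ... */
    static NavigableMap<Integer, Long> rowOf(long... idValuePairs) {
        NavigableMap<Integer, Long> row = new TreeMap<>();
        for (int i = 0; i < idValuePairs.length; i += 2) {
            row.put((int) idValuePairs[i], idValuePairs[i + 1]);
        }
        return row;
    }

    /** Smallest identifier greater than 'after' in any row (or the overall first). */
    static Integer nextStart(List<NavigableMap<Integer, Long>> rows, Integer after) {
        Integer next = null;
        for (NavigableMap<Integer, Long> row : rows) {
            Integer k;
            if (after == null) {
                k = row.isEmpty() ? null : row.firstKey();
            } else {
                k = row.higherKey(after);
            }
            if (k != null && (next == null || k < next)) next = k;
        }
        return next;
    }

    /** Returns the x largest {id, total} pairs, largest total first. */
    static List<long[]> topX(List<NavigableMap<Integer, Long>> rows, int x, int windowSize) {
        if (windowSize < 1) throw new IllegalArgumentException("windowSize must be >= 1");
        // Bounded min-heap on total: the smallest of the current top x sits on top.
        PriorityQueue<long[]> best =
            new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[1]));
        Integer lo = nextStart(rows, null);
        while (lo != null) {
            // Inclusive upper bound of this window, clamped to avoid int overflow.
            int hi = (int) Math.min((long) lo + windowSize - 1, Integer.MAX_VALUE);
            // One pass over all rows; the map never holds more than windowSize ids.
            Map<Integer, Long> totals = new HashMap<>();
            for (NavigableMap<Integer, Long> row : rows) {
                for (Map.Entry<Integer, Long> cell : row.subMap(lo, true, hi, true).entrySet()) {
                    totals.merge(cell.getKey(), cell.getValue(), Long::sum);
                }
            }
            for (Map.Entry<Integer, Long> e : totals.entrySet()) {
                best.add(new long[] {e.getKey(), e.getValue()});
                if (best.size() > x) best.poll(); // evict the smallest total
            }
            lo = nextStart(rows, hi); // skip straight past empty identifier ranges
        }
        List<long[]> result = new ArrayList<>(best);
        result.sort((a, b) -> Long.compare(b[1], a[1]));
        return result;
    }

    public static void main(String[] args) {
        List<NavigableMap<Integer, Long>> rows = new ArrayList<>();
        rows.add(rowOf(1, 5, 3, 2, 7, 10));
        rows.add(rowOf(1, 1, 2, 4, 7, 3));
        rows.add(rowOf(3, 9));
        for (long[] e : topX(rows, 2, 2)) {
            System.out.println("id=" + e[0] + " total=" + e[1]);
        }
    }
}
```

The trade-off is more passes over the same rows in exchange for a map bounded by the window size. A count-based "first N columns per row" window would not line up across rows, since the N-th column of one row need not carry the same identifier as the N-th column of another.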
St.Ack
-
Re: Dealing with large data sets in client
Bryan Beaudreault 2012-03-28, 17:47
Thanks Stack, that's correct. It is kind of hard to describe, though I guess it's easiest to think of it as a 2d array where the 2nd dimension is sorted.
I think your idea would be doable, too. I'm going to test both approaches and see how well they perform. Luckily I'm not TOO concerned about performance for these outliers, as long as having multiple big scanners open at once doesn't degrade performance for other queries as well. I'll update with my findings in case someone else ends up with a similar use case.