-
Scanning the last N rows
Peter Wolf 2012-03-02, 21:02
Hello all,
I want to retrieve the most recent N rows from a table, with some column qualifiers.
I can't find a Filter, or anything obvious in my books, or via Google.
What is the idiom for doing this?
Thanks Peter
-
Re: Scanning the last N rows
Shaneal Manek 2012-03-02, 21:20
Assuming your rowkey doesn't somehow encode the time that row was created (in which case you can simply do a scan), things get a bit more interesting.
The 'easiest' approach is probably to Scan, but use a custom filter that only lets in 'recent' rows based on their timestamp (see the TimestampsFilter for an example of how to do this - it isn't exactly what you need, but should show you how), with the cutoff chosen so that you expect at least N rows to match. Then, if your scan matched at least N rows, you can sort and take the top N client side. If your scan retrieved fewer than N rows, you'll have to go back and do another scan with a different timestamp filter and aggregate/sort the results from the multiple scans.
The more efficient approach might be to create a second table as a 'recency' index. Let's pretend your data table is called 'd'. Then, you'd create a second table called 'dri' (data recency index). Every time you insert a row into 'd' with a rowkey of 'r', you also insert a row into 'dri' with a rowkey of the current timestamp, and only one column (say, called 'dr') with a value of 'r'. Then, when you want to retrieve the last N rows, you can look at the last N rows in the dri table, and GET the rows from the 'd' table with row keys matching the column values in 'dr'. You can automate some of this with coprocessors too.
Of course, the easiest way is to simply make the most significant bits of your rowkey in your actual data be a timestamp, but I don't know if your schema would allow that.
-Shaneal
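The recency-index idea above can be sketched in memory. This is only an illustration under stated assumptions: a `TreeMap` stands in for a table kept sorted by row key, the class and method names (`RecencyIndexSketch`, `put`, `lastN`) are hypothetical, and nothing here is HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for the two-table scheme; not HBase code.
final class RecencyIndexSketch {
    // Data table 'd': row key -> value.
    private final TreeMap<String, String> d = new TreeMap<>();
    // Index table 'dri': insertion timestamp -> row key in 'd' (the 'dr' column).
    private final TreeMap<Long, String> dri = new TreeMap<>();

    // Write the data row and its index row together.
    void put(String rowKey, String value, long timestamp) {
        d.put(rowKey, value);
        dri.put(timestamp, rowKey);
    }

    // Walk the index from newest to oldest and fetch up to n data rows.
    List<String> lastN(int n) {
        List<String> out = new ArrayList<>();
        for (String rowKey : dri.descendingMap().values()) {
            if (out.size() >= n) {
                break;
            }
            out.add(d.get(rowKey));
        }
        return out;
    }

    public static void main(String[] args) {
        RecencyIndexSketch t = new RecencyIndexSketch();
        t.put("r-apple", "first", 100L);
        t.put("r-zebra", "second", 200L);
        t.put("r-mango", "third", 300L);
        System.out.println(t.lastN(2)); // newest rows first
    }
}
```

The sketch ignores two things a real index would have to handle: two inserts with the same timestamp would overwrite each other in 'dri', and rewriting an existing row in 'd' leaves its older index entry behind.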
-
Re: Scanning the last N rows
Peter Wolf 2012-03-02, 21:31
Thanks Shaneal,
My rows are created by customer interaction. Unfortunately, I am not interested in rows from a region of time (i.e. "now" .. "a month ago"). Instead I want the last N interactions.
Let's say I incorporated an interaction count into the key, and I want to get the most recent 1000 rows. I can then do a simple scan with start and stop partial row keys.
But how do I get the interaction count value of the most recent row?
P
-
Re: Scanning the last N rows
Doug Meil 2012-03-02, 21:31
Hi there-
Take a look at this section of the book...
http://hbase.apache.org/book.html#reverse.timestamp
-
Re: Scanning the last N rows
Doug Meil 2012-03-02, 21:32
Reference Guide, I mean. Not "Book." Reference Guide. :-)
-
Re: Scanning the last N rows
Peter Wolf 2012-03-02, 21:42
Ah ha! So the row key orders the results, I just do an unbounded Scan, and stop after N iterations.
Like this...
    Scan scan = new Scan();
    Filter filter = new SingleColumnValueFilter(...);
    scan.setFilter(filter);
    ResultScanner scanner = hTable.getScanner(scan);
    Iterator<Result> it = scanner.iterator();
    for ( int i=0; i<1000 && it.hasNext(); i++) {
        Result result = it.next();
        ... do stuff with result...
    }
Do I have to worry about efficiency? Is the Server madly retrieving rows, in the background, that the Client will never use?
Thanks
P
-
Re: Scanning the last N rows
Ian Varley 2012-03-02, 21:49
Yes, you do have to worry about efficiency. If your rows aren't ordered in the table (by rowkey) according to the update date, the server will have to scan the entire table. Your filter will keep it from sending all of those results to the client, but it still has to read them from disk and merge them with the rows in memory. That will likely not even be feasible for a big table (and, if it's not a *big* table, it probably shouldn't be in HBase).
The fundamental thing to note here is that there's no "magic": HBase stores records sorted in exactly one order; if what you want can't be found efficiently according to that ordering, then you'll be scanning the whole table. Relational DBs do that too, but they also have indexes that let you get at things quickly in some other sort order.
Ian
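The cost difference described above can be made concrete with a small in-memory model. This is a sketch under stated assumptions, not HBase code: a `TreeMap` plays the sorted table, each value is a made-up "last updated" number, and `filterScan`, `rangeScan` and `rowsExamined` are hypothetical names used only to count how many rows each style of lookup touches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ScanCostSketch {
    // Rows read by the most recent call, to make the cost visible.
    static int rowsExamined;

    // Filter-style lookup: every row is read and tested, whatever the key order.
    static List<String> filterScan(TreeMap<String, Long> table, long minUpdated) {
        rowsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Long> row : table.entrySet()) {
            rowsExamined++;
            if (row.getValue() >= minUpdated) {
                hits.add(row.getKey());
            }
        }
        return hits;
    }

    // Key-ordered lookup: seek to startKey, read forward, stop after n rows.
    static List<String> rangeScan(TreeMap<String, Long> table, String startKey, int n) {
        rowsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.tailMap(startKey, true).keySet()) {
            if (hits.size() >= n) {
                break;
            }
            rowsExamined++;
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row%04d", i), (long) i);
        }
        filterScan(table, 998L);
        System.out.println("filter scan examined " + rowsExamined + " rows");
        rangeScan(table, "row0998", 2);
        System.out.println("range scan examined " + rowsExamined + " rows");
    }
}
```

Both calls return the same two rows here, but the filter version reads all 1000 rows to find them while the key-ordered version reads only the rows it returns.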
-
Re: Scanning the last N rows
Peter Wolf 2012-03-02, 21:59
Sorry, my code was a little off. It should have been
    Scan scan = new Scan(calculateStartRowKey(targetAccount), calculateEndRowKey(targetAccount));
Where my row key is formed from <account><reverse timestamp>
So, the scanner would match all the rows for this account, and return them most recent first.
    Iterator<Result> it = scanner.iterator();
But if I stop doing this...
    Result result = it.next();
Will that be efficient? Will the scanner potentially matching all rows for the account be a problem?
P
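One way the <account><reverse timestamp> key and its start/stop bounds might be built is sketched below, using only the Java standard library. The assumptions are loud ones: account IDs are fixed-width strings (otherwise one account's range can swallow another's), timestamps are non-negative milliseconds, the reverse timestamp is `Long.MAX_VALUE - timestamp`, and the stop key is treated as exclusive. The names `rowKey`, `startRowKey`, `stopRowKey` and `compare` are hypothetical and are not the helpers from the message above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ReverseTimestampKeySketch {
    // Key layout: <account bytes><8-byte big-endian (Long.MAX_VALUE - timestamp)>.
    static byte[] rowKey(String account, long timestampMillis) {
        return build(account, Long.MAX_VALUE - timestampMillis);
    }

    // Smallest key for the account: a reverse timestamp of zero.
    static byte[] startRowKey(String account) {
        return build(account, 0L);
    }

    // Exclusive upper bound: suffix 0x80 00 ... 00 sorts after every
    // non-negative reverse timestamp, whose first byte is at most 0x7F.
    static byte[] stopRowKey(String account) {
        return build(account, Long.MIN_VALUE);
    }

    private static byte[] build(String account, long suffix) {
        byte[] acct = account.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(acct.length + Long.BYTES);
        buf.put(acct);
        buf.putLong(suffix); // ByteBuffer is big-endian by default
        return buf.array();
    }

    // Unsigned lexicographic byte comparison, the order a sorted key store would use.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] newer = rowKey("acct0001", 2000L);
        byte[] older = rowKey("acct0001", 1000L);
        System.out.println("newer sorts first: " + (compare(newer, older) < 0));
        System.out.println("older is before stop key: "
                + (compare(older, stopRowKey("acct0001")) < 0));
    }
}
```

With keys built this way, a forward scan from the start key returns an account's rows newest first, so reading N rows and then stopping touches only the front of that account's range.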
-
Re: Scanning the last N rows
Doug Meil 2012-03-02, 22:00
One other thing, for Scans read the part about Scan-caching.
http://hbase.apache.org/book.html#perf.reading
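As a rough illustration of why scan caching matters when the client stops after N rows, the arithmetic can be sketched as below. This is a simplified model, not HBase code: it assumes each round trip to the server returns up to `caching` rows, that the scanned range holds at least that many rows, and it ignores any other limits a real client applies. The names `roundTrips` and `overFetched` are hypothetical.

```java
final class ScanCachingSketch {
    // Round trips needed to hand n rows to the caller when each
    // fetch returns up to 'caching' rows (ceiling division).
    static long roundTrips(long n, int caching) {
        return (n + caching - 1) / caching;
    }

    // Rows transferred but never consumed if the caller stops after n rows.
    static long overFetched(long n, int caching) {
        return roundTrips(n, caching) * caching - n;
    }

    public static void main(String[] args) {
        long n = 1000;
        for (int caching : new int[] {1, 100, 300}) {
            System.out.println("caching=" + caching
                    + " roundTrips=" + roundTrips(n, caching)
                    + " overFetched=" + overFetched(n, caching));
        }
    }
}
```

Under this model a larger caching value trades fewer round trips for a bounded number of rows fetched past the Nth, at most one batch minus one row.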