-
how to use CountingIterator to count records?
Hunter Provyn 2012-06-06, 17:39
I want to know the number of records a scanner has without actually getting the records from Cloudbase. I've been looking at CountingIterator (1.3.4), which has a getCount() method. However, I don't know how to access the instance to call getCount() on it, because the Cloudbase server just passes back the entries and doesn't expose the instance of the iterator.
It is possible to use an AggregatingIterator to aggregate all entries into a single entry whose value is the number of entries. But I was wondering if there was a better way that possibly makes use of the CountingIterator class.
-
Re: how to use CountingIterator to count records?
William Slacum 2012-06-06, 17:46
You're kind of there. Essentially, you can think of your Scanner's interactions with the TServers as a tree with a height of two. Your Scanner is the "root" and its children are all of the TServers it needs to interact with. The operation you want is to sum the number of records each of the children has.
In Accumulo terms, you can use something like a CountingIterator to count the number of results on each TServer. You can then sum all of those intermediate results to get a total count of results.
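The client-side half of that idea can be sketched in plain Java. This is only an illustration and does not use the Accumulo client API: each Map.Entry stands in for one key/value pair returned by a server-side counting iterator, and the assumption (not stated in this thread) is that the value holds a partial count encoded as a decimal string. The class and method names are made up for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: entries stand in for the partial-count results a scan would return.
class PartialCountSum {

    // Sum the partial counts carried in the entry values.
    // Assumes each value is a decimal string such as "457".
    static long sumPartialCounts(Iterable<Map.Entry<String, String>> results) {
        long total = 0;
        for (Map.Entry<String, String> entry : results) {
            total += Long.parseLong(entry.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical results: one entry per batch, keyed by the last row counted.
        List<Map.Entry<String, String>> results = new ArrayList<>();
        results.add(new SimpleEntry<>("row_0457", "457"));
        results.add(new SimpleEntry<>("row_0912", "455"));
        results.add(new SimpleEntry<>("row_1500", "588"));
        System.out.println("total = " + sumPartialCounts(results));
    }
}
```

With a real scanner, the loop would iterate over the scanner's results instead of a list; the summing logic stays the same.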
-
Re: how to use CountingIterator to count records?
Keith Turner 2012-06-06, 17:56
The CountingIterator is not intended for end users. It's used internally to count how many key values major compactions read. We documented this better in 1.4 by putting it in an iterators.system package.
You could write a wrapping iterator that does this counting. It would count until its source iterator had no more values. Then it would return the last key from the source with a count. It's important that you return the last key for the continuance case. After you write this iterator, you can follow Bill's advice to pull everything together on the client side.
Keith
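The wrapping idea described above can be sketched roughly as follows. This is a plain-Java stand-in built on java.util.Iterator, not the Accumulo iterator interface, and the class name CountingWrapper is hypothetical. It drains its source, counts the entries, and emits a single entry made of the last key seen plus the count. Returning the last key matters, as far as I understand it, because a scan that resumes after the returned key then continues past the data already counted.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Plain-Java stand-in for a wrapping iterator; NOT the Accumulo iterator API.
class CountingWrapper<K, V> implements Iterator<Map.Entry<K, Long>> {
    private final Iterator<Map.Entry<K, V>> source;
    private boolean emitted = false;

    CountingWrapper(Iterator<Map.Entry<K, V>> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // A summary entry exists only if the source still has data to count.
        return !emitted && source.hasNext();
    }

    @Override
    public Map.Entry<K, Long> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        K lastKey = null;
        long count = 0;
        // Drain the source, remembering the last key and counting entries.
        while (source.hasNext()) {
            lastKey = source.next().getKey();
            count++;
        }
        emitted = true;
        return new SimpleEntry<K, Long>(lastKey, count);
    }

    public static void main(String[] args) {
        TreeMap<String, String> data = new TreeMap<>();
        data.put("row1", "a");
        data.put("row2", "b");
        data.put("row3", "c");
        CountingWrapper<String, String> it =
                new CountingWrapper<>(data.entrySet().iterator());
        Map.Entry<String, Long> summary = it.next();
        System.out.println(summary.getKey() + " -> " + summary.getValue());
    }
}
```

A real server-side iterator would also have to handle seeking and option configuration, which this sketch leaves out.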
-
Re: how to use CountingIterator to count records?
Keith Turner 2012-06-06, 18:02
On Wed, Jun 6, 2012 at 1:46 PM, William Slacum <[EMAIL PROTECTED]> wrote:
> You're kind of there. Essentially, you can think of your Scanner's
> interactions with the TServers as a tree with a height of two.

One comment to add. The Scanner will do this work serially, one tablet server at a time. The batch scanner would execute the iterator in parallel on multiple tablet servers at a time.
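The difference between the two access patterns can be illustrated with a small simulation. This is only an analogy in plain Java: each inner list stands in for the entries held by one tablet server, and a thread pool stands in for the parallel execution a batch scanner would give you. None of the names below come from the Accumulo API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simulation only: inner lists play the role of tablet servers.
class SerialVsParallelCount {

    // Visit one "server" at a time, as a plain Scanner would.
    static long countSerially(List<List<String>> servers) {
        long total = 0;
        for (List<String> server : servers) {
            total += server.size();
        }
        return total;
    }

    // Count on several "servers" at once, then sum the partial results.
    static long countInParallel(List<List<String>> servers, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (final List<String> server : servers) {
                Callable<Long> task = () -> (long) server.size();
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("counting failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> servers = Arrays.asList(
                Arrays.asList("r1", "r2"),
                Arrays.asList("r3"),
                Arrays.asList("r4", "r5", "r6"));
        System.out.println("serial   = " + countSerially(servers));
        System.out.println("parallel = " + countInParallel(servers, 3));
    }
}
```

Both methods return the same total; only the order in which the partial counts are produced differs.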
-
RE: how to use CountingIterator to count records?
Bob.Thorman@... 2012-06-07, 12:55
Hunter
If you have access to the ingest of this data, have you considered implementing an Edge Table to keep the count based on a document partition index (or similar aggregate key)? I have to keep up with the same statistic and have moved to the Edge Table approach for a direct look up of occurrences.
-
Re: how to use CountingIterator to count records?
David Medinets 2012-06-07, 14:00
Can you describe the Edge Table approach or provide a reference?
On Thu, Jun 7, 2012 at 8:55 AM, <[EMAIL PROTECTED]> wrote: > have moved to the Edge Table approach for a direct look up of occurrences.
-
RE: how to use CountingIterator to count records?
Bob.Thorman@... 2012-06-07, 15:25
It's an adaptation of a feature table where the weight is the number of occurrences found during ingest. The row IDs are features that are relevant to my queries/row counts (e.g. timespan, geo-space, document partition id, keywords, etc.)
Example:
ROWID     FAM       QUAL       VIS     VALUE
=====     ===       ====       ===     =====
White     KEYWORD   OTHER      public  123
14SU      GEO       MGRS       public  456
9223      TIMESPAN  EPOC       public  7890
DOCPART1  DOCUMENT  PARTITION  public  1234567

One tablet server will know how many rows exist across the cluster for any ROWID. So I can quickly determine how many rows exist in all my tablet servers with one simple scan.
Obviously you have to count them all on ingest and update the edge table.
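The bookkeeping behind this approach can be sketched with an in-memory map standing in for the edge table. This is an illustration only: the class and method names are hypothetical, and a real deployment would write the counts to a table during ingest rather than to a HashMap.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for an edge table: one counter per feature row ID.
class EdgeCounts {
    private final Map<String, Long> counts = new HashMap<>();

    // Call once per ingested record for each feature that record carries.
    void recordOccurrence(String featureRowId) {
        counts.merge(featureRowId, 1L, Long::sum);
    }

    // Direct lookup; the underlying records are never scanned.
    long occurrences(String featureRowId) {
        return counts.getOrDefault(featureRowId, 0L);
    }

    public static void main(String[] args) {
        EdgeCounts edges = new EdgeCounts();
        // Hypothetical ingest of three records in the same document partition.
        edges.recordOccurrence("DOCPART1");
        edges.recordOccurrence("DOCPART1");
        edges.recordOccurrence("DOCPART1");
        edges.recordOccurrence("14SU");
        System.out.println("DOCPART1 = " + edges.occurrences("DOCPART1"));
    }
}
```

The trade-off is extra work at ingest time in exchange for a count that can be read with a single lookup.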