The count displayed in the monitor is the sum of all the key/value
pairs in the files that back Accumulo. You can also get this count
by scanning the !METADATA table and looking at the values associated
with keys in the "file" column family. Inserting the same key twice could
result in one key in one file or two keys in two files. At query time,
those keys are deduplicated by the VersioningIterator, providing a
view that contains only one copy of the key.
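For example, here is a sketch of checking those per-file counts from the
Accumulo shell (run as a user with access to the metadata table; the exact
value format can vary by version, but it generally encodes file size and
entry count):

```
root@myinstance> table !METADATA
root@myinstance !METADATA> scan -c file
```

Summing the entry counts across those values should roughly match the number
the monitor reports.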
45x seems really high, since a tablet tends to have an average of maybe 4-8
files associated with it at the billion-entry scale (rough estimate). There
could be other factors at play: cell-level security may be eliminating
entries from the view the scanner gives you, or major compactions may not
be running properly. Also note that the monitor counts key/value pairs, not
rows, so if each row contains multiple columns the entry count will naturally
exceed a count of unique row ids. Your backing data could also include a
large number of deletes, which would throw off the stats. Deletes are
implemented as tombstone markers and are only eliminated when a full major
compaction happens. Forcing a major compaction by running the compact
command in the shell should give you better evidence to diagnose the
discrepancy.
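A sketch of forcing that compaction from the shell (the table name here is
a placeholder; -w waits for the compaction to finish):

```
root@myinstance> compact -t mytable -w
```

After it completes, the monitor's entry count should reflect the deduplicated,
tombstone-free data.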
On Wed, Oct 2, 2013 at 4:15 PM, Mastergeek <[EMAIL PROTECTED]> wrote:
> I have an interesting dilemma wherein my Accumulo cluster overview says
> I have over 1.4 billion entries within the table, and yet when I run a scan
> where I keep track of unique row ids, I get back a number (a little over
> 30 million) that is drastically less than what the table claims to have.
> I read the legend and it says, "Entries: Key/value pairs over each
> instance, table or tablet." I was under the impression that Accumulo tables
> did away with duplicate rows, hence my curiosity as to why there are
> apparently 45 times more entries than there should be. Do I need to perform
> a compaction or some other action to rid my cluster of what I believe to be
> duplicate entries?
> Sent from the Developers mailing list archive at Nabble.com.