Accumulo >> mail # user >> Filter storing state


Filter storing state
Hey Guys,

In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
FilteringIterator that would allow us to drop in several keys/values
associated with a UUID (similar to a document id). The UUID was further
associated with an "index" (or type). The purpose of the TopN table was to
keep the keys/values separated so that they could still be queried back
with cell-level tagging, but when I performed a query for an index, I would
get the last N UUIDs and further be able to query the keys/values for each
of those UUIDs.

This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
provide 2 FilteringIterators for compaction time to perform data cleanup of
the table so that any keys/values kept around were guaranteed to be inside
of the range of those keys being managed by the versioning iterator.

Just to recap, I have the following table structure. I also hash the
keys/values and run a filter before the versioning iterator to clean up any
duplicates. There are two types of columns: index & key/value.

Index:

R: index (or "type" of data)
F: '\x00index'
Q: empty
V: uuid\x00hashOfKeys&Values

Key/Value:

R: index (or "type" of data)
F: uuid
Q: key\x00value
V: empty

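To make the layout concrete, here is a minimal sketch in plain Java. It does not use the Accumulo API: TopNKey and its methods are hypothetical names made up for illustration, timestamps and visibilities are left out, and Java String ordering stands in for byte ordering (the two agree for ASCII data). It only shows that the '\x00index' family sorts ahead of any uuid family within the same row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a key in the Top N table: only row, column
// family and column qualifier are modeled (no timestamp, no visibility).
final class TopNKey implements Comparable<TopNKey> {
    static final String INDEX_FAMILY = "\u0000index";

    final String row;
    final String family;
    final String qualifier;

    TopNKey(String row, String family, String qualifier) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
    }

    // Index entry: R = index, F = '\x00index', Q = empty.
    static TopNKey indexEntry(String index) {
        return new TopNKey(index, INDEX_FAMILY, "");
    }

    // Key/value entry: R = index, F = uuid, Q = key\x00value.
    static TopNKey keyValueEntry(String index, String uuid, String key, String value) {
        return new TopNKey(index, uuid, key + "\u0000" + value);
    }

    boolean isIndexEntry() {
        return INDEX_FAMILY.equals(family);
    }

    // Compare row, then family, then qualifier.
    @Override
    public int compareTo(TopNKey other) {
        int c = row.compareTo(other.row);
        if (c == 0) c = family.compareTo(other.family);
        if (c == 0) c = qualifier.compareTo(other.qualifier);
        return c;
    }

    public static void main(String[] args) {
        List<TopNKey> keys = new ArrayList<>();
        keys.add(keyValueEntry("events", "uuid-1", "color", "red"));
        keys.add(indexEntry("events"));
        keys.add(keyValueEntry("events", "0a-uuid", "size", "large"));
        Collections.sort(keys);
        for (TopNKey k : keys) {
            // Print NUL bytes as \x00 so the output is readable.
            System.out.println(k.row + " "
                + k.family.replace("\u0000", "\\x00") + " "
                + k.qualifier.replace("\u0000", "\\x00"));
        }
    }
}
```

Because the index family starts with a NUL character, the index entry for a row comes back first, which is what the filter described next relies on.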
The filtering iterator that makes sure every key/value row has a matching
index entry manages a hash set internally. The index rows are purposefully
sorted before the key/value rows (the '\x00index' column family sorts ahead
of any uuid) so that the filter can build up the hash set containing the
uuids in the index. As the filter iterates into the key/value rows, it
returns true only if the uuid of the key/value exists in that hash set.
This worked with older versions of Accumulo, but I'm now getting a weird
artifact where init() is called on my Filter in the middle of iterating
through an index row.

More specifically, the Filter will iterate through the index rows of a
specific "index" and build up a hash set, then init() will be called, which
wipes away the hash set of uuids, and then the filter goes on to iterate
through the key/value rows. Keep in mind, we are talking about maybe 400k
entries, not enough to have more than one tablet.
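The behavior can be modeled with a small, dependency-free sketch. This is not my actual Filter and it does not use the Accumulo iterator API: IndexedUuidFilter, run and the reinitAt parameter are hypothetical names, and entries are reduced to {family, value} pairs within a single row. It only reproduces the symptom (state lost when init() arrives mid-stream), not whatever causes the extra init() call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the stateful filter described above.
final class IndexedUuidFilter {
    static final String INDEX_FAMILY = "\u0000index";

    // uuids seen in index entries since the last init().
    private final Set<String> indexedUuids = new HashSet<>();

    // Models init(): all remembered uuids are discarded.
    void init() {
        indexedUuids.clear();
    }

    // Models accept(): entries are expected in sorted order, index first.
    boolean accept(String family, String value) {
        if (INDEX_FAMILY.equals(family)) {
            // Index value is uuid\x00hash; remember the uuid part.
            int sep = value.indexOf('\u0000');
            indexedUuids.add(sep < 0 ? value : value.substring(0, sep));
            return true;
        }
        // Key/value entry: the column family is the uuid.
        return indexedUuids.contains(family);
    }

    // Feeds entries through the filter and returns the families that were
    // kept. If reinitAt >= 0, init() is called again just before that
    // position to mimic the unexpected mid-stream call.
    static List<String> run(String[][] entries, int reinitAt) {
        IndexedUuidFilter filter = new IndexedUuidFilter();
        filter.init();
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < entries.length; i++) {
            if (i == reinitAt) {
                filter.init();
            }
            if (filter.accept(entries[i][0], entries[i][1])) {
                kept.add(entries[i][0]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[][] entries = {
            {INDEX_FAMILY, "u1\u0000h1"},
            {INDEX_FAMILY, "u2\u0000h2"},
            {"u1", ""},
            {"u2", ""},
            {"u3", ""},  // not in the index, should always be dropped
        };
        System.out.println("kept without re-init: " + run(entries, -1).size());
        System.out.println("kept with re-init at 2: " + run(entries, 2).size());
    }
}
```

With no extra init(), the u1 and u2 key/value entries survive and u3 is dropped. With init() injected after the index entries, the hash set is empty by the time the key/value entries arrive, so all of them are filtered out, which matches what I am seeing.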

Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
know it has to be a huge no-no to store state inside of a filter, but I
haven't had any issues until trying to update my code for the new version.
If I'm doing this completely wrong, any ideas on how to make this better?
Thanks!
--
Corey Nolet
Senior Software Engineer
TexelTek, inc.
[Office] 301.880.7123
[Cell] 410-903-2110