Accumulo, mail # user - Filter storing state


Filter storing state
Corey Nolet 2013-01-03, 22:41
Hey Guys,

In Accumulo 1.3.5, I wrote a "Top N" table structure, supporting services,
and a FilteringIterator that let us drop in several keys/values associated
with a UUID (similar to a document ID). Each UUID was in turn associated
with an "index" (or type). The purpose of the TopN table was to keep the
keys/values separate so that they could still be queried back with
cell-level tagging. When I queried an index, I would get the last N UUIDs
and could then query the keys/values for each of those UUIDs.

This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
configure two FilteringIterators to run at compaction time and clean up the
table, so that any keys/values kept around were guaranteed to fall inside
the range of keys being managed by the versioning iterator.

Just to recap, I have the following table structure. I also hash the
keys/values and run a filter before the versioning iterator to clean up any
duplicates. There are two types of columns: index & key/value.
Index:

R: index (or "type" of data)
F: '\x00index'
Q: empty
V: uuid\x00hashOfKeys&Values

Key/Value:

R: index (or "type" of data)
F: uuid
Q: key\x00value
V: empty

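
To make the layout above concrete, here is a minimal sketch in plain Java of
how the two column types could be built. The Entry holder and the helper
names (indexEntry, keyValueEntry, uuidOfIndexEntry) are hypothetical and are
not Accumulo classes; the sample values are made up. It also shows why the
'\x00index' family sorts ahead of any uuid family within a row.

```java
class TopNLayout {
    // Column family used for index entries; the leading NUL byte makes it
    // sort before any printable uuid family within the same row.
    static final String INDEX_FAMILY = "\0index";

    // Hypothetical holder for the four parts of an entry (R, F, Q, V).
    static final class Entry {
        final String row, family, qualifier, value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Index entry: R=index, F='\x00index', Q=empty, V=uuid\x00hash
    static Entry indexEntry(String index, String uuid, String hash) {
        return new Entry(index, INDEX_FAMILY, "", uuid + "\0" + hash);
    }

    // Key/value entry: R=index, F=uuid, Q=key\x00value, V=empty
    static Entry keyValueEntry(String index, String uuid, String key, String value) {
        return new Entry(index, uuid, key + "\0" + value, "");
    }

    // Extracts the uuid from the value of an index entry.
    static String uuidOfIndexEntry(Entry e) {
        return e.value.substring(0, e.value.indexOf('\0'));
    }

    public static void main(String[] args) {
        Entry idx = indexEntry("tweets", "uuid-1", "abc123");
        Entry kv = keyValueEntry("tweets", "uuid-1", "color", "red");
        System.out.println(uuidOfIndexEntry(idx));              // uuid-1
        System.out.println(idx.family.compareTo(kv.family) < 0); // true
    }
}
```

Because the index family compares lower than any uuid, a scan over one row
sees every index entry before the first key/value entry.
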
The filtering iterator that makes sure any key/value rows are in the index
manages a hashset internally. The index rows deliberately sort before the
key/value rows so that the filter can build up the hashset containing the
uuids in the index. As the filter iterates into the key/value rows, it
returns true only if the uuid of the key/value exists in that hashset. This
worked with older versions of Accumulo, but I'm now getting a weird artifact
where init() is called on my Filter in the middle of iterating through an
index row.

More specifically, the Filter will iterate through the index rows of a
specific "index" and build up a hashset, then init() will be called, which
wipes away the hashset of uuids, and then the filter goes on to iterate
through the key/value rows. Keep in mind, we are talking about maybe 400k
entries, not enough to span more than one tablet.
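
The behavior described above can be reproduced outside Accumulo with a small
simulation. This is only a sketch: UuidFilter is a hypothetical stand-in that
mimics the init()/accept() shape of my filter without extending any Accumulo
class, and the resetAt parameter exists purely to inject an init() call at a
chosen point in the scan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StatefulFilterSketch {
    static final String INDEX_FAMILY = "\0index";

    // Hypothetical stand-in for the stateful filter (not an Accumulo class).
    static final class UuidFilter {
        private Set<String> indexedUuids = new HashSet<>();

        // Resets all state, as happens when init() is called again.
        void init() {
            indexedUuids = new HashSet<>();
        }

        // Index entries are always kept and their uuid is remembered;
        // key/value entries are kept only if their uuid was seen earlier.
        boolean accept(String family, String value) {
            if (INDEX_FAMILY.equals(family)) {
                int sep = value.indexOf('\0');
                indexedUuids.add(sep < 0 ? value : value.substring(0, sep));
                return true;
            }
            return indexedUuids.contains(family);
        }
    }

    // Entries as {family, qualifier, value}, already in sorted order.
    static List<String[]> sample() {
        return Arrays.asList(
            new String[] {INDEX_FAMILY, "", "uuid-1\0h1"},
            new String[] {INDEX_FAMILY, "", "uuid-2\0h2"},
            new String[] {"uuid-1", "color\0red", ""},
            new String[] {"uuid-2", "color\0blue", ""},
            new String[] {"uuid-9", "color\0green", ""}); // not in the index
    }

    // Counts surviving key/value entries. init() is called once up front and
    // again just before position resetAt (use -1 for no mid-scan reset).
    static int keptKeyValues(List<String[]> entries, int resetAt) {
        UuidFilter filter = new UuidFilter();
        filter.init();
        int kept = 0;
        for (int i = 0; i < entries.size(); i++) {
            if (i == resetAt) {
                filter.init();
            }
            String[] e = entries.get(i);
            boolean accepted = filter.accept(e[0], e[2]);
            if (accepted && !INDEX_FAMILY.equals(e[0])) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keptKeyValues(sample(), -1)); // 2
        System.out.println(keptKeyValues(sample(), 2));  // 0
        System.out.println(keptKeyValues(sample(), 1));  // 1
    }
}
```

With no mid-scan reset, only the orphaned uuid-9 entry is dropped. A reset
after the index entries drops every key/value entry, and a reset partway
through the index entries drops the ones whose uuid was collected before the
reset, which matches the data loss I am seeing.
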

Any idea why this may have worked on 1.3.5 but no longer works? I know it
has got to be a huge no-no to store state inside of a filter, but I hadn't
had any issues until I tried to update my code for the new version. If I'm
doing this completely wrong, any ideas on how to make this better?
Thanks!
--
Corey Nolet
Senior Software Engineer
TexelTek, inc.
[Office] 301.880.7123
[Cell] 410-903-2110