Accumulo >> mail # user >> Filter storing state


Corey Nolet 2013-01-03, 22:41
Keith Turner 2013-01-03, 22:54
Re: Filter storing state
It's funny that you bring that up, because I was just discussing this as a possibility with a coworker. Compaction is really the phase that I'm concerned with, as the API for loading the data from the TopN currently only allows you to load the last N keys/values for a single index at a time.

Can I guarantee that compaction will pass each row through a single filter?
On Jan 3, 2013, at 5:54 PM, Keith Turner wrote:

> Data is read from the iterators into a buffer.  When the buffer fills
> up, the data is sent to the client and the iterators are reinitialized
> to fill up the next buffer.
>
> The default buffer size was changed from 50M to 1M at some point.
> This is configured via the property table.scan.max.memory
>
> The lower buffer size will cause the iterators to be reinitialized more
> frequently.  Maybe you are seeing this.
>
> Keith
>
> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <[EMAIL PROTECTED]> wrote:
>> Hey Guys,
>>
>> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
>> FilteringIterator that would allow us to drop in several keys/values
>> associated with a UUID (similar to a document id). The UUID was further
>> associated with an "index" (or type). The purpose of the TopN table was to
>> keep the keys/values separated so that they could still be queried back with
>> cell-level tagging, but when I performed a query for an index, I would get
>> the last N UUIDs and further be able to query the keys/values for each of
>> those UUIDs.
>>
>> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
>> provide 2 FilteringIterators for compaction time to perform data cleanup of
>> the table so that any keys/values kept around were guaranteed to be inside
>> of the range of those keys being managed by the versioning iterator.
>>
>> Just to recap, I have the following table structure. I also hash the
>> keys/values and run a filter before the versioning iterator to clean up any
>> duplicates. There are two types of columns: index & key/value.
>>
>>
>> Index:
>>
>> R: index (or "type" of data)
>> F: '\x00index'
>> Q: empty
>> V: uuid\x00hashOfKeys&Values
>>
>>
>> Key/Value:
>>
>> R: index (or "type" of data)
>> F: uuid
>> Q: key\x00value
>> V: empty
>>
>>
>> The filtering iterator that makes sure any key/value rows are in the index
>> manages a hashset internally. The index rows are purposefully indexed before
>> the key/value rows so that the filter can build up the hashset containing
>> those uuids in the index. As the filter iterates into the key/value rows, it
>> will return true only if the uuid of the key/value exists inside of the
>> hashset containing the uuids in the index. This worked with older versions
>> of Accumulo, but I'm now getting a weird artifact where init() is called on
>> my Filter in the middle of iterating through an index row.
>>
>> More specifically, the Filter will iterate through the index rows of a
>> specific "index" and build up a hashset, then init() will be called which
>> wipes away the hashset of uuids, then the filter goes on to iterate through
>> the key/value rows. Keep in mind, we are talking about maybe 400k entries,
>> not enough to have more than 1 tablet.
>>
>> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
>> know it has got to be a huge no-no to be storing state inside of a filter,
>> but I haven't had any issues until trying to update my code for the new
>> version. If I'm doing this completely wrong, any ideas on how to make this
>> better?
>>
>>
>> Thanks!
>>
>>
>> --
>> Corey Nolet
>> Senior Software Engineer
>> TexelTek, inc.
>> [Office] 301.880.7123
>> [Cell] 410-903-2110
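
To make the behavior discussed in this thread concrete, here is a minimal sketch in plain Java. It does not use the Accumulo API: StatefulUuidFilter, ScanSimulation, sampleEntries and the batchSize parameter are hypothetical names invented for illustration, and the scan buffer is assumed to fill after a fixed number of returned entries rather than a number of bytes. It only shows how a filter that remembers uuids from the index entries loses them if it is reinitialized before the key/value entries arrive.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the stateful filter; NOT the Accumulo Filter class.
class StatefulUuidFilter {
    private final Set<String> indexedUuids = new HashSet<>();

    // Models the effect of init(): all remembered uuids are discarded.
    void init() {
        indexedUuids.clear();
    }

    // family is the column family, value is the cell value (simplified layout).
    boolean accept(String family, String value) {
        if (family.equals("\0index")) {
            // Index entry: the value is uuid\x00hash, so remember the uuid part.
            int sep = value.indexOf('\0');
            indexedUuids.add(sep >= 0 ? value.substring(0, sep) : value);
            return true;
        }
        // Key/value entry: the family is the uuid; keep it only if it was indexed.
        return indexedUuids.contains(family);
    }
}

class ScanSimulation {
    // Feeds sorted entries through the filter and reinitializes the filter
    // each time batchSize entries have been returned (assumed buffer model).
    static List<String[]> scan(List<String[]> entries, int batchSize) {
        StatefulUuidFilter filter = new StatefulUuidFilter();
        filter.init();
        List<String[]> returned = new ArrayList<>();
        int buffered = 0;
        for (String[] entry : entries) {
            if (buffered == batchSize) {
                filter.init(); // buffer was sent to the client; state is lost here
                buffered = 0;
            }
            if (filter.accept(entry[0], entry[1])) {
                returned.add(entry);
                buffered++;
            }
        }
        return returned;
    }

    // Two index entries (uuids a and b), then key/value entries for a, b and c.
    static List<String[]> sampleEntries() {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"\0index", "a\0h1"});
        entries.add(new String[] {"\0index", "b\0h2"});
        entries.add(new String[] {"a", ""});
        entries.add(new String[] {"a", ""});
        entries.add(new String[] {"b", ""});
        entries.add(new String[] {"c", ""}); // not in the index; should be dropped
        return entries;
    }
}
```

With a large batchSize the filter sees the index entries and the key/value entries in one pass, so only the entry for uuid c is dropped. With a batchSize of 2 the filter is reinitialized right after the two index entries, and every key/value entry is then rejected, which matches the symptom described in the original message.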
Keith Turner 2013-01-03, 23:10
Corey Nolet 2013-01-03, 23:48
John Vines 2013-01-03, 22:53
Corey Nolet 2013-01-03, 23:04