
Accumulo >> mail # user >> Filter storing state


Re: Filter storing state
That's funny you bring that up, because I was JUST discussing this as a possibility with a coworker. Compaction is really the phase I'm concerned with, since the API for loading data from the TopN table currently only allows you to load the last N keys/values for a single index at a time.

Can I guarantee that compaction will pass each row through a single filter?
On Jan 3, 2013, at 5:54 PM, Keith Turner wrote:

> Data is read from the iterators into a buffer.  When the buffer fills
> up, the data is sent to the client and the iterators are reinitialized
> to fill up the next buffer.
>
> The default buffer size was changed from 50M to 1M at some point.
> This is configured via the property table.scan.max.memory.
>
> The lower buffer size will cause the iterators to be reinitialized more
> frequently.  Maybe that is what you are seeing.
>
> Keith
>
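The property Keith mentions can be changed per table from the Accumulo shell; the table name and value below are only illustrative:

```
# Accumulo shell (table name is hypothetical); restores the older, larger buffer
config -t topn_table -s table.scan.max.memory=50M
```

A larger scan buffer means fewer client-side flushes, and therefore fewer iterator reinitializations, at the cost of more tablet server memory per scan.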
> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <[EMAIL PROTECTED]> wrote:
>> Hey Guys,
>>
>> In Accumulo 1.3.5, I wrote a "Top N" table structure, services, and a
>> FilteringIterator that would allow us to drop in several keys/values
>> associated with a UUID (similar to a document id). The UUID was further
>> associated with an "index" (or type). The purpose of the TopN table was to
>> keep the keys/values separated so that they could still be queried back with
>> cell-level tagging; when I performed a query for an index, I would get the
>> last N UUIDs and could then query the keys/values for each of those UUIDs.
>>
>> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
>> provide two FilteringIterators at compaction time to clean up the table, so
>> that any keys/values kept around were guaranteed to fall within the range of
>> keys managed by the versioning iterator.
>>
>> Just to recap, I have the following table structure. I also hash the
>> keys/values and run a filter before the versioning iterator to clean up any
>> duplicates. There are two types of columns: index and key/value.
>>
>>
>> Index:
>>
>> R: index (or "type" of data)
>> F: '\x00index'
>> Q: empty
>> V: uuid\x00hashOfKeys&Values
>>
>>
>> Key/Value:
>>
>> R: index (or "type" of data)
>> F: uuid
>> Q: key\x00value
>> V: empty
>>
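A quick way to see why the index columns come first: the family '\x00index' begins with a null byte, so under byte-wise key ordering it compares less than any uuid family. A minimal sketch in plain Java (the uuid string is made up):

```java
public class KeyOrderSketch {
    public static void main(String[] args) {
        String indexFamily = "\u0000index";       // index column: F = '\x00index'
        String uuidFamily  = "3f2a-uuid-example"; // key/value column: F = uuid (hypothetical)
        // Lexicographic comparison mirrors Accumulo's byte-wise key ordering
        // for ASCII data: the index entry is seen first during iteration.
        System.out.println(indexFamily.compareTo(uuidFamily) < 0); // true
    }
}
```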
>>
>> The filtering iterator that makes sure any key/value rows are in the index
>> manages a HashSet internally. The index rows purposefully sort before the
>> key/value rows so that the filter can build up the HashSet containing the
>> uuids in the index. As the filter iterates into the key/value rows, it
>> returns true only if the uuid of the key/value exists in the HashSet of
>> uuids from the index. This worked with older versions of Accumulo, but I'm
>> now seeing a weird artifact where init() is called on my Filter in the
>> middle of iterating through an index row.
>>
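Outside of the Accumulo iterator API, the filter logic described above can be sketched with plain Java collections; the entry layout and names here are assumptions for illustration, not the actual code from the thread:

```java
import java.util.*;

public class StatefulFilterSketch {
    // Accept a key/value entry only if its uuid was seen in an index entry.
    // entries: sorted list of String[]{family, value} within one row (illustrative).
    static List<String[]> filter(List<String[]> entries) {
        Set<String> uuids = new HashSet<>();           // the state the real Filter keeps
        List<String[]> kept = new ArrayList<>();
        for (String[] e : entries) {
            if (e[0].startsWith("\u0000index")) {      // index column: remember uuid
                uuids.add(e[1].split("\u0000")[0]);    // V = uuid\x00hashOfKeys&Values
                kept.add(e);
            } else if (uuids.contains(e[0])) {         // key/value column: F = uuid
                kept.add(e);
            }                                          // else: uuid not indexed, drop
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[]{"\u0000index", "u1\u0000h1"}, // index entry for u1
            new String[]{"u1", "k\u0000v"},            // kept: u1 is indexed
            new String[]{"u2", "k\u0000v"});           // dropped: no index entry
        System.out.println(filter(entries).size());    // 2 of the 3 entries kept
    }
}
```

If the scan buffer fills and the iterator stack is re-created mid-row, the real Filter's init() would empty the uuid set before the key/value entries arrive, causing them all to be dropped, which matches the symptom described below.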
>> More specifically, the Filter will iterate through the index rows of a
>> specific "index" and build up a HashSet, then init() will be called, which
>> wipes away the HashSet of uuids, and then the filter goes on to iterate
>> through the key/value rows. Keep in mind, we are talking about maybe 400k
>> entries, not enough to have more than one tablet.
>>
>> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
>> know it has got to be a huge no-no to store state inside of a filter, but I
>> haven't had any issues until trying to update my code for the new version.
>> If I'm doing this completely wrong, any ideas on how to make this better?
>>
>>
>> Thanks!
>>
>>
>> --
>> Corey Nolet
>> Senior Software Engineer
>> TexelTek, inc.
>> [Office] 301.880.7123
>> [Cell] 410-903-2110