Accumulo, mail # user - Knowing when an iterator is at the "last row/entry"


Re: Knowing when an iterator is at the "last row/entry"
Terry P. 2014-01-08, 20:53
Hi Keith,
Well, not exactly (you're leaps ahead of me), but that is a great idea!
Meaning, if we could do what you're suggesting, we wouldn't need to run a
weekly maintenance job to actually perform a full major compaction in order
to purge out expired data.  But for our iterator, we simply wanted to keep
tabs on possible bad data (incorrect date formats to be exact) which would
prevent the data from being purged properly by the iterator when a full
major compaction was performed (which we intend to schedule weekly or as
required).

I took a look at ACCUMULO-1266, and that sounds really useful -- but
wouldn't that still rely on having a close() method (or something similar)
in iterators, which is exactly what I have run into as lacking (and for
which you opened ACCUMULO-1280)?
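
The idea discussed above (scans feed counters, and something later reads those counters to decide whether a full major compaction is worthwhile) can be sketched in plain Java. This is only an illustration under assumptions: `ScanFeedback`, `recordScan`, and `shouldCompact` are hypothetical names, nothing here uses Accumulo's compaction strategy API, and it presumes some end-of-scan hook exists to call `recordScan`, which is exactly the piece this thread says is missing.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; none of these names come from Accumulo. */
class ScanFeedback {
    private final AtomicLong entriesSeen = new AtomicLong();
    private final AtomicLong entriesSuppressed = new AtomicLong();

    /** Assumed to be called once when a scan finishes (the missing close() hook). */
    void recordScan(long seen, long suppressed) {
        entriesSeen.addAndGet(seen);
        entriesSuppressed.addAndGet(suppressed);
    }

    /** True when the fraction of suppressed entries reaches the given threshold. */
    boolean shouldCompact(double threshold) {
        long seen = entriesSeen.get();
        if (seen == 0) {
            return false; // no scans recorded yet, nothing to base a decision on
        }
        return (double) entriesSuppressed.get() / seen >= threshold;
    }
}
```

Two fixed-size counters keep the memory cost constant no matter how many scans report in, which seems to be the point of preferring counters over richer summaries.
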
On Wed, Jan 8, 2014 at 11:00 AM, Keith Turner <[EMAIL PROTECTED]> wrote:

>
>
>
> On Wed, Jan 8, 2014 at 10:41 AM, Terry P. <[EMAIL PROTECTED]> wrote:
>
>> Hi Keith,
>> The goal of the iterator is to purge data that has expired (or suppress
>> it for scans). The goal of the log message is to bring to light any data
>> format issues, as otherwise the "bad data" would NOT be purged by the
>> iterator and hang around forever, which would be bad, so yes we would purge
>> it with a special job. The iterator fires at both Full Major Compaction and
>> at Scan time.
>>
>
> So you want to use the summary data from scans to know if you should kick
> off a full major compaction?  In 1.6.0 compaction strategies were added
> (ACCUMULO-1451).  If scans could provide information to these compaction
> strategies, then that would lay the groundwork for ACCUMULO-1266 and what
> you are trying to achieve.  I am not sure of the best way to do this.
> Maybe when a scan iterator is closed it could update counters (maybe
> counters encourage small memory usage).  The compaction strategy could
> access the counters and use them to make a decision about doing a full
> major compaction.
>
>
>>
>> Good point on "How did the bad data get there?" -- it shouldn't be, based on
>> how items are indexed and then inserted into Accumulo, but I wanted to
>> check for it in case the individual that installs the iterator in Accumulo
>> fat-fingers the date format, OR if someone changes it on the other side
>> (the app that sends the data to Accumulo). The first one could happen
>> easily, but the latter shouldn't happen. But as folks roll off programs and
>> others maintain the code, anything can happen.
>>
>
>> Looks like ACCUMULO-1280 is exactly what I need! Maybe someday, but until
>> then what I have for the iterator will do the job (and thanks again for
>> your help on it!).
>>
>> Best regards,
>> Terry
>>
>> On Wed, Jan 8, 2014 at 9:30 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
>>
>>> What is your goal?  It seems like you want to produce counts about bad
>>> data suppressed at scan time.  What will you do with these counts?  Will
>>> you ever purge the bad data?  How did the bad data get there?  If you are
>>> not bulk importing the data, then maybe you could add constraints to the
>>> table?
>>>
>>>
>>> On Mon, Jan 6, 2014 at 7:30 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>>>
>>>> Greetings folks,
>>>> I have an iterator that extends RowFilter and I have a case where I
>>>> need to know when its defined date format doesn't match the format of the
>>>> data being scanned by the iterator.  I don't want to flood the tserver log
>>>> with an error per row (how horrid that would be), but instead keep a
>>>> counter of the number of times that error occurs during a scan or major
>>>> compaction.
>>>>
>>>> Trouble is, I don't see any way to know when an iterator is on the
>>>> "last row" or "last entry" in its scan on a tabletserver, as if I could
>>>> test for that, I could then dump my single log message with the count of
>>>> date format parse errors for that scan/compaction.
>>>>
>>>> Anyone know a way to determine if an iterator is at the "last entry" or
>>>> "last row" of its execution?
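
Since the thread concludes that an iterator has no reliable end-of-scan hook, one workaround for the original question is to count parse failures and log only at widening intervals rather than once at the end. The following is a minimal sketch of that idea, not the poster's iterator: `ParseErrorTally` and its methods are hypothetical, the message list stands in for a real logger, and the `yyyy-MM-dd` pattern in the usage note is an assumed example format. A filter's row-acceptance method could call `parseOrCount` for each date value it inspects.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Accumulo: tallies date parse failures. */
class ParseErrorTally {
    private final DateTimeFormatter format;
    private final List<String> messages = new ArrayList<>(); // stand-in for a logger
    private long errors = 0;
    private long nextReportAt = 1; // report at 1, 10, 100, ... errors

    ParseErrorTally(String pattern) {
        this.format = DateTimeFormatter.ofPattern(pattern);
    }

    /** Returns the parsed date, or null if the value does not match the format. */
    LocalDate parseOrCount(String value) {
        try {
            return LocalDate.parse(value, format);
        } catch (DateTimeParseException e) {
            errors++;
            if (errors == nextReportAt) {
                // One line per power of ten instead of one line per bad row.
                messages.add("date format parse errors so far: " + errors);
                nextReportAt *= 10;
            }
            return null;
        }
    }

    long errorCount() {
        return errors;
    }

    List<String> messages() {
        return messages;
    }
}
```

With `new ParseErrorTally("yyyy-MM-dd")`, a value such as `01/08/2014` is counted as an error, and twelve such values produce two log lines (at the 1st and 10th error) rather than twelve. The trade-off is that the last logged count can lag the true total, which a close() hook as requested in ACCUMULO-1280 would avoid.
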