Accumulo >> mail # user >> 'Redundant' mutations


Re: 'Redundant' mutations
Short answer: yes, on disk these redundant keys are removed eventually.

On Feb 9, 2012, at 10:14 AM, Keith Turner wrote:

> On Thu, Feb 9, 2012 at 9:50 AM, Benson Margulies <[EMAIL PROTECTED]> wrote:
>> On Thu, Feb 9, 2012 at 9:47 AM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>>> You get "a"
>>>
>>> By default, tables are configured with a "versioning iterator" that filters out all but the latest "version" of a key, meaning the key with the latest timestamp. This provides the behavior you describe: cleaning out redundant keys that differ only in timestamp.
>>
>> I understood that the default was only to see the latest, but does
>> disk space remain consumed with older ones until something happens, or
>> does it clean out itself?
>>>
>>>
>>> On Feb 9, 2012, at 9:43 AM, Benson Margulies wrote:
>>>
>>>> At time 0, I make a Mutation with put("a", "b", "c");
>>>>
>>>> At time 1, I do it again.
>>>>
>>>> Do I get:
>>>>
>>>> a) two copies of the same data with different timestamps?
>>>>
>>>> b) an error?
>>>>
>>>> c) something else?
>>>>
>>>> If the idea I'm looking for is to end up with one item without doing a
>>>> scan each time to see if it's out there, is there a 'garbage
>>>> collection' cliche for cleaning out redundant items that differ only
>>>> in timestamp?
>>>
>
> It depends on a few factors.
>  * If the two mutations were written to the same in-memory map, only
> one is written out when that map is minor compacted.
>  * If the two mutations were written to different in-memory maps,
> then the data will be minor compacted to separate files.  In this case
> the redundant copy will not go away until a major compaction occurs
> (one that merges multiple files, controlled by the major compaction
> ratio).  This can be triggered by additional data being written or by
> a user forcing a major compaction on a table.
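
The two cases above can be sketched with a toy model. This is not the Accumulo API: the class, the `Entry` type, and the method names are all hypothetical, and the "versioning" step simply keeps the newest timestamp per key, assuming the default of retaining one version. It only illustrates why the redundant copy can stay on disk until a major compaction, while a scan already sees a single entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only. Nothing here is part of Accumulo; all names are hypothetical.
class RedundantMutationSketch {

    // One stored entry: a flattened key, a timestamp, and a value.
    static final class Entry {
        final String key;
        final long timestamp;
        final String value;

        Entry(String key, long timestamp, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Keep only the newest entry per key (assumed: one version retained).
    static List<Entry> keepLatest(List<Entry> entries) {
        Map<String, Entry> newest = new TreeMap<>();
        for (Entry e : entries) {
            newest.merge(e.key, e, (a, b) -> a.timestamp >= b.timestamp ? a : b);
        }
        return new ArrayList<>(newest.values());
    }

    // Minor compaction: flush one in-memory map to one file.
    static List<Entry> minorCompact(List<Entry> memoryMap) {
        return keepLatest(memoryMap);
    }

    // Merge the contents of several files into one list, unfiltered.
    static List<Entry> mergeFiles(List<List<Entry>> files) {
        List<Entry> merged = new ArrayList<>();
        for (List<Entry> file : files) {
            merged.addAll(file);
        }
        return merged;
    }

    // Major compaction: merge several files into one, dropping old versions.
    static List<Entry> majorCompact(List<List<Entry>> files) {
        return keepLatest(mergeFiles(files));
    }

    // A scan reads all files but still shows only the newest version.
    static List<Entry> scan(List<List<Entry>> files) {
        return keepLatest(mergeFiles(files));
    }

    // Total number of entries physically stored across all files.
    static int entriesOnDisk(List<List<Entry>> files) {
        return mergeFiles(files).size();
    }

    public static void main(String[] args) {
        Entry t0 = new Entry("k", 0L, "c"); // written at time 0
        Entry t1 = new Entry("k", 1L, "c"); // same key and value at time 1

        // Case 1: both mutations land in the same in-memory map.
        List<List<Entry>> sameMap = List.of(minorCompact(List.of(t0, t1)));
        System.out.println("same map, on disk: " + entriesOnDisk(sameMap));

        // Case 2: the mutations land in different in-memory maps.
        List<List<Entry>> separate =
                List.of(minorCompact(List.of(t0)), minorCompact(List.of(t1)));
        System.out.println("separate maps, on disk: " + entriesOnDisk(separate));
        System.out.println("separate maps, seen by scan: " + scan(separate).size());

        // A major compaction merges the files and drops the older copy.
        List<List<Entry>> compacted = List.of(majorCompact(separate));
        System.out.println("after major compaction, on disk: " + entriesOnDisk(compacted));
    }
}
```

In this model the second case holds two entries on disk but a scan returns one, and the count on disk drops to one only after the merge step.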