HBase >> mail # user >> Delete all data before a given timestamp


Re: Delete all data before a given timestamp
You might be interested in HBASE-8784 (https://issues.apache.org/jira/browse/HBASE-8784).

----- Original Message -----
From: Chao Shi <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Sent: Monday, July 15, 2013 8:07 PM
Subject: Re: Delete all data before a given timestamp

Jean-Marc Spaggiari <jean-marc@...> writes:

>
> When you send a delete command to the server, you can specify a timestamp.
> So as the result of your MR job, "just" emit this delete with the specific
> timestamp to remove any previous version?
>
> JM
>
> 2013/7/15 Chao Shi <stepinto@...>
>
> > Hi HBase users,
> >
> > We have created an index table (say T2) of another table (say T1). The
> > clients who write to T1 also write an index record to T2 with the same
> > timestamp. Inconsistency may accumulate as time goes by. So we
> > run an MR job periodically, which fully scans T1, builds an index, and
> > bulk-loads the result to T2.
> >
> > The MR job may run for a while, and during that period all new data
> > written to T2 must be kept and not overridden. So the MR job creates
> > puts using the timestamp at which the job starts.
> >
> > Then we want all data in T2 before a given timestamp to become invisible
> > to reads after the index builds successfully, and to get deleted eventually
> > (e.g. during major compaction). We prefer setting it explicitly to using
> > the TTL feature for safety, as we want old data to be deleted only once
> > the new data is written. Does HBase support this kind of operation for now?
> >
> > Thanks,
> > Chao
> >
>

Hi Jean-Marc,

Thanks for the reply.

I see that a delete can specify a timestamp, but I don't think that is what I
need. To clarify, in my scenario, I don't want to issue deletes for every key
(because I don't know exactly what to delete unless I do another full scan).
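
To illustrate the concern, here is a toy in-memory sketch in plain Python. It is not the HBase client API: `ToyTable`, `delete_row_before`, and the sample rows are made up for illustration. It assumes that a timestamped delete masks the versions of a single row at or before the given timestamp, so the caller has to know every row key in advance.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> list of (timestamp, value). Not the HBase API."""

    def __init__(self):
        self.rows = defaultdict(list)
        self.tombstones = {}  # row key -> timestamp of the newest delete marker

    def put(self, row, ts, value):
        self.rows[row].append((ts, value))

    def delete_row_before(self, row, ts):
        # Assumption: the marker masks versions with timestamp <= ts,
        # and only for this one row.
        self.tombstones[row] = max(ts, self.tombstones.get(row, ts))

    def get(self, row):
        cutoff = self.tombstones.get(row)
        live = [(t, v) for t, v in self.rows.get(row, [])
                if cutoff is None or t > cutoff]
        # Return the newest visible version, or None if everything is masked.
        return max(live, key=lambda cell: cell[0])[1] if live else None


job_start = 200
t2 = ToyTable()
t2.put("stale-key", 100, "old")   # index entry that should disappear
t2.put("fresh-key", 100, "old")
t2.put("fresh-key", 250, "new")   # written after the job started

# One delete per row: the loop itself is the extra full scan.
for row in list(t2.rows):
    t2.delete_row_before(row, job_start - 1)

stale = t2.get("stale-key")
fresh = t2.get("fresh-key")
```

In this model the number of delete markers grows with the number of rows, which is the cost I would like to avoid.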

I'd like to see if this is possible: set a min_timestamp on the
ColumnDescriptor. Once set, KVs with timestamps before it become invisible to
reads. During major compaction, these KVs are deleted. It is an absolute-time
version of TTL.
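
A minimal sketch of the semantics I have in mind, again as a toy Python model and not as existing HBase behaviour. `ToyFamily`, `min_timestamp`, and `major_compact` are hypothetical names used only for illustration. One setting hides everything older than the cutoff, and compaction later removes it physically.

```python
class ToyFamily:
    """Toy column family with a hypothetical min_timestamp setting."""

    def __init__(self):
        self.cells = []         # list of (row, timestamp, value)
        self.min_timestamp = 0  # hypothetical attribute, not a real HBase option

    def put(self, row, ts, value):
        self.cells.append((row, ts, value))

    def get(self, row):
        # Cells older than min_timestamp are invisible to reads.
        visible = [(ts, v) for r, ts, v in self.cells
                   if r == row and ts >= self.min_timestamp]
        return max(visible, key=lambda cell: cell[0])[1] if visible else None

    def major_compact(self):
        # Cells older than min_timestamp are physically dropped.
        self.cells = [c for c in self.cells if c[1] >= self.min_timestamp]


job_start = 200
t2 = ToyFamily()
t2.put("stale", 100, "orphan index entry")
t2.put("k1", 100, "old")
t2.put("k1", job_start, "rebuilt")           # bulk-loaded by the rebuild job
t2.put("k2", 230, "client write during job")

# After the rebuild succeeds, a single setting replaces all per-key deletes.
t2.min_timestamp = job_start
cells_before = len(t2.cells)
stale, k1, k2 = t2.get("stale"), t2.get("k1"), t2.get("k2")

t2.major_compact()
cells_after = len(t2.cells)
```

No row keys need to be known up front, and data written while the job was running stays visible because its timestamps are at or after the job start.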