HBase, mail # user - Delete all data before a given timestamp


Re: Delete all data before a given timestamp
Jimmy Xiang 2013-07-16, 18:50
When you set up the MR job, would it help to set a proper timestamp filter or
time range on the Scan object?
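
A minimal sketch of what a time range on a scan would do, written as a plain-Java model rather than against the HBase client. The `Cell` class, `scanWithTimeRange`, and the sample timestamps are hypothetical stand-ins; the range is modeled as half-open, `[minStamp, maxStamp)`, which is how `Scan.setTimeRange(minStamp, maxStamp)` is understood to behave.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a half-open time range [minStamp, maxStamp); not HBase code. */
public class TimeRangeSketch {

    /** Hypothetical stand-in for a stored cell: a row key, a value, and a write timestamp. */
    static final class Cell {
        final String row;
        final String value;
        final long timestamp;

        Cell(String row, String value, long timestamp) {
            this.row = row;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Keeps only cells whose timestamp t satisfies minStamp <= t < maxStamp. */
    static List<Cell> scanWithTimeRange(List<Cell> cells, long minStamp, long maxStamp) {
        List<Cell> visible = new ArrayList<>();
        for (Cell c : cells) {
            if (c.timestamp >= minStamp && c.timestamp < maxStamp) {
                visible.add(c);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        long jobStart = 1000L; // assumed timestamp at which the rebuild job started
        List<Cell> t2 = new ArrayList<>();
        t2.add(new Cell("k1", "stale-index-entry", 400L));
        t2.add(new Cell("k2", "rebuilt-index-entry", 1000L));
        t2.add(new Cell("k3", "live-write-during-job", 1500L));

        // Read only cells written at or after the job start.
        for (Cell c : scanWithTimeRange(t2, jobStart, Long.MAX_VALUE)) {
            System.out.println(c.row + " @" + c.timestamp + " = " + c.value);
        }
    }
}
```

Note that a time range only hides older cells from the scans that set it; it does not remove anything from the table.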
On Tue, Jul 16, 2013 at 5:59 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Another option might be to set up the proper TTL on the table? You alter the
> table to set the TTL to reflect your timestamp, then you run a compaction?
> The issue is that you have to disable the table while you alter it.
>
> JM
>
> 2013/7/16 Ted Yu <[EMAIL PROTECTED]>
>
> > Would this method (of Delete) serve your need?
> >
> >   public Delete deleteFamily(byte [] family, long timestamp) {
> >
> > From its Javadoc:
> >
> >    * Delete all columns of the specified family with a timestamp less than
> >    * or equal to the specified timestamp.
> >
> > On Mon, Jul 15, 2013 at 8:07 PM, Chao Shi <[EMAIL PROTECTED]> wrote:
> >
> > > Jean-Marc Spaggiari <jean-marc@...> writes:
> > >
> > > >
> > > > When you send a delete command to the server, you can specify a
> > > > timestamp. So as the result of your MR job, "just" emit this delete
> > > > with the specific timestamp to remove any previous version?
> > > >
> > > > JM
> > > >
> > > > 2013/7/15 Chao Shi <stepinto@...>
> > > >
> > > > > Hi HBase users,
> > > > >
> > > > > > We have created an index table (say T2) of another table (say
> > > > > > T1). The clients that write to T1 also write an index record to
> > > > > > T2 with the same timestamp. Inconsistencies may accumulate over
> > > > > > time, so we periodically run an MR job that fully scans T1,
> > > > > > builds an index, and bulk-loads the result into T2.
> > > > > >
> > > > > > The MR job may run for a while, and all new data written to T2
> > > > > > during that period must be kept and not be overridden. So the MR
> > > > > > job creates its puts using the timestamp at which the job
> > > > > > started.
> > > > > >
> > > > > > After the index has been built successfully, we want all data in
> > > > > > T2 older than a given timestamp to become invisible to reads and
> > > > > > eventually be deleted (e.g. during major compaction). For
> > > > > > safety, we prefer setting this explicitly rather than using the
> > > > > > TTL feature, because old data should be deleted only once the
> > > > > > new data has been written. Does HBase currently support this
> > > > > > kind of operation?
> > > > >
> > > > > Thanks,
> > > > > Chao
> > > > >
> > > >
> > >
> > > Hi Jean-Marc,
> > >
> > > Thanks for the reply.
> > >
> > > I see that a delete can specify a timestamp, but I don't think that is
> > > what I need. To clarify, in my scenario, I don't want to issue deletes
> > > for every key (because I don't know exactly what to delete unless I do
> > > another full scan).
> > >
> > > I'd like to see if this is possible: set a min_timestamp on the
> > > ColumnDescriptor. Once that is done, KVs before this timestamp become
> > > invisible to reads. During major compaction, these KVs are deleted. In
> > > effect, it is an absolute-timestamp version of TTL.
> > >
> > >
> > >
> > >
> > >
> >
>
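
To make the difference between the two approaches in this thread concrete, here is a minimal plain-Java model (no HBase dependency). It contrasts per-row family delete markers, which hide cells with a timestamp less than or equal to the marker as the quoted Javadoc describes, with the family-wide `min_timestamp` cutoff Chao proposes. The `Cell` class, both method names, and the sample data are hypothetical; `min_timestamp` is the proposal from the thread, not a setting this thread confirms exists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the two cutoff styles discussed in the thread; not HBase code. */
public class CutoffSketch {

    /** Hypothetical stand-in for a stored KV: only the row key and timestamp matter here. */
    static final class Cell {
        final String row;
        final long timestamp;

        Cell(String row, long timestamp) {
            this.row = row;
            this.timestamp = timestamp;
        }
    }

    /**
     * Per-row family delete markers, keyed by row. A cell is hidden when its
     * timestamp is less than or equal to the marker's timestamp for that row.
     */
    static List<Cell> readWithFamilyDeletes(List<Cell> cells, Map<String, Long> markers) {
        List<Cell> visible = new ArrayList<>();
        for (Cell c : cells) {
            Long marker = markers.get(c.row);
            if (marker == null || c.timestamp > marker) {
                visible.add(c);
            }
        }
        return visible;
    }

    /** Proposed (hypothetical) family-wide cutoff: cells strictly older than it are hidden. */
    static List<Cell> readWithMinTimestamp(List<Cell> cells, long minTimestamp) {
        List<Cell> visible = new ArrayList<>();
        for (Cell c : cells) {
            if (c.timestamp >= minTimestamp) {
                visible.add(c);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        long jobStart = 1000L; // assumed start timestamp of the rebuild job
        List<Cell> t2 = new ArrayList<>();
        t2.add(new Cell("k1", 400L));   // stale entry in a row the job emits
        t2.add(new Cell("k1", 1000L));  // rebuilt entry, written at jobStart
        t2.add(new Cell("k2", 500L));   // stale entry in a row the job never emits

        // Marker at jobStart - 1 so that the rebuilt cell at jobStart stays visible.
        Map<String, Long> markers = new HashMap<>();
        markers.put("k1", jobStart - 1);

        System.out.println("family deletes: " + readWithFamilyDeletes(t2, markers).size() + " visible");
        System.out.println("min timestamp:  " + readWithMinTimestamp(t2, jobStart).size() + " visible");
    }
}
```

In this model the stale cell in row `k2` survives the per-row markers because the job never emits that row, which is the case Chao describes as needing another full scan; the family-wide cutoff hides it without knowing the row key.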