HBase, mail # user - Delete all data before a given timestamp


Re: Delete all data before a given timestamp
Chao Shi 2013-07-17, 03:35
Yes, this is what we do now. We maintain a lower bound on the timestamp for
scans. Once an index build is done, we raise it to a higher value.
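
A minimal sketch of this lower-bound approach, written as a self-contained
Java model rather than against the HBase client. The class IndexReadBound, its
Cell type, and the method names are hypothetical and exist only for
illustration. In a real job the bound would presumably be handed to
Scan.setTimeRange(minStamp, maxStamp), whose lower end is inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a read-side lower bound on timestamps; not the HBase client API.
final class IndexReadBound {
    // One index entry: a row key plus its write timestamp.
    static final class Cell {
        final String row;
        final long timestamp;

        Cell(String row, long timestamp) {
            this.row = row;
            this.timestamp = timestamp;
        }
    }

    // Inclusive lower bound applied to every read of the index table.
    private long minTimestamp = 0L;

    // Called once an index build that started at buildStart has been loaded.
    // The bound only moves forward, so a late caller cannot re-expose old data.
    void indexBuildFinished(long buildStart) {
        minTimestamp = Math.max(minTimestamp, buildStart);
    }

    // Returns the cells a reader may see: those written at or after the bound.
    List<Cell> scan(List<Cell> stored) {
        List<Cell> visible = new ArrayList<>();
        for (Cell c : stored) {
            if (c.timestamp >= minTimestamp) {
                visible.add(c);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Cell> stored = new ArrayList<>();
        stored.add(new Cell("a", 100L)); // stale entry from before the rebuild
        stored.add(new Cell("a", 200L)); // written by the rebuild that started at t=200
        stored.add(new Cell("b", 250L)); // live write that arrived during the rebuild

        IndexReadBound bound = new IndexReadBound();
        System.out.println(bound.scan(stored).size()); // 3: nothing hidden yet
        bound.indexBuildFinished(200L);
        System.out.println(bound.scan(stored).size()); // 2: the stale entry is hidden
    }
}
```

Because the rebuild writes its puts with the job start time, raising the bound
to that same value keeps both the rebuilt entries and any newer live writes
visible.
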
On Wed, Jul 17, 2013 at 2:50 AM, Jimmy Xiang <[EMAIL PROTECTED]> wrote:

> When you set up the MR, does it help to set a proper timestamp filter or
> time range in the scan object?
>
>
> On Tue, Jul 16, 2013 at 5:59 AM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>
> > Another option might be to set up the proper TTL on the table? You alter the
> > table to set the TTL to reflect your timestamp, then you run a compaction?
> > The issue is that you have to disable the table while you alter it.
> >
> > JM
> >
> > 2013/7/16 Ted Yu <[EMAIL PROTECTED]>
> >
> > > Would this method (of Delete) serve your need ?
> > >
> > >   public Delete deleteFamily(byte [] family, long timestamp) {
> > >
> > > From its Javadoc:
> > >
> > >    * Delete all columns of the specified family with a timestamp less than
> > >    * or equal to the specified timestamp.
> > >
> > > On Mon, Jul 15, 2013 at 8:07 PM, Chao Shi <[EMAIL PROTECTED]> wrote:
> > >
> > > > Jean-Marc Spaggiari <jean-marc@...> writes:
> > > >
> > > > >
> > > > > > When you send a delete command to the server, you can specify a timestamp.
> > > > > > So as the result of your MR job, "just" emit this delete with the specific
> > > > > > timestamp to remove any previous version?
> > > > >
> > > > > JM
> > > > >
> > > > > 2013/7/15 Chao Shi <stepinto@...>
> > > > >
> > > > > > Hi HBase users,
> > > > > >
> > > > > > > We have created an index table (say T2) for another table (say T1). The
> > > > > > > clients that write to T1 also write an index record to T2 with the same
> > > > > > > timestamp. Inconsistencies may accumulate as time goes by, so we
> > > > > > > run an MR job periodically, which fully scans T1, builds an index, and
> > > > > > > bulk-loads the result into T2.
> > > > > >
> > > > > > > Because the MR job may run for a while, all new data written to T2 in the
> > > > > > > meantime must be kept and not be overwritten. So the MR job creates
> > > > > > > puts using the timestamp at which the job started.
> > > > > >
> > > > > > > Then we want all data in T2 before a given timestamp to become invisible
> > > > > > > to reads after the index build succeeds, and to be deleted eventually (e.g.
> > > > > > > during major compaction). For safety, we prefer setting it explicitly over
> > > > > > > using the TTL feature, as we want old data to be deleted only once the new
> > > > > > > data has been written. Does HBase currently support this kind of operation?
> > > > > >
> > > > > > Thanks,
> > > > > > Chao
> > > > > >
> > > > >
> > > >
> > > > Hi Jean-Marc,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > I see that a delete can specify a timestamp, but I don't think that is what I
> > > > need.
> > > > To clarify, in my scenario, I don't want to issue deletes for every key
> > > > (because I don't know exactly what to delete unless I do another full scan).
> > > >
> > > > I'd like to see if this is possible: set a min_timestamp on the
> > > > ColumnDescriptor. Once done, KVs before this timestamp become invisible to
> > > > reads. During major compaction, these KVs are deleted. It is the absolute
> > > > version of TTL.
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
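
For reference, the min_timestamp behaviour requested in the original question
can be sketched as a small self-contained Java model. This illustrates the
requested semantics only: it is not an existing HBase feature or API, and the
class MinTimestampFamily and all of its methods are hypothetical. Unlike TTL,
which is relative to the current time, the cutoff here is an absolute
timestamp that moves only when it is set explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of an absolute per-family cutoff; all names are hypothetical.
final class MinTimestampFamily {
    private final List<Long> cellTimestamps = new ArrayList<>();
    private long minTimestamp = Long.MIN_VALUE;

    void put(long timestamp) {
        cellTimestamps.add(timestamp);
    }

    // Explicitly set the cutoff, e.g. after an index build has succeeded.
    void setMinTimestamp(long timestamp) {
        minTimestamp = timestamp;
    }

    // Reads skip cells older than the cutoff, even before they are purged.
    List<Long> read() {
        List<Long> visible = new ArrayList<>();
        for (long ts : cellTimestamps) {
            if (ts >= minTimestamp) {
                visible.add(ts);
            }
        }
        return visible;
    }

    // Stand-in for major compaction: physically drop the hidden cells.
    // Returns how many cells were removed.
    int majorCompact() {
        int before = cellTimestamps.size();
        cellTimestamps.removeIf(ts -> ts < minTimestamp);
        return before - cellTimestamps.size();
    }

    int storedCount() {
        return cellTimestamps.size();
    }

    public static void main(String[] args) {
        MinTimestampFamily family = new MinTimestampFamily();
        family.put(100L);
        family.put(200L);
        family.put(300L);

        family.setMinTimestamp(200L);
        System.out.println(family.read().size());  // 2: the cell at t=100 is hidden
        System.out.println(family.storedCount());  // 3: but it is still stored
        System.out.println(family.majorCompact()); // 1: compaction removes it
        System.out.println(family.storedCount());  // 2
    }
}
```
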