
HBase, mail # dev - Efficiently wiping out random data?


Re: Efficiently wiping out random data?
Todd Lipcon 2013-06-19, 16:27
I'd also question what exactly the regulatory requirements for deletion
are. For example, if you had tape backups of your Oracle DB, would you have
to drive to your off-site storage facility, grab every tape you ever made,
and zero out the user's data as well? I doubt it, considering tapes have
basically the same storage characteristics as HDFS in terms of the
inability to do random writes.

Another example: deletes work the same way in most databases -- e.g. in
postgres, deleting a record just consists of setting its "xmax" column
to the current transaction ID. This is equivalent to a tombstone,
and you have to wait for a VACUUM process to come along and actually delete
the record entry. In Oracle, the record will persist in a rollback segment
for a configurable amount of time, and you can use a Flashback query to
time travel and see it again. In Vertica, you also set an "xmax" entry and
wait until the next merge-out (like a major compaction).
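
A minimal JDBC sketch of the tombstone-until-VACUUM behavior described
above; the connection settings, table name, and user id are all made up,
and pg_stat_user_tables is refreshed asynchronously, so the dead-tuple
count may lag a moment:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PostgresTombstoneSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection settings and table.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/demo", "demo", "demo");
                 Statement st = conn.createStatement()) {
                // The DELETE only stamps xmax on the matching rows; the tuples
                // stay on disk as "dead tuples" until VACUUM reclaims them.
                st.executeUpdate("DELETE FROM users WHERE user_id = 42");
                try (ResultSet rs = st.executeQuery(
                        "SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'users'")) {
                    while (rs.next()) {
                        System.out.println("dead tuples still on disk: " + rs.getLong(1));
                    }
                }
                st.execute("VACUUM users"); // physically removes the dead tuples
            }
        }
    }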

Even in a filesystem, deletion doesn't typically remove data unless you
use a tool like srm. It just unlinks the file from the directory tree;
the data blocks stay on disk until they happen to be reused.

So, if any of the above systems satisfies the use case, then HBase ought
to as well. Perhaps there's an ACL we could add that would allow or
disallow users from doing time travel more than N seconds into the
past... maybe that would help allay fears?
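
For context, the kind of time travel such an ACL would restrict is just a
timestamp-bounded read. A minimal sketch against the HBase 2.x client API,
assuming a hypothetical "users" table whose column family keeps multiple
versions (VERSIONS > 1):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeTravelScanSketch {
        public static void main(String[] args) throws Exception {
            // Arbitrary cutoff ("N seconds" in the proposal); here 30 days ago.
            long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                Scan scan = new Scan()
                    .setTimeRange(0, cutoff)          // only cells written before the cutoff
                    .readVersions(Integer.MAX_VALUE); // include older versions, not just the latest
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println("old data still readable for row " +
                            Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }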

-Todd

On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <[EMAIL PROTECTED]> wrote:

> Chances are that data isn't completely "random". For instance, for a user
> the user id is likely part of the row key, so doing a filtering (with
> a custom scanner) major compaction would clean that up. With Sergey's
> compaction stuff coming in you could break that out even further and only
> have to compact a small set of files to get that removal.
>
> So it's hard, but as it's not our direct use case, it's gonna be a few
> extra hoops.
>
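
Jesse's point is about a server-side filtering major compaction; for
comparison, the plain client-side delete step would look roughly like the
sketch below (HBase 2.x client API; the "events" table and the user-id
row-key prefix are made up). Note it only writes tombstones, so physical
removal still depends on a compaction:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteUserRowsSketch {
        public static void main(String[] args) throws Exception {
            byte[] prefix = Bytes.toBytes("user12345|"); // hypothetical user-id row-key prefix
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                // Start the scan at the prefix so we don't read the whole table.
                Scan scan = new Scan().withStartRow(prefix).setFilter(new PrefixFilter(prefix));
                List<Delete> batch = new ArrayList<>();
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        batch.add(new Delete(r.getRow()));
                        if (batch.size() >= 1000) {
                            table.delete(batch); // writes tombstones only
                            batch.clear();
                        }
                    }
                }
                if (!batch.isEmpty()) {
                    table.delete(batch);
                }
            }
        }
    }
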
> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>
> > Yeah, the immutable nature of HDFS is biting us here.
> >
> >
> > On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> >
> > > That sounds like a very effective way for developers to kill clusters
> > > with compactions :)
> > >
> > > J-D
> > >
> > > On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <[EMAIL PROTECTED]> wrote:
> > > > JD,
> > > >
> > > >    What about adding a flag for the delete, something like -full or
> > > > -true (it is early).  Once we issue the delete to the proper
> > > > row/region we run a flush, then execute a single region major
> > > > compaction.  That way, if it is a single record or a subset of data,
> > > > the impact is minimal.  If the delete happens to hit every region we
> > > > will compact every region (not ideal).  Another thought would be an
> > > > overwrite, but with versions this logic becomes more complicated.
> > > >
> > > >
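
A rough sketch of the flush-then-compact-one-region idea Kevin describes,
using the HBase 2.x Admin API; the table name and row key are made up, and
majorCompactRegion only queues the compaction rather than waiting for it:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FlushAndCompactOneRegionSketch {
        public static void main(String[] args) throws Exception {
            TableName tn = TableName.valueOf("events");              // hypothetical table
            byte[] deletedRow = Bytes.toBytes("user12345|evt0001");  // row the Delete targeted
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin();
                 RegionLocator locator = conn.getRegionLocator(tn)) {
                // Flush so the delete markers land in HFiles.
                admin.flush(tn);
                // Major-compact only the region that holds the deleted row,
                // rather than the whole table.
                HRegionLocation loc = locator.getRegionLocation(deletedRow, true);
                admin.majorCompactRegion(loc.getRegion().getRegionName());
            }
        }
    }
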
> > > > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> Hey devs,
> > > >>
> > > >> I was presenting at GOTO Amsterdam yesterday and I got a question
> > > >> about a scenario that I've never thought about before. I'm wondering
> > > >> what others think.
> > > >>
> > > >> How do you efficiently wipe out random data in HBase?
> > > >>
> > > >> For example, you have a website and a user asks you to close their
> > > >> account and get rid of the data.
> > > >>
> > > >> Would you say "sure can do, lemme just issue a couple of Deletes!"
> > > >> and call it a day? What if you really have to delete the data, not
> > > >> just mask it, because of contractual obligations or local laws?
> > > >>
> > > >> Major compacting is the obvious solution but it seems really
> > > >> inefficient. Let's say you've got some truly random data to delete,
> > > >> and it so happens that you have at least one row per region to get
> > > >> rid of... then you need to basically rewrite the whole table?

Todd Lipcon
Software Engineer, Cloudera