Re: Efficiently wiping out random data?
Chances are that data isn't completely "random". For instance, a user is
likely to have an id in their row key, so a filtering major compaction
(with a custom scanner) would clean that up. With Sergey's compaction
work coming in you could break that out even further and only have to
compact a small set of files to get that removal.
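
To make that concrete: if the user id prefixes the row key, finding and
tombstoning everything for that user is just a prefix scan plus Deletes.
A rough client-side sketch (the table name, the prefix, and the newer
Connection/Table flavor of the Java client are assumptions on my part;
this only writes delete markers, the old cells stay on disk until a
major compaction):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WipeUserRows {
  public static void main(String[] args) throws IOException {
    byte[] userPrefix = Bytes.toBytes("user12345"); // hypothetical user id prefix
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("accounts"))) {
      // Scan only the rows under the user's prefix and queue a Delete per row.
      Scan scan = new Scan();
      scan.setRowPrefixFilter(userPrefix);
      List<Delete> deletes = new ArrayList<Delete>();
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          deletes.add(new Delete(r.getRow()));
        }
      }
      // Tombstones only: the data isn't physically gone until the
      // affected regions are major compacted.
      table.delete(deletes);
    }
  }
}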

So it's hard, but as it's not our direct use case, it's going to take a
few extra hoops.

On Wednesday, June 19, 2013, Kevin O'Dell wrote:

> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>
> wrote:
>
> > That sounds like a very effective way for developers to kill clusters
> > with compactions :)
> >
> > J-D
> >
> > On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'Dell <[EMAIL PROTECTED]>
> > wrote:
> > > JD,
> > >
> > >    What about adding a flag for the delete, something like -full or
> > > -true (it is early). Once we issue the delete to the proper row/region
> > > we run a flush, then execute a single-region major compaction. That
> > > way, if it is a single record or a subset of data, the impact is
> > > minimal. If the delete happens to hit every region we will compact
> > > every region (not ideal). Another thought would be an overwrite, but
> > > with versions this logic becomes more complicated.
> > >
> > >
> > > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > >> Hey devs,
> > >>
> > >> I was presenting at GOTO Amsterdam yesterday and I got a question
> > >> about a scenario that I've never thought about before. I'm wondering
> > >> what others think.
> > >>
> > >> How do you efficiently wipe out random data in HBase?
> > >>
> > >> For example, you have a website and a user asks you to close their
> > >> account and get rid of the data.
> > >>
> > >> Would you say "sure can do, lemme just issue a couple of Deletes!" and
> > >> call it a day? What if you really have to delete the data, not just
> > >> mask it, because of contractual obligations or local laws?
> > >>
> > >> Major compacting is the obvious solution but it seems really
> > >> inefficient. Let's say you've got some truly random data to delete
> > >> and it happens that you have at least one row per region to get rid
> > >> of... then you need to basically rewrite the whole table?
> > >>
> > >> That was more or less my answer; I told the attendee that it's not
> > >> an easy use case to manage in HBase.
> > >>
> > >> Thoughts?
> > >>
> > >> J-D
> > >>
> > >
> > >
> > >
> > > --
> > > Kevin O'Dell
> > > Systems Engineer, Cloudera
> >
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
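
For reference, the delete -> flush -> single-region major compaction
sequence Kevin describes above maps roughly onto the admin calls below.
This is a sketch against the newer Admin API, not what such a flag would
actually do internally; the table and row key are placeholders, and both
the flush and the major compaction calls are asynchronous requests:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteFlushCompact {
  public static void main(String[] args) throws Exception {
    TableName tableName = TableName.valueOf("accounts"); // hypothetical table
    byte[] row = Bytes.toBytes("user12345");             // hypothetical row key

    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {

      // 1. Tombstone the row.
      table.delete(new Delete(row));

      // 2. Flush only the region that hosts the row so the delete marker
      //    makes it into an HFile.
      byte[] regionName =
          locator.getRegionLocation(row).getRegionInfo().getRegionName();
      admin.flushRegion(regionName);

      // 3. Major compact just that region so the deleted cells (and the
      //    tombstone itself) are physically rewritten away, instead of
      //    compacting the whole table.
      admin.majorCompactRegion(regionName);
    }
  }
}
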
--
-------------------
Jesse Yates
@jesse_yates
jyates.github.com