HBase dev mailing list: Efficiently wiping out random data?


Earlier messages in this thread:
Jean-Daniel Cryans 2013-06-19, 12:31
Kevin Odell 2013-06-19, 12:39
Jean-Daniel Cryans 2013-06-19, 12:46
Kevin Odell 2013-06-19, 12:48
Jesse Yates 2013-06-19
Re: Efficiently wiping out random data?
Chances are that data isn't completely "random". For instance, a user is
likely to have their id in the row key, so running a filtering major
compaction (with a custom scanner) would clean that up. With Sergey's
compaction stuff coming in you could break that out even further and only
have to compact a small set of files to get that removal.

So it's hard, but as it's not our direct use case, it's going to take a few
extra hoops.
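
For concreteness, a minimal sketch of the client-side half of that idea,
against the 0.94-era Java API. The "events" table and the "user1234|" row-key
prefix are made up, and it uses plain client Deletes rather than the custom
compaction scanner described above; the point is just that a prefixed row key
keeps the affected rows, and therefore the later compaction, confined to a
small key range.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteUserRows {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");        // hypothetical table
    byte[] prefix = Bytes.toBytes("user1234|");       // hypothetical key prefix

    // The user's rows share a prefix, so they sit in one contiguous key range
    // and the scan only touches the few regions that cover that range.
    Scan scan = new Scan(prefix);
    scan.setFilter(new PrefixFilter(prefix));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        table.delete(new Delete(r.getRow()));         // writes a tombstone only
      }
    } finally {
      scanner.close();
    }
    table.close();
    // The dead cells stay on disk until the affected regions are major
    // compacted; a sketch of that step follows the quoted thread below.
  }
}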

On Wednesday, June 19, 2013, Kevin O'dell wrote:

> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>
> > That sounds like a very effective way for developers to kill clusters
> > with compactions :)
> >
> > J-D
> >
> > On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <[EMAIL PROTECTED]> wrote:
> > > JD,
> > >
> > >    What about adding a flag for the delete, something like -full or
> > > -true (it is early).  Once we issue the delete to the proper row/region
> > > we run a flush, then execute a single region major compaction.  That
> > > way, if it is a single record, or a subset of data the impact is
> > > minimal.  If the delete happens to hit every region we will compact
> > > every region (not ideal).  Another thought would be an overwrite, but
> > > with versions this logic becomes more complicated.
> > >
> > >
> > > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> > >
> > >> Hey devs,
> > >>
> > >> I was presenting at GOTO Amsterdam yesterday and I got a question
> > >> about a scenario that I've never thought about before. I'm wondering
> > >> what others think.
> > >>
> > >> How do you efficiently wipe out random data in HBase?
> > >>
> > >> For example, you have a website and a user asks you to close their
> > >> account and get rid of the data.
> > >>
> > >> Would you say "sure can do, lemme just issue a couple of Deletes!" and
> > >> call it a day? What if you really have to delete the data, not just
> > >> mask it, because of contractual obligations or local laws?
> > >>
> > >> Major compacting is the obvious solution but it seems really
> > >> inefficient. Let's say you've got some truly random data to delete and
> > >> it happens so that you have at least one row per region to get rid
> > >> of... then you need to basically rewrite the whole table?
> > >>
> > >> My answer was such, and I told the attendee that it's not an easy use
> > >> case to manage in HBase.
> > >>
> > >> Thoughts?
> > >>
> > >> J-D
> > >>
> > >
> > >
> > >
> > > --
> > > Kevin O'Dell
> > > Systems Engineer, Cloudera
> >
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
--
-------------------
Jesse Yates
@jesse_yates
jyates.github.com
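
Kevin's flush-then-compact-one-region suggestion from the quoted thread can
already be approximated from the client, without a new delete flag; here's a
minimal sketch against the 0.94-era API, with the "users" table and the row
key again made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class HardDeleteOneRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");          // hypothetical table
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] row = Bytes.toBytes("user1234");            // hypothetical row key

    // 1. Issue the normal Delete; this only writes a tombstone.
    table.delete(new Delete(row));

    // 2. Flush and major-compact just the region holding that row, so the
    //    tombstone and the dead cells are rewritten out of that one region's
    //    files instead of compacting the whole table.
    HRegionLocation loc = table.getRegionLocation(row);
    String regionName = loc.getRegionInfo().getRegionNameAsString();
    admin.flush(regionName);
    admin.majorCompact(regionName);  // asynchronous request to the region server

    table.close();
    admin.close();
  }
}

As noted in the thread, this still rewrites the entire region, and if the
deletes are spread across every region you are back to compacting the whole
table.
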
Later replies in this thread:
Todd Lipcon 2013-06-19, 16:27
Ian Varley 2013-06-19, 18:28
Matt Corgan 2013-06-19, 21:15
lars hofhansl 2013-06-20, 09:35
Jean-Marc Spaggiari 2013-06-20, 12:39
Ian Varley 2013-06-23, 18:53
Andrew Purtell 2013-06-23, 22:32
Andrew Purtell 2013-06-23, 22:31
Jean-Daniel Cryans 2013-06-24, 17:58