HBase, mail # dev - Efficiently wiping out random data?


Earlier replies in this thread (collapsed):
Jean-Daniel Cryans 2013-06-19, 12:31
Kevin Odell 2013-06-19, 12:39
Jean-Daniel Cryans 2013-06-19, 12:46
Kevin Odell 2013-06-19, 12:48
Jesse Yates 2013-06-19, 15:12
Todd Lipcon 2013-06-19, 16:27
Ian Varley 2013-06-19, 18:28
Matt Corgan 2013-06-19, 21:15
Re: Efficiently wiping out random data?
lars hofhansl 2013-06-20, 09:35
IMHO the "proper" of doing such things is encryption.

0-ing the values or even overwriting with a pattern typically leaves traces of the old data on a magnetic platter that can be retrieved with proper forensics. (Secure erase of SSD is typically pretty secure, though).
For such use cases, files (HFiles) should be encrypted and the decryption keys should just be forgotten at the appropriate times.
I realize that for J-D's specific use case doing this at the HFile level would be very difficult.

Maybe the KVs' values could be stored encrypted with a user-specific key. Deleting the user's data then means forgetting that user's key.
-- Lars
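
A minimal sketch of the per-user-key ("crypto-shredding") idea Lars describes, using only plain javax.crypto rather than anything HBase-specific; the in-memory map stands in for a real key store, and all class and method names here are hypothetical:

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Values are encrypted with a per-user AES key; "deleting" a user's data
 *  just means discarding that key, after which the ciphertext still sitting
 *  in HFiles (or on old platters) is unreadable. */
public class CryptoShredder {
  private final Map<String, SecretKey> keysByUser = new HashMap<>(); // stand-in for a real key store
  private final SecureRandom rng = new SecureRandom();

  public byte[] encryptValue(String userId, byte[] plainValue) throws Exception {
    SecretKey key = keysByUser.computeIfAbsent(userId, id -> newKey());
    byte[] iv = new byte[12];
    rng.nextBytes(iv);
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    byte[] ct = c.doFinal(plainValue);
    byte[] out = new byte[iv.length + ct.length];   // store the IV alongside the ciphertext
    System.arraycopy(iv, 0, out, 0, iv.length);
    System.arraycopy(ct, 0, out, iv.length, ct.length);
    return out;                                     // this is what would go into the KV value
  }

  /** Forgetting the key is the "delete": any remaining ciphertext is unrecoverable. */
  public void forgetUser(String userId) {
    keysByUser.remove(userId);
  }

  private SecretKey newKey() {
    try {
      KeyGenerator kg = KeyGenerator.getInstance("AES");
      kg.init(128);
      return kg.generateKey();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}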

________________________________
From: Matt Corgan <[EMAIL PROTECTED]>
To: dev <[EMAIL PROTECTED]>
Sent: Wednesday, June 19, 2013 2:15 PM
Subject: Re: Efficiently wiping out random data?
Would it be possible to zero-out all the value bytes for cells in existing
HFiles?  The keys would remain, but if you knew that ahead of time you
could design your keys so they don't contain important info.
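
A small, hypothetical illustration of that key-design point (assuming 0.94-era client classes): if the row key is an opaque SHA-256 hash of the user id, everything sensitive lives in the value bytes, which is exactly what would get zeroed out:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OpaqueRowKeys {
  /** Row key = SHA-256(userId): stable for lookups, but carries no readable user info. */
  static byte[] rowKeyFor(String userId) throws Exception {
    return MessageDigest.getInstance("SHA-256")
        .digest(userId.getBytes(StandardCharsets.UTF_8));
  }

  static Put putForUser(String userId, byte[] sensitiveValue) throws Exception {
    Put put = new Put(rowKeyFor(userId));
    // All sensitive bytes live in the value, which is what would be zeroed/compacted away.
    put.add(Bytes.toBytes("d"), Bytes.toBytes("profile"), sensitiveValue);
    return put;
  }
}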
On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <[EMAIL PROTECTED]> wrote:

> At least in some cases, the answer to that question ("do you even have to
> destroy your tapes?") is a resounding "yes". For some extreme cases (think
> health care, privacy, etc), companies do all RDBMS backups to disk instead
> of tape for that reason. (Transaction logs are considered different, I
> guess because they're inherently transient? Who knows.)
>
> The "no time travel" fix doesn't work, because you could still change that
> code or ACL in the future and get back to the data. In these cases, one
> must provably destroy the data.
>
> That said, forcing full compactions (especially if they can be targeted
> via stripes or levels or something) is an OK way to handle it, maybe
> eventually with more ways to nice it down so it doesn't hose your cluster.
>
> Ian
>
> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>
> I'd also question what exactly the regulatory requirements for deletion
> are. For example, if you had tape backups of your Oracle DB, would you have
> to drive to your off-site storage facility, grab every tape you ever made,
> and zero out the user's data as well? I doubt it, considering tapes have
> basically the same storage characteristics as HDFS in terms of inability to
> random write.
>
> Another example: deletes work the same way in most databases -- eg in
> postgres, deletion of a record just consists of setting a record's "xmax"
> column to the current transaction ID. This is equivalent to a tombstone,
> and you have to wait for a VACUUM process to come along and actually delete
> the record entry. In Oracle, the record will persist in a rollback segment
> for a configurable amount of time, and you can use a Flashback query to
> time travel and see it again. In Vertica, you also set an "xmax" entry and
> wait until the next merge-out (like a major compaction).
>
> Even in a filesystem, deletion doesn't typically remove data, unless you
> use a tool like srm. It just unlinks the inode from the directory tree.
>
> So, if any of the above systems satisfy their use case, then HBase ought to
> as well. Perhaps there's an ACL we could add which would allow/disallow
> users from doing time travel more than N seconds in the past..  maybe that
> would help allay fears?
>
> -Todd
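
For context on what "time travel" means here, a rough sketch against the 0.94-era client API (hypothetical names, and not the ACL Todd is floating): a time-range scan reads older cell versions, and today the only bounds on that are per-family VERSIONS and TTL settings that take effect at major compaction:

import java.io.IOException;

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeTravelLimits {
  /** The "time travel" being discussed: a scan that asks for cell versions
   *  from an older time window instead of the latest values. */
  static Scan scanAsOf(long oldestMs, long newestMs) throws IOException {
    Scan scan = new Scan();
    scan.setMaxVersions();                 // return every stored version...
    scan.setTimeRange(oldestMs, newestMs); // ...restricted to this window
    return scan;
  }

  /** Today's blunt instruments for bounding it: cap versions and TTL on the
   *  family so old cells become eligible for removal at major compaction.
   *  (The ACL idea would be an access-time check, not schema config.) */
  static HColumnDescriptor boundedFamily() {
    HColumnDescriptor fam = new HColumnDescriptor(Bytes.toBytes("d"));
    fam.setMaxVersions(1);
    fam.setTimeToLive(7 * 24 * 3600);      // seconds
    return fam;
  }
}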
>
> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>>wrote:
>
> Chances are that data isn't completely "random". For instance, a user is
> likely to have an id in their row key, so a filtering major compaction (with
> a custom scanner) would clean that up. With Sergey's
> compaction stuff coming in you could break that out even further and only
> have to compact a small set of files to get that removal.
>
> So it's hard, but as it's not our direct use case, it's gonna be a few extra
> hoops.
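
A rough client-side sketch of the same idea (assuming the user id is a rowkey prefix and 0.94-era client classes; the thread's actual proposal is to do this filtering inside the compaction itself): tombstone everything under the prefix, then let a major compaction physically drop the bytes:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PurgeUserRows {
  /** Tombstone every row under a user's rowkey prefix. The bytes only really
   *  go away once a major compaction rewrites the affected HFiles. */
  static void purge(Configuration conf, String table, byte[] userPrefix) throws IOException {
    HTable htable = new HTable(conf, table);
    try {
      Scan scan = new Scan(userPrefix);             // start at the prefix...
      scan.setFilter(new PrefixFilter(userPrefix)); // ...and stop once rows no longer match it
      List<Delete> deletes = new ArrayList<Delete>();
      ResultScanner results = htable.getScanner(scan);
      for (Result r : results) {
        deletes.add(new Delete(r.getRow()));
      }
      results.close();
      htable.delete(deletes);                       // writes tombstones, not an in-place wipe
    } finally {
      htable.close();
    }
  }

  public static void main(String[] args) throws IOException {
    purge(HBaseConfiguration.create(), "users", Bytes.toBytes("user1234|"));
  }
}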
>
> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>
> Yeah, the immutable nature of HDFS is biting us here.
Later replies in this thread (collapsed):
Jean-Marc Spaggiari 2013-06-20, 12:39
Ian Varley 2013-06-23, 18:53
Andrew Purtell 2013-06-23, 22:32
Andrew Purtell 2013-06-23, 22:31
Jean-Daniel Cryans 2013-06-24, 17:58