-
Data management strategy
Richard Lawrence 2011-12-21, 18:03
Hi,
I was wondering if I could seek some advice about data management in HBase. I plan to use HBase to store data that has a variable-length lifespan: the vast majority will be short-lived, but occasionally the lifetime will be significantly longer (3 days versus 3 months). Once the lifespan is over, I need the data to be deleted at some point in the near future (within a few days is fine). I don’t think I can use the standard TTL for this because that’s fixed at the column family level. Therefore, my plan was to run a script every few days that looks through external information for what needs to be kept and then updates HBase in some way so that it knows what to keep. With that information in HBase, I can then use the standard TTL mechanism to clean up.
The two ways I can think of to let HBase know are:
1. Add a co-processor that updates the timestamp on each read, and then have my process simply read the data. I shied away from this because the documentation indicated that the co-processor can’t take row locks. Does that imply that it shouldn’t modify the underlying data? For my use case the timestamp doesn’t have to be perfect; the keys are created in such a way that the underlying data is fixed at creation time.
2. Add an extra column to each row that’s a cache flag and rewrite it at various intervals so that the timestamp updates and prevents the TTL from deleting it.
Are there other best-practice alternatives?
Thanks,
Richard
-
Re: Data management strategy
Bryan Beaudreault 2011-12-22, 06:41
The TTL is per column family, but I think you could still manipulate it further. I have no idea if this will work in practice, but I've had success using versions/timestamps for other reasons in the past and this idea just came to me. YMMV.
Determine the maximum amount of time you'll ever want to keep data around, and set the column family TTL to that. You mentioned 3 months, so let's use that (say 90 days). The timestamps of cell versions are generated automatically by HBase as System.currentTimeMillis(), but you can easily set them to something else instead. If you know at insertion time how long some data should stick around, set the timestamp of the put, with org.apache.hadoop.hbase.client.Put.add(byte[] family, byte[] qualifier, long ts, byte[] value), to System.currentTimeMillis() - 90 days + <real TTL>. The family TTL then expires the cell <real TTL> after insertion, so you now have per-cell TTLs, so to speak.
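The timestamp arithmetic above can be sketched as follows. This is a minimal illustration, not tested against a cluster: it assumes the column family TTL is set to the longest lifetime you will ever need, and the class and method names (PerCellTtl, backdatedTimestamp, expiryTime) are hypothetical. Only the long value is computed here; it would be passed as the ts argument of Put.add.

```java
/** Hypothetical helper for the "backdated timestamp" idea; not part of HBase. */
final class PerCellTtl {
    private PerCellTtl() {}

    /**
     * Timestamp to pass as ts to Put.add(family, qualifier, ts, value) so that a
     * cell written at nowMillis expires after realTtlMillis, ASSUMING the column
     * family TTL equals maxTtlMillis.
     */
    static long backdatedTimestamp(long nowMillis, long maxTtlMillis, long realTtlMillis) {
        if (realTtlMillis < 0 || realTtlMillis > maxTtlMillis) {
            throw new IllegalArgumentException("real TTL must be between 0 and the family TTL");
        }
        // Pretend the cell is already (max - real) old, so only 'real' lifetime remains.
        return nowMillis - maxTtlMillis + realTtlMillis;
    }

    /** Moment at which a family TTL of maxTtlMillis expires a cell with this timestamp. */
    static long expiryTime(long cellTimestampMillis, long maxTtlMillis) {
        return cellTimestampMillis + maxTtlMillis;
    }
}
```

One thing to check when testing: versions are ordered by timestamp, so a later write with a short lifetime gets an older timestamp than an earlier write with a long lifetime and may be hidden behind it.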
Like I said, I'd test that this will actually work, and maybe someone else can chime in on whether this sort of version "abuse" would be frowned upon, but I think it may get the job done :).
- Bryan
-
Re: Data management strategy
Michel Segel 2011-12-22, 13:21
Richard,
Let's see if I understand what you want to do...
You have some data and you want to store it in some table A. Some of the records/rows in this table have a limited life span of 3 days, others have a limited life span of 3 months. But both are the same records? By this I mean that both records contain the same type of data, but there is some business logic that determines which record gets deleted (like purge all records that haven't been accessed in the last 3 days).
If what I imagine is true, you can't use the standard TTL unless you know that after a set N hours or days the record will be deleted. Like all records will self destruct 30 days past creation.
The simplest solution would be to have a column that contains a timestamp of last access, with your application controlling when this field gets updated. Then, using cron, launch a job that scans the table and removes the rows that meet your delete criteria.
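The selection step of such a job might look like the sketch below. It is deliberately HBase-free so it can be run on its own: the map stands in for the (row key, last-access timestamp) pairs a table scan would return, and the returned keys are the rows the job would then delete. The class name, method name, and the single maxIdleMillis threshold are assumptions for illustration; the real delete criteria may differ per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical selection logic for a cron-driven purge job; not an HBase API. */
final class PurgeSelector {
    private PurgeSelector() {}

    /**
     * Returns the row keys whose last access is more than maxIdleMillis before
     * nowMillis, in the iteration order of the input map.
     */
    static List<String> rowsToDelete(Map<String, Long> lastAccessByRow,
                                     long nowMillis, long maxIdleMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastAccessByRow.entrySet()) {
            long idleMillis = nowMillis - entry.getValue();
            if (idleMillis > maxIdleMillis) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

Keeping the criteria in one small function like this makes it easy to change the rule later without touching the scan or delete plumbing.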
Since co-processors are new... Not yet in any of the commercial releases, I would suggest keeping the logic simple. You can always refactor your code to use Co-processors when you've had time to play with them.
Even with coprocessors, because the data dies an arbitrary death, you will still have to purge the data yourself. Hence the cron job that marks the records for deletion and then runs a major compaction on the table to really delete the rows...
Of course the standard caveats apply, assuming I really did understand what you wanted...
Oh and KISS is always the best practice... :-)
Sent from a remote device. Please excuse any typos...
Mike Segel
-
Re: Data management strategy
Andrew Purtell 2011-12-22, 23:09
> I plan to use HBase to store data that has a variable length lifespan
> [...]
Indeed, the simplest approach is usually best.
The simplest way to manage automatic expiration of data over various lifetimes, especially if there are only a few of them, as in your case (3 days versus 3 months): create a column family for each lifetime, each with its own TTL, and store data into the appropriate family. A Get or Scan that includes the families you need will retrieve all of the nonexpired data in the row in those families.
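A minimal sketch of the write-side routing, assuming two families with hypothetical names "d3" and "d90" (treating 3 months as 90 days), could look like this. The TTL values mirror what would be configured on each family; nothing here calls HBase.

```java
/** Hypothetical mapping from required retention to a column family; names are illustrative. */
final class LifetimeFamilies {
    private LifetimeFamilies() {}

    /** Declared in ascending TTL order; forRetention relies on that order. */
    enum Lifetime {
        SHORT("d3", 3L * 24 * 60 * 60),    // assumed family with a 3-day TTL
        LONG("d90", 90L * 24 * 60 * 60);   // assumed family with a 90-day TTL

        final String family;
        final long ttlSeconds;

        Lifetime(String family, long ttlSeconds) {
            this.family = family;
            this.ttlSeconds = ttlSeconds;
        }
    }

    /** Picks the shortest-lived family whose TTL still covers the requested retention. */
    static Lifetime forRetention(long retentionSeconds) {
        for (Lifetime lifetime : Lifetime.values()) {
            if (retentionSeconds <= lifetime.ttlSeconds) {
                return lifetime;
            }
        }
        throw new IllegalArgumentException("no family keeps data for " + retentionSeconds + "s");
    }
}
```

The writer would use the chosen family name when building its Put; readers simply include both families.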
> I don’t think I can use standard TTL for this because that’s fixed at a column family level.
Is that really the case?
I had a use case once where most data was not useful after a couple of weeks, but some data occasionally needed to be promoted to permanent storage. It wasn't convenient to model the transient data and permanent data as separate entities. You might think TTLs couldn't be used for that. However, we created two column families; one with a TTL, one without; a very simple maintenance mapreduce job, run from crontab on the jobtracker, for copying from one to the other, and we were able to use filters to reduce the work this job needed to do; and a very thin presentation layer to give users the illusion that these entities were stored "in the same place" (we needed to give them a REST API anyway). This worked well. There was some modest penalty on read for accessing two stores instead of one, but the performance was within the bounds we needed.
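The thin presentation layer could be as simple as merging what was read from the two families into one view. The sketch below is an assumption about how such a layer might behave, not a description of the system above: the maps stand in for qualifier/value pairs read from each family, and letting the permanent copy win on conflicts is a choice made here for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical read-side merge of a TTL'd family and a permanent family. */
final class MergedView {
    private MergedView() {}

    /**
     * Combines cells from both families into a single qualifier -> value view.
     * If a qualifier exists in both, the permanent value is kept (an assumption).
     */
    static Map<String, String> merge(Map<String, String> transientCells,
                                     Map<String, String> permanentCells) {
        Map<String, String> view = new TreeMap<>(transientCells);
        view.putAll(permanentCells); // promoted data overrides the expiring copy
        return view;
    }
}
```

Callers of the REST API would only ever see the merged map, which is what gives the illusion of a single store.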
Best regards, - Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: Data management strategy
Richard Lawrence 2011-12-23, 02:18
You've understood correctly, Michel, and thank you for your suggestions. I think I'll take the second and do the TTL manually.
Andrew - I somewhat oversimplified my use case; happy to explain in full but it's probably OTT. I am intrigued by your idea and certainly hadn't thought of anything that "clever/devious" (both meant in good ways); I'm not sure I can use it for this problem but it's certainly something that I will bear in mind. I did think in terms of map reduce at first, but it seemed like the best I could do was write a huge file of valid IDs into Hadoop and then do a map-side join on them inside the job while iterating over the table. A simple reading/deleting client process seemed to simplify the operations for the first pass - there will only be a few million rows on 5 or 6 nodes.
Thanks for the advice, likely to have more questions soon!
Merry Christmas
Richard