|
Alex Baranau
2012-07-09, 19:36
Amandeep Khurana
2012-07-09, 19:38
Alex Baranau
2012-07-09, 19:41
Jonathan Hsieh
2012-07-09, 19:44
Alex Baranau
2012-07-09, 20:05
Jonathan Hsieh
2012-07-10, 12:10
Stack
2012-07-11, 12:51
Alex Baranau
2012-07-11, 14:09
|
-
Can manually remove HFiles (similar to bulk import, but bulk remove)?
Alex Baranau 2012-07-09, 19:36
Hello,
I wonder: for purging old data, if I'm OK with the "remove all StoreFiles which are older than ..." approach, can I do that? To me it seems like this could be a very effective way to remove old data, similar to the fast bulk import functionality, but for deletion.

Thank you,
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Amandeep Khurana 2012-07-09, 19:38
I _think_ you should be able to do it and be just fine but you'll need to shut down the region servers before you remove and start them back up after you are done. Someone else closer to the internals can confirm/deny this.
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Alex Baranau 2012-07-09, 19:41
Heh, this is what I want to avoid actually: restarting RSs.
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Jonathan Hsieh 2012-07-09, 19:44
You could set your TTLs and trigger a major compaction ...

Or (this is pretty advanced) you can probably do it without taking down the RSs by:
1) closing the region in the hbase shell
2) deleting the file in the shell
3) reopening the region in the hbase shell

Jon.
--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
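
As a rough illustration of the three steps above, here is a minimal Python sketch that only builds the ordered list of commands for one region; it does not run anything. It assumes that the HBase shell commands are `close_region` and `assign` (as in the 0.9x-era shell) and that "deleting the file" means `hadoop fs -rm` against HDFS. The function name `purge_plan` and the `hbase shell:` prefix are made up for illustration, and the whole thing should be tried on a test cluster first.

```python
from typing import List


def purge_plan(region_name: str, hfile_paths: List[str]) -> List[str]:
    """Return the ordered commands for a close / delete / reopen cycle.

    Nothing is executed here; the strings are meant to be reviewed and then
    run by hand. Assumed commands (not confirmed by the thread):
    `close_region` and `assign` in the HBase shell, `hadoop fs -rm` for HDFS.
    """
    if not hfile_paths:
        return []  # nothing to delete, so leave the region alone
    plan = [f"hbase shell: close_region '{region_name}'"]
    plan += [f"hadoop fs -rm {path}" for path in hfile_paths]
    plan.append(f"hbase shell: assign '{region_name}'")
    return plan
```

The deletes sit strictly between the close and the reopen, so the region server never has the removed files open.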
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Alex Baranau 2012-07-09, 20:05
Hey, this is closer!

However, I think I'd want to avoid major compaction. In fact, I was thinking about avoiding any compactions and splitting. E.g. say I process some amount of data every hour (e.g. with an MR job); the output is written as a set of HFiles and added to be served by HBase. At the same time I care to keep only 1 week of data. In that case, ideally, I'd like to do the following:
* pre-split the table into N regions, to be evenly distributed over the cluster
* turn off minor/major compactions (it is OK for me to have 24*7 HFiles per region, given one CF, and I know they will not exceed the region max size)
* periodically remove HFiles older than one week

By setting up the table like this, I'd avoid unnecessary split operations, compactions, and region moves (i.e. avoid redundant IO/CPU and, hopefully, breaking data locality).

So, you are saying that major compaction will look at the max/min ts metainfo of the HFile and will remove the whole file based on TTL if necessary (without going through the file)? Can I tell it not to actually compact other HFiles (i.e. leave them as is, otherwise it would not be as easy to remove HFiles again in an hour)? I.e. it looks like "delete only whole HFiles based on TTL" functionality is what I need here.

I fear that complexity with removing HFiles can be caused by the (block) cache that may hold its information. Is that right? I'm actually OK with HBase returning me the data of files I "deleted" by removing HFiles: I will specify a timerange on scans anyway (in this example, to omit things older than 1 week).

Alex Baranau
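
The "remove whole HFiles older than one week" rule described in this message can be sketched in a few lines of Python. This is only an illustration under assumptions: the per-file maximum timestamp is taken as already known (the thread does not say how it is read), timestamps are in milliseconds, and `expired_hfiles` is a hypothetical helper, not an HBase API.

```python
from typing import Dict, List

HOUR_MS = 60 * 60 * 1000
WEEK_MS = 7 * 24 * HOUR_MS


def expired_hfiles(max_ts_by_file: Dict[str, int], now_ms: int,
                   retention_ms: int = WEEK_MS) -> List[str]:
    """Return files whose newest cell is older than the retention window.

    A file is dropped only as a whole: if even its newest timestamp is past
    the cutoff, every cell in it is too.
    """
    cutoff = now_ms - retention_ms
    return sorted(path for path, max_ts in max_ts_by_file.items()
                  if max_ts < cutoff)


# Steady-state file count per region for hourly loads kept for one week.
FILES_PER_REGION = WEEK_MS // HOUR_MS  # 24 * 7 = 168
```

Comparing against the maximum timestamp rather than the minimum means a file that still holds any in-window data is kept.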
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Jonathan Hsieh 2012-07-10, 12:10
On Mon, Jul 9, 2012 at 1:05 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> Hey, this is closer!
>
> However, I think I'd want to avoid major compaction. In fact I was thinking about avoiding any compactions & splitting.
> ...
> So, you are saying that major compaction will look at max/min ts metainfo of the HFile and will remove the whole file based on ttl if necessary (without going through the file)? Can I tell it not to actually compact other HFiles (i.e. leave them as is, otherwise it would be not as easy to remove HFiles again in an hour)? I.e. looks like "delete only whole HFiles based on TTL" functionality is what I need here.

Off the top of my head, I don't know how "smart" the major compaction code is with respect to TTLs. I'm pretty sure it isn't smart enough to explicitly ignore specific files.

> I fear that complexity with removing HFiles can be caused by (block) cache that may hold its information. Is that right? I'm actually OK with HBase to return me the data of files I "deleted" by removing HFiles: I will specify timerange on scans anyways (in this example to omit things older than 1 week).

I'm not sure what the block cache eviction policy is when a single region is closed, but it sounds like you are OK if stale data remains.

Sounds like you might want to try the close/delete/open advanced approach on a test cluster to see if it meets your needs.

Jon.
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Stack 2012-07-11, 12:51
On Mon, Jul 9, 2012 at 10:05 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> I fear that complexity with removing HFiles can be caused by (block) cache that may hold its information. Is that right? I'm actually OK with HBase to return me the data of files I "deleted" by removing HFiles: I will specify timerange on scans anyways (in this example to omit things older than 1 week).

I think this is a use case we should support natively. Someone around the corner from us was looking to do this. They load a complete dataset each night, and on the weekends they want to just drop the old stuff by removing the hfiles older than N days.

You could script it now. Look at the hfiles in hdfs -- they have sufficient metadata IIRC -- and then do the prescription Jon suggests above of close, remove, and reopen. We could add an API to do this; i.e. reread hdfs for hfiles (it would be nice to do it 'atomically', telling the new API which to drop).

You bring up block cache. That should be fine. We shouldn't be reading blocks for files that are no longer open. Old blocks should get aged out.

On compaction dropping complete hfiles if they are outside TTL, I'm not sure we have that (didn't look too closely).

St.Ack
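
One possible starting point for such a script, sketched in Python: parse the text printed by `hadoop fs -ls` for a column family directory and pick the files last modified before a cutoff. Note the assumption: this uses the HDFS modification time as a stand-in for the HFile's own timestamp metadata, which is not the same thing and is not what the thread prescribes. The helper names are hypothetical, and paths containing spaces are not handled.

```python
from datetime import datetime
from typing import List, Tuple


def parse_ls(listing: str) -> List[Tuple[str, datetime]]:
    """Parse `hadoop fs -ls` text into (path, modification time) pairs."""
    files = []
    for line in listing.splitlines():
        fields = line.split()
        # File rows have 8 fields and start with '-'; skip the
        # "Found N items" header and directory rows (starting with 'd').
        if len(fields) != 8 or not fields[0].startswith("-"):
            continue
        mtime = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        files.append((fields[7], mtime))
    return files


def older_than(listing: str, cutoff: datetime) -> List[str]:
    """Return paths whose HDFS modification time is before the cutoff."""
    return [path for path, mtime in parse_ls(listing) if mtime < cutoff]
```

The resulting paths would then go through the close, remove, and reopen steps for the region that owns them.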
-
Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Alex Baranau 2012-07-11, 14:09
Thank you guys for the pointers/info! I'll try to make use of it. If it turns into something re-usable (like a script, etc.), I will open a JIRA issue and add it for others to use.

Thanks again,
Alex Baranau