HBase >> mail # dev >> Follow-up to my HBASE-4365 testing


Re: Follow-up to my HBASE-4365 testing
Yeah.  You would also want a mechanism to prevent queuing the same CF
multiple times, and probably want the completion of one compaction to
trigger a check to see if it should queue another.

A possibly different architecture than the current style of queues would be
to have each Store (all open in memory) keep a compactionPriority score up
to date after events like flushes, compactions, schema changes, etc.  Then
you create a "CompactionPriorityComparator implements Comparator<Store>"
and stick all the Stores into a PriorityQueue.  The async compaction
threads would keep pulling off the head of that queue as long as the head
has compactionPriority > X.
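
To make the idea concrete, here is a rough sketch of that architecture. It is not HBase code: the `Store` class below is a hypothetical stand-in, the "more files means higher priority" scoring rule is an assumption made for illustration, and `CompactionQueue`, `update` and `pollIfEligible` are made-up names. One thing the sketch has to work around is that `java.util.PriorityQueue` does not re-sort an element whose score changes while it is queued, so each update removes and re-adds the Store. That same step also keeps a Store from being queued more than once.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class CompactionSketch {

    /** Hypothetical stand-in for a Store; not the real HBase class. */
    static class Store {
        final String name;
        int fileCount; // refreshed after flushes, compactions, etc.

        Store(String name, int fileCount) {
            this.name = name;
            this.fileCount = fileCount;
        }

        /** Assumed scoring rule: more store files means more urgent. */
        int compactionPriority() {
            return fileCount;
        }
    }

    /** Orders Stores so the highest compactionPriority is at the head. */
    static class CompactionPriorityComparator implements Comparator<Store> {
        @Override
        public int compare(Store a, Store b) {
            return Integer.compare(b.compactionPriority(), a.compactionPriority());
        }
    }

    static class CompactionQueue {
        private final PriorityQueue<Store> queue =
                new PriorityQueue<>(new CompactionPriorityComparator());
        private final int threshold; // the "X" from the description above

        CompactionQueue(int threshold) {
            this.threshold = threshold;
        }

        /**
         * Call after a flush, compaction, or schema change. Removing before
         * re-adding re-sorts the Store and keeps it queued at most once.
         */
        synchronized void update(Store store, int newFileCount) {
            queue.remove(store);
            store.fileCount = newFileCount;
            queue.add(store);
        }

        /** Returns the head if its priority exceeds the threshold, else null. */
        synchronized Store pollIfEligible() {
            Store head = queue.peek();
            if (head == null || head.compactionPriority() <= threshold) {
                return null;
            }
            return queue.poll();
        }

        synchronized int size() {
            return queue.size();
        }
    }

    public static void main(String[] args) {
        CompactionQueue q = new CompactionQueue(3);
        Store a = new Store("cf-a", 0);
        Store b = new Store("cf-b", 0);
        Store c = new Store("cf-c", 0);
        q.update(a, 5);
        q.update(b, 50);
        q.update(c, 2);
        q.update(a, 8); // a second flush on cf-a does not queue it twice

        Store next;
        while ((next = q.pollIfEligible()) != null) {
            // A compaction thread would pick the files to compact here, at
            // execution time, and call update() again once it finishes.
            System.out.println("compact " + next.name + " files=" + next.fileCount);
        }
    }
}
```

Because a Store leaves the queue when a thread picks it up, the thread would call `update` with the new file count when its compaction completes, which is the "check to see if it should queue another" step.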
On Sat, Feb 25, 2012 at 3:44 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Interesting. So a compaction request would hold no information beyond the
> CF, really,
> but is just a promise to do a compaction as soon as possible.
> I agree with Ted, we should explore this in a jira.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Matt Corgan <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Saturday, February 25, 2012 3:18 PM
> Subject: Re: Follow-up to my HBASE-4365 testing
>
> I've been meaning to look into something regarding compactions for a while
> now that may be relevant here.  It could be that this is already how it
> works, but just to be sure I'll spell out my suspicions...
>
> I did a lot of large uploads when we moved to .92.  Our biggest dataset is
> time series data (partitioned 16 ways with a row prefix).  The actual
> inserting and flushing went extremely quickly, and the parallel compactions
> were churning away.  However, when the compactions inevitably started
> falling behind I noticed a potential problem.  The compaction queue would
> get up to, say, 40, which represented, say, an hour's worth of requests.
> The problem was that by the time a compaction request started executing,
> the CompactionSelection that it held was terribly out of date.  It was
> compacting a small selection (3-5) of the 50 files that were now there.
> Then the next request would compact another (3-5), etc, etc, until the
> queue was empty.  It would have been much better if a CompactionRequest
> decided what files to compact when it got to the head of the queue.  Then
> it could see that there are now 50 files needing compacting and possibly
> compact the 30 smallest ones, not just 5.  When the insertions were done
> after many hours, I would have preferred it to do one giant major
> compaction, but it sat there and worked through its compaction queue
> compacting all sorts of different combinations of files.
>
> Said differently, it looks like .92 picks the files to compact at
> compaction request time rather than compaction execution time which is
> problematic when these times grow far apart.  Is that the case?  Maybe
> there are some other effects that are mitigating it...
>
> Matt
>
> On Sat, Feb 25, 2012 at 10:05 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>
> > Hey guys,
> >
> > So in HBASE-4365 I ran multiple uploads and the latest one I reported
> > was a 5TB import on 14 RS and it took 18h with Stack's patch. Now one
> > thing we can see is that apart from some splitting, there's a lot of
> > compacting going on. Stack was wondering exactly how much that IO
> > costs us, so we devised a test where we could upload 5TB with 0
> > compactions. Here are the results:
> >
> > The table was pre-split with 14 regions, 1 per region server.
> > hbase.hstore.compactionThreshold=100
> > hbase.hstore.blockingStoreFiles=110
> > hbase.regionserver.maxlogs=64  (the block size is 128MB)
> > hfile.block.cache.size=0.05
> > hbase.regionserver.global.memstore.lowerLimit=0.40
> > hbase.regionserver.global.memstore.upperLimit=0.74
> > export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE -Xmx14G
> > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=256m
> > -XX:MaxNewSize=256m"
> >
> > The table had:
> >  MAX_FILESIZE => '549755813888', MEMSTORE_FLUSHSIZE => '549755813888'