HBase >> mail # user >> Re: How to prevent major compaction when doing bulk load provisioning?


Jean-Daniel Cryans 2013-03-21, 18:03
ramkrishna vasudevan 2013-03-21, 18:05
Nicolas Seyvet 2013-03-21, 17:52
Nicolas Seyvet 2013-03-21, 19:06
Nicolas Seyvet 2013-03-22, 07:12
Jean-Daniel Cryans 2013-03-22, 16:32
Re: How to prevent major compaction when doing bulk load provisioning?
On Thu, Mar 21, 2013 at 12:06 PM, Nicolas Seyvet
<[EMAIL PROTECTED]> wrote:
> @Ram: You are entirely correct, I made the exact same mistake of mixing up
> major and minor compaction.  By looking closely, what I see is that at
> around 200 HFiles per region it starts minor compacting files in groups of
> 10 HFiles.  The "problem" is that this minor compacting never stops, even
> when there are about 20 HFiles left.  It just keeps on going, taking
> more and more time (I guess because the files to compact are getting
> bigger).
>
> Of course in parallel we keep on adding more and more data.
>
> @J-D: "It seems to me that it would be better if you were able to do a
> single load for all your files." Yes, I agree, but that is not what we
> are testing; our use case is to load 1-minute batch files.
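
[Editor's note: the "groups of 10" behaviour described above lines up with the default compaction knobs in 0.94-era HBase. A reference fragment, with the stock defaults shown (these are not tuning recommendations):]

```xml
<!-- hbase-site.xml: compaction settings relevant to this thread (defaults) -->
<property>
  <!-- minimum number of HFiles in a store before a minor compaction runs -->
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
</property>
<property>
  <!-- maximum number of HFiles compacted in one pass: the "groups of 10" -->
  <name>hbase.hstore.compaction.max</name>
  <value>10</value>
</property>
<property>
  <!-- a file is only selected if it is at most this ratio times the
       total size of the newer files next to it -->
  <name>hbase.hstore.compaction.ratio</name>
  <value>1.2</value>
</property>
<property>
  <!-- interval for time-based major compactions, in ms (24h); setting
       this to 0 disables them, which is the usual answer to the
       subject-line question about preventing major compactions -->
  <name>hbase.hregion.majorcompaction</name>
  <value>86400000</value>
</property>
```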

I worked on a very similar use case recently and would recommend
against doing bulk loads like this. The way bulk-loaded files are
treated by the compaction selection algorithm is broken when loads are
done in a continuous fashion. The solution is in HBASE-7842[1], but it
is still being worked on.

What you are seeing is that the files picked up for compactions will
often include the bigger already-compacted files. As those files get
bigger, compactions will take longer and longer, up to a point where
the data that is selected for compaction is greater than your
compacting capacity.
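
[Editor's note: the re-selection J-D describes can be sketched with a
toy model of the ratio-based selection (the 0.94-era default policy,
simplified here; file sizes and the 90/10 split are made up for
illustration):]

```java
import java.util.ArrayList;
import java.util.List;

public class CompactionSketch {
    static final double RATIO = 1.2; // hbase.hstore.compaction.ratio default
    static final int MAX_FILES = 10; // hbase.hstore.compaction.max default

    // Simplified ratio-based selection: files are ordered oldest -> newest.
    // Walk from the oldest file, skipping any file whose size exceeds
    // RATIO times the combined size of all newer files, then take up to
    // MAX_FILES of what remains.
    static List<Long> select(List<Long> sizes) {
        int start = 0;
        while (start < sizes.size()) {
            long sumNewer = 0;
            for (int i = start + 1; i < sizes.size(); i++) {
                sumNewer += sizes.get(i);
            }
            if (sizes.get(start) <= RATIO * sumNewer) {
                break; // this file is not "too big"; keep it and the rest
            }
            start++; // file too large relative to newer files: skip it
        }
        int end = Math.min(start + MAX_FILES, sizes.size());
        return new ArrayList<>(sizes.subList(start, end));
    }

    public static void main(String[] args) {
        // One 90-unit file produced by an earlier compaction, followed by
        // twenty 10-unit bulk-loaded files that piled up behind it.
        List<Long> sizes = new ArrayList<>();
        sizes.add(90L);
        for (int i = 0; i < 20; i++) {
            sizes.add(10L);
        }
        List<Long> picked = select(sizes);
        // 90 <= 1.2 * 200, so the big already-compacted file is selected
        // and rewritten yet again, and grows with every round.
        System.out.println(picked.get(0)); // 90
        System.out.println(picked.size()); // 10
    }
}
```

With only a few small files behind it, the big file is excluded; once
enough bulk-loaded files accumulate, it is pulled into every compaction,
which is why each round takes longer than the last.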

The workaround is to use the normal write API, since flushed files are
selected for compaction properly, but it won't be as fast or efficient
as continuous bulk loading would be if the selection algorithm weren't
broken.

J-D

1. https://issues.apache.org/jira/browse/HBASE-7842
Ted Yu 2013-03-21, 20:05
Ted Yu 2013-03-21, 17:10
Amit Sela 2013-03-21, 16:47