Re: How to prevent major compaction when doing bulk load provisioning?
You are likely just hitting the threshold for a minor compaction, and
because the compaction picks up all the files (I'm guessing that it
does), it gets upgraded to a major compaction. The threshold is 3 files
by default.

So after loading 3 files you should get a compaction per region, and
then every 2 further loads you will trigger another one per region.
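
If you want to delay that, the knob to look at is
hbase.hstore.compactionThreshold. A quick sketch (untested; written as
Java only to spell out the keys, since in practice these go in the
region servers' hbase-site.xml and need a restart to take effect):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class CompactionKnobs {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Default is 3: once a store has 3 files it is eligible for a
      // minor compaction, and if that compaction ends up selecting
      // every file it is promoted to a major one.
      conf.setInt("hbase.hstore.compactionThreshold", 10);
      // Cap on how many files a single compaction will pick up:
      conf.setInt("hbase.hstore.compaction.max", 10);
    }
  }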

It seems to me that it would be better if you were able to do a single
load for all your files.
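
Something like this, roughly (a sketch only: the input path and job name
are made up, and the mapper/HFileOutputFormat setup would stay as in the
example repo you linked):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class SingleBulkLoad {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "bulk-load-all-csv");
      // One glob instead of 1000 jobs: each region's HFiles get written
      // once, so far fewer store files pile up between compactions.
      FileInputFormat.setInputPaths(job, new Path("/data/synthetic/*.csv"));
      // ... mapper, reducer and HFileOutputFormat setup as in the example ...
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }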

J-D

On Thu, Mar 21, 2013 at 6:29 AM, Nicolas Seyvet
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> We are using code similar to
> https://github.com/jrkinley/hbase-bulk-import-example/ in order to
> benchmark our HBase cluster.  We are running a CDH4 installation, and HBase
> is version 0.92.1-cdh4.1.1.  The cluster is composed of 12 slaves, 1
> master, and 1 secondary master.
>
> During the bulk load insert, roughly within 3 hours after the start
> (~200 GB), we notice a large drop in the insert rate. At the same
> time, there is a spike in IO and CPU usage.  Connecting to a Region
> Server (RS), the Monitored Tasks section shows that a compaction has started.
>
> I have set hbase.hregion.max.filesize to 107374182400 (100 GB) and
> disabled automatic major compactions: hbase.hregion.majorcompaction is set to 0.
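>
> In code form, just to spell out the exact keys we touched (the real
> values live in our region servers' hbase-site.xml):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>
>   public class OurSettings {
>     public static void main(String[] args) {
>       Configuration conf = HBaseConfiguration.create();
>       conf.setLong("hbase.hregion.max.filesize", 107374182400L); // 100 GB
>       conf.setLong("hbase.hregion.majorcompaction", 0L);         // timed majors off
>     }
>   }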
>
> What we are doing: we have 1000 files of synthetic data (CSV), where
> each row in a file is one row to insert into HBase; each file contains
> 600K rows (or 600K events).  Our loader works in the following way
> (sketched in code after the list):
> 1. Look for a file
> 2. When a file is found, prepare a job for that file
> 3. Launch job
> 4. Wait for completion
> 5. Compute insert rate (number of rows / time)
> 6. Repeat from 1 until there are no more files.
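>
> Roughly, as code (simplified; findNextFile, prepareJobFor and
> elapsedSeconds are placeholders for our real helpers):
>
>   Path csv;
>   while ((csv = findNextFile()) != null) {        // 1. look for a file
>     Job job = prepareJobFor(csv);                 // 2. prepare a job for it
>     job.waitForCompletion(true);                  // 3+4. launch, wait
>     double rate = 600000.0 / elapsedSeconds(job); // 5. rows / time
>     System.out.println(csv + ": " + rate + " rows/s");
>   }                                               // 6. until no more files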
>
> What I understand of the bulk load M/R job is that it produces one HFile
> for each Region.
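>
> (If I read the example right, that comes from
> HFileOutputFormat.configureIncrementalLoad(), which inspects the
> table's region boundaries and sets one reducer per region, e.g. the
> table name here is made up:)
>
>   HTable table = new HTable(conf, "benchmark");
>   HFileOutputFormat.configureIncrementalLoad(job, table);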
>
> Questions:
> - How is HStoreFileSize calculated?
> - What do HStoreFileSize, storeFileSize and hbase.hregion.max.filesize have
> in common?
> - Can the number of HFiles trigger a major compaction?
>
> Thx for the help.  I hope my questions make sense.
>
> /Nicolas
Replies in this thread:
  ramkrishna vasudevan  2013-03-21, 18:05
  Nicolas Seyvet        2013-03-21, 17:52
  Nicolas Seyvet        2013-03-21, 19:06
  Nicolas Seyvet        2013-03-22, 07:12
  Jean-Daniel Cryans    2013-03-22, 16:32
  Jean-Daniel Cryans    2013-03-21, 20:21
  Ted Yu                2013-03-21, 20:05
  Ted Yu                2013-03-21, 17:10
  Amit Sela             2013-03-21, 16:47