HBase, mail # user - Re: How to prevent major compaction when doing bulk load provisioning?


Jean-Daniel Cryans 2013-03-21, 18:03
ramkrishna vasudevan 2013-03-21, 18:05
I remember another user who hit the same issue. He mistook the minor
compaction for a major compaction and said that the setting we provide to
disable major compaction was not working / not taking effect.

Regards
Ram

On Thu, Mar 21, 2013 at 11:33 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:

> You are likely just hitting the threshold for a minor compaction, and
> because it picks up all the files (I'm guessing it does) the compaction
> gets upgraded to a major one. The threshold is 3 by default.
>
> So after loading 3 files you should get a compaction per region, then
> every 2 further loads you will trigger another one per region.
>
> It seems to me that it would be better if you were able to do a single
> load for all your files.
>
> J-D
>
> On Thu, Mar 21, 2013 at 6:29 AM, Nicolas Seyvet
> <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > We are using code similar to
> > https://github.com/jrkinley/hbase-bulk-import-example/ in order to
> > benchmark our HBase cluster.  We are running a CDH4 installation, and
> > HBase is version 0.92.1-cdh4.1.1.  The cluster is composed of 12 slaves,
> > 1 master and 1 secondary master.
> >
> > During the bulk load insert, roughly within 3 hours of the start
> > (~200Gb), we notice a large drop in the insert rate. At the same time,
> > there is a spike in IO and CPU usage.  Connecting to a Region Server
> > (RS), the Monitored Tasks section shows that a compaction has started.
> >
> > I have set hbase.hregion.max.filesize to 107374182400 (100Gb), and to
> > disable automatic major compaction hbase.hregion.majorcompaction is set
> > to 0.
> >
> > What we are doing is that we have 1000 files of synthetic data (CSV),
> > where each row in a file is one row to insert into HBase; each file
> > contains 600K rows (or 600K events).  Our loader works in the following
> > way:
> > 1. Look for a file.
> > 2. When a file is found, prepare a job for that file.
> > 3. Launch the job.
> > 4. Wait for completion.
> > 5. Compute the insert rate (number of rows / time).
> > 6. Repeat from 1 until there are no more files.
> >
> > What I understand of the bulk load M/R job is that it produces one HFile
> > for each Region.
> >
> > Questions:
> > - How is HStoreFileSize calculated?
> > - What do HStoreFileSize, storeFileSize and hbase.hregion.max.filesize
> > have in common?
> > - Can the number of HFiles trigger a major compaction?
> >
> > Thanks for the help.  I hope my questions make sense.
> >
> > /Nicolas
>
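J-D's arithmetic above (a compaction after 3 loads, then another every 2 loads per region) can be sketched with a toy model. This is a hypothetical simulation, not HBase code: it assumes the default hbase.hstore.compactionThreshold of 3, that each bulk load adds exactly one HFile per region store, and that the compaction picks up all files and leaves a single file behind. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of per-store file counts under repeated bulk loads,
// assuming the default compaction threshold of 3 files.
public class CompactionModel {
    static final int COMPACTION_THRESHOLD = 3;

    // Returns the 1-based load numbers at which a compaction fires
    // for a single region store, over totalLoads bulk loads.
    public static List<Integer> compactionLoads(int totalLoads) {
        List<Integer> fires = new ArrayList<>();
        int files = 0;
        for (int load = 1; load <= totalLoads; load++) {
            files++;  // each bulk load adds one HFile to the store
            if (files >= COMPACTION_THRESHOLD) {
                // All files are selected, so the minor compaction is
                // promoted to a major one and rewrites them into one file.
                files = 1;
                fires.add(load);
            }
        }
        return fires;
    }

    public static void main(String[] args) {
        // First compaction at load 3, then one every 2 loads.
        System.out.println(compactionLoads(10));  // prints [3, 5, 7, 9]
    }
}
```

Under these assumptions, loading all 1000 files one job at a time triggers roughly one compaction per region every other load, which would match the observed IO/CPU spikes; a single load of all files would compact far less often.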
Nicolas Seyvet 2013-03-21, 17:52
Nicolas Seyvet 2013-03-21, 19:06
Nicolas Seyvet 2013-03-22, 07:12
Jean-Daniel Cryans 2013-03-22, 16:32
Jean-Daniel Cryans 2013-03-21, 20:21
Ted Yu 2013-03-21, 20:05
Ted Yu 2013-03-21, 17:10
Amit Sela 2013-03-21, 16:47