Re: Tune MapReduce over HBase to insert data
Gerrit Jansen van Vuuren 2013-01-08, 08:56
Note: if you have a huge amount of data, using bulk inserts is much faster than using Puts.
Regards, Gerrit
On Tue, Jan 8, 2013 at 7:04 AM, Farrokh Shahriari < [EMAIL PROTECTED]> wrote:
> Thanks Ted,
> How can I tune it? Can you tell me?
> I have not yet decided on upgrading. Does it give better performance for
> a MapReduce insert job?
>
> On Tue, Jan 8, 2013 at 9:18 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>> JVM
Re: Tune MapReduce over HBase to insert data
Ted Yu 2013-01-08, 06:36
Please take a look at http://hbase.apache.org/book.html#jvm
Section 12.2.3, "JVM Garbage Collection Logs" <http://hbase.apache.org/book.html#trouble.log.gc>, should be read as well.

There is a more recent effort to reduce GC activity, namely HBASE-7404, "Bucket Cache: A solution about CMS, Heap Fragment and Big Cache on HBASE". It is close to integration into trunk. You can expect a 0.94 backport down the road.

Cheers
Tune MapReduce over HBase to insert data
Farrokh Shahriari 2013-01-08, 05:05
Hi there,
I have a cluster of 12 nodes, each with 2 CPU cores. I want to insert a large amount of data, about 2 GB in 80 seconds (or 6 GB in 240 seconds). I've used MapReduce over HBase, but I can't reach that rate. I'd be glad if you could tell me what I can do to get better results, or which parameters I should configure or tune to improve MapReduce/HBase performance.
Thanks
Re: Tune MapReduce over HBase to insert data
Ted Yu 2013-01-08, 05:12
Have you read through http://hbase.apache.org/book.html#performance ?
What version of HBase are you using?

Cheers
Re: Tune MapReduce over HBase to insert data
Farrokh Shahriari 2013-01-08, 05:38
Hi again,
I'm using HBase 0.92.1-cdh4.0.0.
I have two server machines with 48 GB RAM, 12 physical cores and 24 logical cores, which host 12 nodes (6 nodes on each server). Each node has 8 GB RAM and 2 vCPUs.
I've set some parameters that give better results, such as turning the WAL off on Puts, but others, such as heap size and deferred log flush, don't help.
I also have another question: why do I get a different run time each time I run the MapReduce job, even though the configuration and hardware are unchanged?

Thank you, guys
Re: Tune MapReduce over HBase to insert data
lars hofhansl 2013-01-08, 08:02
What type of disks, and how many?
With the default replication factor, your 2 (or 6) GB are actually replicated 3 times.
6 GB/80 s = 75 MB/s, twice that if you do not disable the WAL, which a reasonable machine should be able to absorb.
The fact that deferred log flush does not help you seems to indicate that you're IO bound.

What's your memstore flush size? Potentially the data is written many times during compactions.

In your case you could dial down the HDFS replication, since you only have two physical machines anyway.
(Set it to 2. If you do not specify any failure zones, you might as well set it to 1... You will lose data if one of your server machines dies anyway.)

It does not really make that much sense to deploy HBase and HDFS on virtual nodes like this.

-- Lars
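Lars's estimate can be written out step by step. This is a back-of-the-envelope sketch, assuming decimal units (1 GB = 1000 MB) and assuming, as the message states, that leaving the WAL enabled roughly doubles the bytes written:

```latex
\begin{align*}
\text{bytes written} &= 2\ \text{GB} \times 3\ \text{replicas} = 6\ \text{GB} \\
\text{rate (WAL off)} &= \frac{6000\ \text{MB}}{80\ \text{s}} = 75\ \text{MB/s} \\
\text{rate (WAL on)} &\approx 2 \times 75\ \text{MB/s} = 150\ \text{MB/s}
\end{align*}
```

The alternative target of 6 GB in 240 s gives the same figure under the same assumptions: 18 GB / 240 s = 75 MB/s.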
Re: Tune MapReduce over HBase to insert data
Bing Jiang 2013-01-08, 15:28
In our experience, MapReduce insert performance can be improved by:
1. increasing the number of region server flush threads
2. increasing the memstore share of the JVM heap
3. pre-splitting the table into regions before running the MapReduce job
4. increasing the number of large and small compaction threads

Please correct me if I'm wrong, or if there are any better ideas.
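Item 2 of Bing Jiang's list (memstore relative to JVM heap) is, as far as I can tell, controlled by region server settings in hbase-site.xml. The fragment below is a minimal sketch, assuming the property names used by HBase 0.92/0.94; the values are illustrative placeholders, not recommendations from the thread, and should be checked against the hbase-default.xml of the version in use:

```xml
<!-- hbase-site.xml fragment (illustrative values only) -->
<property>
  <!-- Fraction of the region server heap that all memstores together may use
       before updates are blocked and flushes are forced. -->
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.45</value>
</property>
<property>
  <!-- Size in bytes at which a single region's memstore is flushed to disk
       (134217728 bytes = 128 MB). -->
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>
```

Raising the memstore share leaves less heap for the block cache and everything else, so on 8 GB nodes any change here is worth measuring rather than assuming.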
Re: Tune MapReduce over HBase to insert data
Asaf Mesika 2013-01-08, 19:01
Start by testing HDFS throughput by doing a simple copyFromLocal using the Hadoop command-line shell (bin/hadoop fs -copyFromLocal pathTo8GBFile /tmp/dummyFile1). If you have a 1000 Mbit/sec network between the computers, you should get around 75 MB/sec.
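As a rough cross-check of Asaf's figure, here is the unit conversion written out. It is a sketch that assumes decimal units and ignores protocol overhead and disk limits:

```latex
\begin{align*}
\text{line rate} &= \frac{1000\ \text{Mbit/s}}{8\ \text{bit/byte}} = 125\ \text{MB/s} \\
\frac{75\ \text{MB/s}}{125\ \text{MB/s}} &= 0.6 \quad \text{(about 60\% of line rate)} \\
\text{copy time for 8 GB} &\approx \frac{8000\ \text{MB}}{75\ \text{MB/s}} \approx 107\ \text{s}
\end{align*}
```

Under these assumptions, an 8 GB copyFromLocal that takes much longer than about two minutes would point at HDFS or the network rather than at HBase.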
Re: Tune MapReduce over HBase to insert data
Farrokh Shahriari 2013-01-13, 05:29
Thank you guys, let me change these configurations and test the MapReduce job again.
Re: Tune MapReduce over HBase to insert data
Anoop John 2013-01-13, 06:45
Hi,
Have you considered using HFileOutputFormat? You are using TableOutputFormat now, which makes put calls to HTable. With HFileOutputFormat, the MapReduce job writes the HFiles directly (no flushes, no compactions). Afterwards you need to load the HFiles into the regions using LoadIncrementalHFiles. This may help you.

-Anoop-
Re: Tune MapReduce over HBase to insert data
Ted Yu 2013-01-08, 05:48
Have you tuned the JVM parameters of HBase?
If you have Ganglia, did you observe high variation in network latency on the 6 nodes?

HBase 0.92.2 has been released. Do you plan to upgrade to 0.92.2 or 0.94.3?

Cheers
Re: Tune MapReduce over HBase to insert data
Bing Jiang 2013-01-13, 09:31
Hi Anoop,
Why doesn't the HBase mapreduce package contain tools like this?
Re: Tune MapReduce over HBase to insert data
Ted Yu 2013-01-13, 15:30
Both HFileOutputFormat and LoadIncrementalHFiles are in the mapreduce package.

Cheers
Re: Tune MapReduce over HBase to insert data
Farrokh Shahriari 2013-01-14, 05:58
Bing Jiang, what do you mean by increasing the compaction thread number? In hbase-site.xml we have compactionqueuesize and compactionthreshold, but not the parameter you mentioned.
I'd be grateful for your guidance.
-
Re: Tune MapReduce over HBase to insert data
Bing Jiang 2013-01-15, 01:33
Hi, mohandes.zebeleh
You can adjust the parameters below (major compaction, minor compaction, split). If you do not set them, each keeps its default value (1).

<property>
  <name>hbase.regionserver.thread.compaction.large</name>
  <value>5</value>
</property>
<property>
  <name>hbase.regionserver.thread.compaction.small</name>
  <value>10</value>
</property>
<property>
  <name>hbase.regionserver.thread.split</name>
  <value>5</value>
</property>

Regards!

Bing

2013/1/14 Farrokh Shahriari <[EMAIL PROTECTED]>

> Bing Jiang, What do you mean by add compaction thread number ? Because, in Hbase-site.xml we have compactionqueuesize or compactionthreshold but not the parameter that you have said.
>
> Thanks you if you guide me.

Bing Jiang
Tel:(86)134-2619-1361
weibo: http://weibo.com/jiangbinglover
BLOG: http://blog.sina.com.cn/jiangbinglover
National Research Center for Intelligent Computing Systems
Institute of Computing technology
Graduate University of Chinese Academy of Science
-
Re: Tune MapReduce over HBase to insert data
Farrokh Shahriari 2013-01-16, 11:20
I've noticed that if I comment out the write command in the Map function ( Context.write(row,put) ), the job takes just 40 sec. The difference is about 30 seconds, which seems weird to me. What do you think?

The parameters that have been useful up to now:
  hbase.hstore.blockingStoreFiles => 20
  hbase.hregion.memstore.block.multiplier => 4
  hbase.hregion.memstore.flush.size => 1073741824
  speculative.execution => false
  wal => false

Should I change these two parameters: io.sort.mb & io.sort.factor ?
Mohandes
On Tue, Jan 15, 2013 at 5:03 AM, Bing Jiang <[EMAIL PROTECTED]> wrote:

> Hi, mohandes.zebeleh
> you can adjust parameter as below( Major Compaction, Minor Compaction, Split):
> if you do not set, it will retain default value(1).
>
> <property>
>   <name>hbase.regionserver.thread.compaction.large</name>
>   <value>5</value>
> </property>
> <property>
>   <name>hbase.regionserver.thread.compaction.small</name>
>   <value>10</value>
> </property>
> <property>
>   <name>hbase.regionserver.thread.split</name>
>   <value>5</value>
> </property>
>
> Regards!
>
> Bing
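A rough check on the memstore settings listed in the message above can be done in plain Java. The sketch assumes that updates to a region are blocked once its memstore reaches hbase.hregion.memstore.block.multiplier times hbase.hregion.memstore.flush.size, and that the region server also caps the total memstore size at a fraction of its heap. The 4 GiB heap and the 0.4 fraction used here are assumptions for illustration only (the thread does not state either value), and the class and method names are hypothetical.

```java
// Illustrative arithmetic only; not an HBase API. Names are hypothetical.
final class MemstoreBudget {

    /** Per-region memstore size at which updates would be blocked. */
    static long blockingThresholdBytes(long flushSizeBytes, int blockMultiplier) {
        return flushSizeBytes * blockMultiplier;
    }

    /** Cap on the sum of all memstores of one region server. */
    static long globalMemstoreLimitBytes(long heapBytes, double heapFraction) {
        return (long) (heapBytes * heapFraction);
    }

    /** How many regions can fill a whole flush-size memstore under the global cap. */
    static long regionsAtFlushSize(long globalLimitBytes, long flushSizeBytes) {
        return globalLimitBytes / flushSizeBytes;
    }

    public static void main(String[] args) {
        long flushSize = 1073741824L;   // hbase.hregion.memstore.flush.size from the thread
        int multiplier = 4;             // hbase.hregion.memstore.block.multiplier from the thread
        long heap = 4L << 30;           // ASSUMPTION: 4 GiB region server heap
        double fraction = 0.4;          // ASSUMPTION: global memstore fraction of heap

        long global = globalMemstoreLimitBytes(heap, fraction);
        System.out.println("per-region blocking threshold: " + blockingThresholdBytes(flushSize, multiplier) + " bytes");
        System.out.println("global memstore limit:         " + global + " bytes");
        System.out.println("regions able to reach flush size: " + regionsAtFlushSize(global, flushSize));
    }
}
```

Under these assumed numbers the global limit is reached long before any single region gets near its blocking threshold, so it may be worth checking the actual heap size of the 8Gb nodes before raising the flush size further.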
-
Re: Tune MapReduce over HBase to insert data
Adrien Mogenet 2013-02-04, 19:53
I didn't find documentation about these settings. Is it recommended to set them higher than the default value ("1") on modern servers? Or is this internal behavior that we should not tune ourselves?

On Tue, Jan 15, 2013 at 2:33 AM, Bing Jiang <[EMAIL PROTECTED]> wrote:

> Hi, mohandes.zebeleh
> you can adjust parameter as below( Major Compaction, Minor Compaction, Split):
> if you do not set, it will retain default value(1).
>
> <property>
>   <name>hbase.regionserver.thread.compaction.large</name>
>   <value>5</value>
> </property>
> <property>
>   <name>hbase.regionserver.thread.compaction.small</name>
>   <value>10</value>
> </property>
> <property>
>   <name>hbase.regionserver.thread.split</name>
>   <value>5</value>
> </property>
>
> Regards!
>
> Bing

Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me