Re: Best Hbase Storage for PIG
Michel Segel 2012-04-26, 11:48
32 cores w 32GB of Ram?
Pig isn't fast, but I have to question what you are using for hardware. Who makes a 32 core box? Assuming you mean 16 physical cores.
7 drives? Not enough spindles for the number of cores.
Sent from a remote device. Please excuse any typos...
Mike Segel
On Apr 26, 2012, at 6:38 AM, Rajgopal Vaithiyanathan <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> The default - HBaseStorage() takes hell lot of time for puts.
>
> In a cluster of 5 machines, insertion of 175 Million records took 4Hours 45 minutes
> Question - Is this good enough ?
> each machine has 32 cores and 32GB ram with 7*600GB harddisks. HBASE's heap has been configured to 8GB.
> If the put speed is low, how can i improve them..?
>
> I tried tweaking the TableOutputFormat by increasing the WriteBufferSize to 24MB, and adding the multi put feature (by adding 10,000 puts in ArrayList and putting it as a batch). After doing this, it started throwing
>
> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to slave1/172.21.208.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41135 remote=slave1/172.21.208.176:60020]
>
> Which i assume is because, the clients took too long to put.
>
> The detailed log is as follows from one of the reduce job is as follows.
>
> I've 'censored' some of the details. which i assume is Okay.! :P
> 2012-04-23 20:07:12,815 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2012-04-23 20:07:13,097 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.2-1221870, built on 12/21/2011 20:46 GMT
> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=*****.*****
> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_22
> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-6-openjdk/jre
> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=****************************
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=**********************
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=***************************
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.38-8-server
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=raj
>
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=*********
> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=**********************:
> 2012-04-23 20:07:13,790 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181 sessionTimeout=180000 watcher=hconnection
> 2012-04-23 20:07:13,822 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /172.21.208.180:2181
> 2012-04-23 20:07:13,823 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is [EMAIL PROTECTED]e1
> 2012-04-23 20:07:13,825 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to master/172.21.208.180:2181, initiating session
Re: Best Hbase Storage for PIG
Rajgopal Vaithiyanathan 2012-04-26, 12:09
My bad.
I had used

    cat /proc/cpuinfo | grep "processor" | wc -l
    cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l    => 4

so it's 4 physical cores then!

and free -m gives me this.

                 total       used       free     shared    buffers     cached
    Mem:         32174      31382        792          0        123      27339
    -/+ buffers/cache:       3918      28256
    Swap:        24575          0      24575
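
A note on reading those counts: in /proc/cpuinfo the "processor" entries are logical CPUs and "physical id" identifies a socket (package), so a count of 4 distinct physical ids points to four sockets rather than four cores. Distinct (physical id, core id) pairs give the physical core count. A minimal sketch of that counting follows; the function name `cpu_topology` and the `SAMPLE` text are made up for illustration, and the field order assumed is the usual x86 one (physical id listed before core id).

```python
def cpu_topology(cpuinfo_text):
    """Count logical CPUs, sockets, and physical cores in /proc/cpuinfo text."""
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            # Each "processor" stanza is one logical CPU (hardware thread).
            logical += 1
            physical_id = None
        elif key == "physical id":
            # Socket / package identifier.
            physical_id = value
            sockets.add(value)
        elif key == "core id":
            # Core ids repeat across sockets, so pair them with the socket id.
            cores.add((physical_id, value))
    return {"logical": logical, "sockets": len(sockets), "cores": len(cores)}


# Illustrative input only: 2 sockets x 2 cores x 2 hardware threads.
SAMPLE = "\n\n".join(
    f"processor\t: {p}\nphysical id\t: {s}\ncore id\t\t: {c}"
    for p, (s, c) in enumerate(
        (s, c) for s in range(2) for c in range(2) for _ in range(2)
    )
)

if __name__ == "__main__":
    print(cpu_topology(SAMPLE))
```

On a real box the same function can be fed the contents of /proc/cpuinfo.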
Re: Best Hbase Storage for PIG
Michel Segel 2012-04-26, 13:41
OK...
5 machines...
Is that the total cluster? Is that 5 DN?
Each machine has 1 quad core, 32GB RAM, 7 x 600GB drives; not sure what type of drives.

So let's assume 1 control node running NN, JT, HM, ZK
and 4 DN running DN, TT, RS.

We don't know your schema, row size, or network (10GbE, 1GbE, 100MbE?).

We also don't know if you've tuned GC, implemented MSLAB, etc.

So 4 hours for 175 million rows? Could be OK.
Write your insert using a Java M/R job and see how long it takes.

Nor do we know how many slots you have on each box.
10k rows in a batch put() is not really a good idea.
What's your region size?

Lots to think about before you can ask if you are doing the right thing, or if Pig is the bottleneck.

Sent from a remote device. Please excuse any typos...
Mike Segel
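
For reference, the figures quoted in this thread (175 million records in 4 hours 45 minutes) work out as below. This is only arithmetic on the thread's own numbers; the count of 4 region servers is the assumption made in the message above, not a confirmed fact, and `put_rate` is a name invented for this sketch.

```python
def put_rate(records, hours, minutes, region_servers):
    """Return (cluster rows/s, rows/s per region server) for a finished load."""
    elapsed_s = hours * 3600 + minutes * 60
    cluster = records / elapsed_s
    return cluster, cluster / region_servers


if __name__ == "__main__":
    # 4 region servers is assumed (1 control node + 4 workers), as guessed above.
    cluster, per_server = put_rate(175_000_000, 4, 45, region_servers=4)
    print(f"{cluster:.0f} rows/s overall, {per_server:.0f} rows/s per region server")
    # prints "10234 rows/s overall, 2558 rows/s per region server"
```

Whether roughly 10k rows per second is acceptable still depends on the row size and hardware questions raised above.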
Re: Best Hbase Storage for PIG
Rajgopal Vaithiyanathan 2012-04-27, 06:13
@doug
Regarding monotonically increasing keys, I took care of that by randomizing the data order.
Regarding pre-created regions: I did not know I could do that. Thanks.
But when I looked into the case studies, I saw the section "HBase Region With Non-Local Data". Will this be a problem when I pre-create the regions?

@michel
The schema is simple: one column family, in which we'll insert a max of 10 columns. 4 columns are compulsory and the other 6 columns are sparsely filled.

KEY: a string of 50 characters
Col1: int
Col2: string of 20 characters
col3: string of 20 characters
col4: int
col5: int [sparse]
col6: float [sparse]
col7: string of 3 char [sparse]
col8: string of 3 char [sparse]
col9: string of 3 char [sparse]
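
A rough way to turn this schema into a per-row size: HBase stores each column as a separate KeyValue that repeats the row key, so the 50-character key is paid once per cell. The sketch below uses the classic KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus row, family, qualifier and value bytes). The one-byte family name, the qualifier names, 4-byte ints and one byte per character are all assumptions for illustration; the default Pig converter may well store ints as text, and compression, WAL and HDFS replication are ignored.

```python
# Fixed bytes per KeyValue: key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1).
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_bytes(row_len, family_len, qualifier_len, value_len):
    """Approximate serialized size of one cell."""
    return KV_FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len


def row_bytes(row_len, family_len, columns):
    """Sum cell sizes for a row; columns is a list of (qualifier, value_len)."""
    return sum(cell_bytes(row_len, family_len, len(q), v) for q, v in columns)


# The four compulsory columns from the schema above (value sizes assumed).
COMPULSORY = [("Col1", 4), ("Col2", 20), ("col3", 20), ("col4", 4)]

if __name__ == "__main__":
    per_row = row_bytes(row_len=50, family_len=1, columns=COMPULSORY)
    total_gib = per_row * 175_000_000 / 2**30
    print(f"{per_row} bytes per row, about {total_gib:.1f} GiB for 175M rows")
```

Under these assumptions the four compulsory columns alone cost 348 bytes per row, most of it repeated row key.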
I've kept max.reduce.tasks = 16.

I haven't set MSLAB. What values do you recommend for my cluster?

> "10k rows in a batch put() not really a good idea."
Hmm, should it be fewer or more?

> "What's your region size?"
I did not set hbase.hregion.max.filesize manually, nor did I pre-create regions. Please recommend a value.
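
On the batch-size question: the thread does not settle on a number, so one option is to make the batch size a parameter and measure a few values. The sketch below only shows the chunking itself; `batched` is a hypothetical helper, the size of 1000 is arbitrary, and the actual put call is left out because it depends on the client being used.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    rows = range(2500)  # stand-in for the records to write
    for batch in batched(rows, 1000):
        # A real job would hand `batch` to the HBase client here.
        print(len(batch))
```

Smaller batches keep each RPC shorter, which matters when the call has to finish inside the 60000 ms timeout shown in the exception earlier in the thread.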
I'm not saying Pig will be the bottleneck. The output format, the HBase configuration, or the hardware could be. I need suggestions on those.

Can I use HFileOutputFormat in this case? Can I get some example snippets?
Thanks Raj
Re: Best Hbase Storage for PIG
Raghu Angadi 2012-04-27, 16:38
A lot of factors can affect HBase performance. It could even be hardware related (slow network or disk).

How fast can you scan? Does that work well?
You could take a jstack of the clients (reducers) and the region servers while you are writing, and post them here and/or to the HBase list. This would point to where the bottleneck is.
Raghu.
Re: Best Hbase Storage for PIG
Rajgopal Vaithiyanathan 2012-04-28, 07:08
@raghu, good idea. Will do the scan benchmark soon.

Thanks and Regards,
Rajgopal Vaithiyanathan.
Re: Best Hbase Storage for PIG
M. C. Srivas 2012-04-28, 15:16
On Thu, Apr 26, 2012 at 4:38 AM, Rajgopal Vaithiyanathan < [EMAIL PROTECTED]> wrote:
> Hey all,
>
> The default - HBaseStorage() takes hell lot of time for puts.
>
> In a cluster of 5 machines, insertion of 175 Million records took 4Hours 45 minutes
> Question - Is this good enough ?
> each machine has 32 cores and 32GB ram with 7*600GB harddisks. HBASE's heap has been configured to 8GB.
> If the put speed is low, how can i improve them..?
>
Raj, how big is each record?
Re: Best Hbase Storage for PIG
Subir S 2012-05-12, 08:56
Could you use completebulkload and see if that works? That should be faster than HBaseStorage. You could pre-split the table using:
export HADOOP_CLASSPATH=`hbase classpath`;hbase org.apache.hadoop.hbase.util.RegionSplitter -c 10 '<table_name>' -f <cf name>
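
To illustrate what pre-splitting buys: the table starts with several regions whose boundaries divide the key space, so writes spread across region servers from the first row. The sketch below computes evenly spaced boundaries under the assumption that row keys are uniformly distributed fixed-width hex strings (for example, hashed keys). It is not a description of what RegionSplitter does internally in any particular HBase version, and `hex_split_points` is a name invented here.

```python
def hex_split_points(num_regions, width=8):
    """Return num_regions - 1 evenly spaced hex split keys of the given width."""
    if num_regions < 2:
        return []
    space = 16 ** width
    return [
        format(i * space // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    for key in hex_split_points(4):
        print(key)
    # prints 40000000, 80000000, c0000000 on separate lines
```

With 10 regions, as in the command above, this gives 9 split keys. Keys that are not uniformly distributed need split points chosen from the real key distribution instead.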
Re: Best Hbase Storage for PIG
Doug Meil 2012-04-26, 13:04
Hi there,

As a sanity check with respect to writing, have you double-checked this section of the RefGuide regarding pre-created regions and monotonically increasing keys?

http://hbase.apache.org/book.html#perf.writing

Also as a sanity check, refer to this case study as a diagnostic roadmap:

http://hbase.apache.org/book.html#casestudies.perftroub
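
On monotonically increasing keys: when consecutive writes carry adjacent row keys they all land in the same region, whatever order the input file is in once the keys themselves are sequential. One common workaround is to prefix each key with a bucket derived from a hash of the key. The sketch below is a generic illustration, not something prescribed in this thread; `salted_key`, the 16 buckets and the "NN-" prefix format are arbitrary choices.

```python
import hashlib


def salted_key(key, buckets=16):
    """Prefix key with a stable two-digit bucket derived from its MD5 hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{key}"


if __name__ == "__main__":
    # Sequential keys end up under different prefixes.
    for i in range(5):
        print(salted_key(f"row{i:08d}"))
```

The trade-off is that a range scan over the original key order then has to be issued once per bucket.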