|
Yousuf Ahmad
2012-10-18, 18:26
Pamecha, Abhishek
2012-10-18, 18:38
lars hofhansl
2012-10-18, 18:42
Yousuf Ahmad
2012-10-18, 19:59
lars hofhansl
2012-10-19, 05:25
Yousuf Ahmad
2012-10-19, 17:12
Pamecha, Abhishek
2012-10-19, 17:20
Yousuf Ahmad
2012-10-19, 18:21
Yousuf Ahmad
2012-10-30, 01:07
lars hofhansl
2012-10-30, 05:14
Yousuf Ahmad
2012-10-30, 11:06
Vincent Barat
2012-11-20, 18:57
Ramkrishna.S.Vasudevan
2012-10-19, 04:38
|
-
High IPC LatencyYousuf Ahmad 2012-10-18, 18:26
Hello,
We are seeing slow times for read operations in our experiments. We are hoping that you guys can help us figure out what's going wrong. Here are some details: - We are running a read-only benchmark on our HBase cluster. - - There are 10 regionservers, each co-located with a datanode. HDFS replication is 3x. - All the data read by the experiment is already in the block cache and the hit ratio is 99%. - - We have 10 clients, each with around 400 threads making a mix of read-only requests involving multi-gets and scans. - - We settled on the default client pool type/size (roundrobin/1) and a regionserver handler count of 100 after testing various combinations to see what setting worked best. - - Our scans are short, fetching around 10 rows on average. Scanner caching is set to 50. - An average row in a scan has either around 10 columns (small row) or around 200 columns (big row). - - Our multi-gets fetch around 200 rows on average. - An average row in a multi-get has around 10 columns. - Each column holds an integer (encoded into bytes). - - None of the machines involved reach CPU, memory, or IO saturation. In fact resource utilization stays quite low. - - Our statistics show that the average time for a scan, measured starting from the first scanner.next() call to the last one which returns a null, is around 2-3 seconds. - Since we use scanner caching, the major portion of this time (around 2 seconds) is spent on the first call to next(), while the remaining calls take a negligible amount of time. - Similarly, we see that a multi-get on average takes around 2 seconds. - A single get on average takes around 1 second. We are not sure what the bottleneck is or where it lies. We thought we should look deeper into what is going on at the regionservers. We monitored the IPC calls during one of the experiments. Here is a sample of one regionserver log: 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115483; Served: HRegionInterface#get queueTime=0 processingTime=1 contents=1 Get, 75 bytes 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115487; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115489; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #111421; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115497; Served: HRegionInterface#multi queueTime=0 processingTime=9 contents=200 Gets 2012-10-18 17:00:09,984 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115499; Served: HRegionInterface#openScanner queueTime=0 processingTime=0 contents=1 Scan, 63 bytes 2012-10-18 17:00:09,990 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115503; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,992 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103230; Served: HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103234; Served: HRegionInterface#close queueTime=0 processingTime=0 contents=1 Long 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103232; Served: HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer I have attached a larger chunk of the logs we collected for this experiment in case that helps. >From the logs, we saw that the next() operation at the regionserver takes 1 millisecond or less; and a multi-get takes 10 ms on average. Yet the corresponding times we see at the client are orders of magnitude higher. Ping times between the machines are at most 1ms and we are not saturating the network. We would really appreciate some insights from you guys on this. Where do you suggest we focus our efforts in order to hunt down this bottleneck/contention? Thanks! Yousuf +
Yousuf Ahmad 2012-10-18, 18:26
-
RE: High IPC LatencyPamecha, Abhishek 2012-10-18, 18:38
Is it sustained for the same client hitting the same region server OR does it get better for the same client-RS combination when run for longer duration? Trying to eliminate Zookeeper from this.
Thanks, Abhishek From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 18, 2012 11:26 AM To: [EMAIL PROTECTED] Cc: Ivan Brondino; Ricardo Vilaça Subject: High IPC Latency Hello, We are seeing slow times for read operations in our experiments. We are hoping that you guys can help us figure out what's going wrong. Here are some details: * We are running a read-only benchmark on our HBase cluster. * * There are 10 regionservers, each co-located with a datanode. HDFS replication is 3x. * All the data read by the experiment is already in the block cache and the hit ratio is 99%. * * We have 10 clients, each with around 400 threads making a mix of read-only requests involving multi-gets and scans. * * We settled on the default client pool type/size (roundrobin/1) and a regionserver handler count of 100 after testing various combinations to see what setting worked best. * * Our scans are short, fetching around 10 rows on average. Scanner caching is set to 50. * An average row in a scan has either around 10 columns (small row) or around 200 columns (big row). * * Our multi-gets fetch around 200 rows on average. * An average row in a multi-get has around 10 columns. * Each column holds an integer (encoded into bytes). * * None of the machines involved reach CPU, memory, or IO saturation. In fact resource utilization stays quite low. * * Our statistics show that the average time for a scan, measured starting from the first scanner.next() call to the last one which returns a null, is around 2-3 seconds. * Since we use scanner caching, the major portion of this time (around 2 seconds) is spent on the first call to next(), while the remaining calls take a negligible amount of time. * Similarly, we see that a multi-get on average takes around 2 seconds. * A single get on average takes around 1 second. We are not sure what the bottleneck is or where it lies. We thought we should look deeper into what is going on at the regionservers. We monitored the IPC calls during one of the experiments. Here is a sample of one regionserver log: 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115483; Served: HRegionInterface#get queueTime=0 processingTime=1 contents=1 Get, 75 bytes 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115487; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115489; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #111421; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115497; Served: HRegionInterface#multi queueTime=0 processingTime=9 contents=200 Gets 2012-10-18 17:00:09,984 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115499; Served: HRegionInterface#openScanner queueTime=0 processingTime=0 contents=1 Scan, 63 bytes 2012-10-18 17:00:09,990 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115503; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,992 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103230; Served: HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103234; Served: HRegionInterface#close queueTime=0 processingTime=0 contents=1 Long 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103232; Served: HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer I have attached a larger chunk of the logs we collected for this experiment in case that helps. Yet the corresponding times we see at the client are orders of magnitude higher. Ping times between the machines are at most 1ms and we are not saturating the network. We would really appreciate some insights from you guys on this. Where do you suggest we focus our efforts in order to hunt down this bottleneck/contention? Thanks! Yousuf +
Pamecha, Abhishek 2012-10-18, 18:38
-
Re: High IPC Latencylars hofhansl 2012-10-18, 18:42
Also, what version of HBase/HDFS is this using?
----- Original Message ----- From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça <[EMAIL PROTECTED]> Sent: Thursday, October 18, 2012 11:38 AM Subject: RE: High IPC Latency Is it sustained for the same client hitting the same region server OR does it get better for the same client-RS combination when run for longer duration? Trying to eliminate Zookeeper from this. Thanks, Abhishek From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 18, 2012 11:26 AM To: [EMAIL PROTECTED] Cc: Ivan Brondino; Ricardo Vilaça Subject: High IPC Latency Hello, We are seeing slow times for read operations in our experiments. We are hoping that you guys can help us figure out what's going wrong. Here are some details: * We are running a read-only benchmark on our HBase cluster. * * There are 10 regionservers, each co-located with a datanode. HDFS replication is 3x. * All the data read by the experiment is already in the block cache and the hit ratio is 99%. * * We have 10 clients, each with around 400 threads making a mix of read-only requests involving multi-gets and scans. * * We settled on the default client pool type/size (roundrobin/1) and a regionserver handler count of 100 after testing various combinations to see what setting worked best. * * Our scans are short, fetching around 10 rows on average. Scanner caching is set to 50. * An average row in a scan has either around 10 columns (small row) or around 200 columns (big row). * * Our multi-gets fetch around 200 rows on average. * An average row in a multi-get has around 10 columns. * Each column holds an integer (encoded into bytes). * * None of the machines involved reach CPU, memory, or IO saturation. In fact resource utilization stays quite low. * * Our statistics show that the average time for a scan, measured starting from the first scanner.next() call to the last one which returns a null, is around 2-3 seconds. * Since we use scanner caching, the major portion of this time (around 2 seconds) is spent on the first call to next(), while the remaining calls take a negligible amount of time. * Similarly, we see that a multi-get on average takes around 2 seconds. * A single get on average takes around 1 second. We are not sure what the bottleneck is or where it lies. We thought we should look deeper into what is going on at the regionservers. We monitored the IPC calls during one of the experiments. Here is a sample of one regionserver log: 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115483; Served: HRegionInterface#get queueTime=0 processingTime=1 contents=1 Get, 75 bytes 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115487; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115489; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #111421; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115497; Served: HRegionInterface#multi queueTime=0 processingTime=9 contents=200 Gets 2012-10-18 17:00:09,984 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115499; Served: HRegionInterface#openScanner queueTime=0 processingTime=0 contents=1 Scan, 63 bytes 2012-10-18 17:00:09,990 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115503; Served: HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes 2012-10-18 17:00:09,992 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103230; Served: HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103234; Served: HRegionInterface#close queueTime=0 processingTime=0 contents=1 Long 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103232; Served: HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer I have attached a larger chunk of the logs we collected for this experiment in case that helps. From the logs, we saw that the next() operation at the regionserver takes 1 millisecond or less; and a multi-get takes 10 ms on average. Yet the corresponding times we see at the client are orders of magnitude higher. Ping times between the machines are at most 1ms and we are not saturating the network. We would really appreciate some insights from you guys on this. Where do you suggest we focus our efforts in order to hunt down this bottleneck/contention? Thanks! Yousuf +
lars hofhansl 2012-10-18, 18:42
-
Re: High IPC LatencyYousuf Ahmad 2012-10-18, 19:59
Hi,
Thank you for your questions guys. We are using HBase 0.92 with HDFS 1.0.1. The experiment lasts 15 minutes. The measurements stabilize in the first two minutes of the run. The data is distributed almost evenly across the regionservers so each client hits most of them over the course of the experiment. However, for the data we have, any given multi-get or scan should touch only one or at most two regions. The client caches the locations of the regionservers, so after a couple of minutes of the experiment running, it wouldn't need to re-visit ZooKeeper, I believe. Correct me if I am wrong please. Regards, Yousuf On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Also, what version of HBase/HDFS is this using? > > > > > ----- Original Message ----- > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > [EMAIL PROTECTED]> > Sent: Thursday, October 18, 2012 11:38 AM > Subject: RE: High IPC Latency > > Is it sustained for the same client hitting the same region server OR does > it get better for the same client-RS combination when run for longer > duration? Trying to eliminate Zookeeper from this. > > Thanks, > Abhishek > > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > Sent: Thursday, October 18, 2012 11:26 AM > To: [EMAIL PROTECTED] > Cc: Ivan Brondino; Ricardo Vilaça > Subject: High IPC Latency > > Hello, > > We are seeing slow times for read operations in our experiments. We are > hoping that you guys can help us figure out what's going wrong. > > Here are some details: > > * We are running a read-only benchmark on our HBase cluster. > * > * There are 10 regionservers, each co-located with a datanode. HDFS > replication is 3x. > * All the data read by the experiment is already in the block cache > and the hit ratio is 99%. > * > * We have 10 clients, each with around 400 threads making a mix of > read-only requests involving multi-gets and scans. > * > * We settled on the default client pool type/size (roundrobin/1) and a > regionserver handler count of 100 after testing various combinations to see > what setting worked best. > * > * Our scans are short, fetching around 10 rows on average. Scanner > caching is set to 50. > * An average row in a scan has either around 10 columns (small row) or > around 200 columns (big row). > * > * Our multi-gets fetch around 200 rows on average. > * An average row in a multi-get has around 10 columns. > * Each column holds an integer (encoded into bytes). > * > * None of the machines involved reach CPU, memory, or IO saturation. > In fact resource utilization stays quite low. > * > * Our statistics show that the average time for a scan, measured > starting from the first scanner.next() call to the last one which returns a > null, is around 2-3 seconds. > * Since we use scanner caching, the major portion of this time (around > 2 seconds) is spent on the first call to next(), while the remaining calls > take a negligible amount of time. > * Similarly, we see that a multi-get on average takes around 2 seconds. > * A single get on average takes around 1 second. > We are not sure what the bottleneck is or where it lies. We thought we > should look deeper into what is going on at the regionservers. We monitored > the IPC calls during one of the experiments. Here is a sample of one > regionserver log: > > 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: > Call #115483; Served: HRegionInterface#get queueTime=0 processingTime=1 > contents=1 Get, 75 bytes > 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: > Call #115487; Served: HRegionInterface#get queueTime=0 processingTime=0 > contents=1 Get, 75 bytes > 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: > Call #115489; Served: HRegionInterface#get queueTime=0 processingTime=0 +
Yousuf Ahmad 2012-10-18, 19:59
-
Re: High IPC Latencylars hofhansl 2012-10-19, 05:25
Can you reproduce this against a single, local region server?
Any chance that you can try with the just released 0.94.2? I would love to debug this. If would be a tremendous help if you had a little test program that reproduces this against a single server, so that I can see what is going on. Thanks. -- Lars ----- Original Message ----- From: Yousuf Ahmad <[EMAIL PROTECTED]> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça <[EMAIL PROTECTED]> Sent: Thursday, October 18, 2012 12:59 PM Subject: Re: High IPC Latency Hi, Thank you for your questions guys. We are using HBase 0.92 with HDFS 1.0.1. The experiment lasts 15 minutes. The measurements stabilize in the first two minutes of the run. The data is distributed almost evenly across the regionservers so each client hits most of them over the course of the experiment. However, for the data we have, any given multi-get or scan should touch only one or at most two regions. The client caches the locations of the regionservers, so after a couple of minutes of the experiment running, it wouldn't need to re-visit ZooKeeper, I believe. Correct me if I am wrong please. Regards, Yousuf On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Also, what version of HBase/HDFS is this using? > > > > > ----- Original Message ----- > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > [EMAIL PROTECTED]> > Sent: Thursday, October 18, 2012 11:38 AM > Subject: RE: High IPC Latency > > Is it sustained for the same client hitting the same region server OR does > it get better for the same client-RS combination when run for longer > duration? Trying to eliminate Zookeeper from this. > > Thanks, > Abhishek > > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > Sent: Thursday, October 18, 2012 11:26 AM > To: [EMAIL PROTECTED] > Cc: Ivan Brondino; Ricardo Vilaça > Subject: High IPC Latency > > Hello, > > We are seeing slow times for read operations in our experiments. We are > hoping that you guys can help us figure out what's going wrong. > > Here are some details: > > * We are running a read-only benchmark on our HBase cluster. > * > * There are 10 regionservers, each co-located with a datanode. HDFS > replication is 3x. > * All the data read by the experiment is already in the block cache > and the hit ratio is 99%. > * > * We have 10 clients, each with around 400 threads making a mix of > read-only requests involving multi-gets and scans. > * > * We settled on the default client pool type/size (roundrobin/1) and a > regionserver handler count of 100 after testing various combinations to see > what setting worked best. > * > * Our scans are short, fetching around 10 rows on average. Scanner > caching is set to 50. > * An average row in a scan has either around 10 columns (small row) or > around 200 columns (big row). > * > * Our multi-gets fetch around 200 rows on average. > * An average row in a multi-get has around 10 columns. > * Each column holds an integer (encoded into bytes). > * > * None of the machines involved reach CPU, memory, or IO saturation. > In fact resource utilization stays quite low. > * > * Our statistics show that the average time for a scan, measured > starting from the first scanner.next() call to the last one which returns a > null, is around 2-3 seconds. > * Since we use scanner caching, the major portion of this time (around > 2 seconds) is spent on the first call to next(), while the remaining calls > take a negligible amount of time. > * Similarly, we see that a multi-get on average takes around 2 seconds. > * A single get on average takes around 1 second. > We are not sure what the bottleneck is or where it lies. We thought we > should look deeper into what is going on at the regionservers. We monitored +
lars hofhansl 2012-10-19, 05:25
-
Re: High IPC LatencyYousuf Ahmad 2012-10-19, 17:12
Hi Lars,
We are following your suggestion and testing against a single region server. We just ran a test against a remote region server and soon we will test against a local one as well. We will get back to you soon with the results. It will take us a couple of days to port to and test our code with 0.94.2. Once we have it working, we will run some experiments and update this thread. Unfortunately, the nature of our project is such that we cannot easily translate the benchmark's workload into a program executing the equivalent HBase operations directly. For this reason, I attempted to roughly translate the workload in terms of HBase operations in my first email and I attached a portion of the logs to be a bit more concrete. Your assistance is very much appreciated! Thank you! We'll keep you updated. Best regards, Yousuf On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Can you reproduce this against a single, local region server? > Any chance that you can try with the just released 0.94.2? > > > I would love to debug this. If would be a tremendous help if you had a > little test program that reproduces this against a single server, so that I > can see what is going on. > > Thanks. > > -- Lars > > > > ----- Original Message ----- > From: Yousuf Ahmad <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > [EMAIL PROTECTED]> > Sent: Thursday, October 18, 2012 12:59 PM > Subject: Re: High IPC Latency > > Hi, > > Thank you for your questions guys. > > We are using HBase 0.92 with HDFS 1.0.1. > > The experiment lasts 15 minutes. The measurements stabilize in the first > two minutes of the run. > > The data is distributed almost evenly across the regionservers so each > client hits most of them over the course of the experiment. However, for > the data we have, any given multi-get or scan should touch only one or at > most two regions. > > The client caches the locations of the regionservers, so after a couple of > minutes of the experiment running, it wouldn't need to re-visit ZooKeeper, > I believe. Correct me if I am wrong please. > > Regards, > Yousuf > > > On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > Also, what version of HBase/HDFS is this using? > > > > > > > > > > ----- Original Message ----- > > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > > [EMAIL PROTECTED]> > > Sent: Thursday, October 18, 2012 11:38 AM > > Subject: RE: High IPC Latency > > > > Is it sustained for the same client hitting the same region server OR > does > > it get better for the same client-RS combination when run for longer > > duration? Trying to eliminate Zookeeper from this. > > > > Thanks, > > Abhishek > > > > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, October 18, 2012 11:26 AM > > To: [EMAIL PROTECTED] > > Cc: Ivan Brondino; Ricardo Vilaça > > Subject: High IPC Latency > > > > Hello, > > > > We are seeing slow times for read operations in our experiments. We are > > hoping that you guys can help us figure out what's going wrong. > > > > Here are some details: > > > > * We are running a read-only benchmark on our HBase cluster. > > * > > * There are 10 regionservers, each co-located with a datanode. HDFS > > replication is 3x. > > * All the data read by the experiment is already in the block cache > > and the hit ratio is 99%. > > * > > * We have 10 clients, each with around 400 threads making a mix of > > read-only requests involving multi-gets and scans. > > * > > * We settled on the default client pool type/size (roundrobin/1) and > a > > regionserver handler count of 100 after testing various combinations to > see > > what setting worked best. > > * > > * Our scans are short, fetching around 10 rows on average. Scanner +
Yousuf Ahmad 2012-10-19, 17:12
-
RE: High IPC LatencyPamecha, Abhishek 2012-10-19, 17:20
Also, I hope no coprocessors are in play.
Thanks, Abhishek -----Original Message----- From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] Sent: Friday, October 19, 2012 10:12 AM To: [EMAIL PROTECTED] Cc: Ivan Brondino; Ricardo Vilaça Subject: Re: High IPC Latency Hi Lars, We are following your suggestion and testing against a single region server. We just ran a test against a remote region server and soon we will test against a local one as well. We will get back to you soon with the results. It will take us a couple of days to port to and test our code with 0.94.2. Once we have it working, we will run some experiments and update this thread. Unfortunately, the nature of our project is such that we cannot easily translate the benchmark's workload into a program executing the equivalent HBase operations directly. For this reason, I attempted to roughly translate the workload in terms of HBase operations in my first email and I attached a portion of the logs to be a bit more concrete. Your assistance is very much appreciated! Thank you! We'll keep you updated. Best regards, Yousuf On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Can you reproduce this against a single, local region server? > Any chance that you can try with the just released 0.94.2? > > > I would love to debug this. If would be a tremendous help if you had a > little test program that reproduces this against a single server, so > that I can see what is going on. > > Thanks. > > -- Lars > > > > ----- Original Message ----- > From: Yousuf Ahmad <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > [EMAIL PROTECTED]> > Sent: Thursday, October 18, 2012 12:59 PM > Subject: Re: High IPC Latency > > Hi, > > Thank you for your questions guys. > > We are using HBase 0.92 with HDFS 1.0.1. > > The experiment lasts 15 minutes. The measurements stabilize in the > first two minutes of the run. > > The data is distributed almost evenly across the regionservers so each > client hits most of them over the course of the experiment. However, > for the data we have, any given multi-get or scan should touch only > one or at most two regions. > > The client caches the locations of the regionservers, so after a > couple of minutes of the experiment running, it wouldn't need to > re-visit ZooKeeper, I believe. Correct me if I am wrong please. > > Regards, > Yousuf > > > On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > Also, what version of HBase/HDFS is this using? > > > > > > > > > > ----- Original Message ----- > > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > > [EMAIL PROTECTED]> > > Sent: Thursday, October 18, 2012 11:38 AM > > Subject: RE: High IPC Latency > > > > Is it sustained for the same client hitting the same region server > > OR > does > > it get better for the same client-RS combination when run for longer > > duration? Trying to eliminate Zookeeper from this. > > > > Thanks, > > Abhishek > > > > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, October 18, 2012 11:26 AM > > To: [EMAIL PROTECTED] > > Cc: Ivan Brondino; Ricardo Vilaça > > Subject: High IPC Latency > > > > Hello, > > > > We are seeing slow times for read operations in our experiments. We > > are hoping that you guys can help us figure out what's going wrong. > > > > Here are some details: > > > > * We are running a read-only benchmark on our HBase cluster. > > * > > * There are 10 regionservers, each co-located with a datanode. HDFS > > replication is 3x. > > * All the data read by the experiment is already in the block cache > > and the hit ratio is 99%. > > * > > * We have 10 clients, each with around 400 threads making a mix of > > read-only requests involving multi-gets and scans. +
Pamecha, Abhishek 2012-10-19, 17:20
-
RE: High IPC LatencyYousuf Ahmad 2012-10-19, 18:21
No coprocessors :-)
On Oct 19, 2012 1:21 PM, "Pamecha, Abhishek" <[EMAIL PROTECTED]> wrote: > Also, I hope no coprocessors are in play. > > Thanks, > Abhishek > > > -----Original Message----- > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > Sent: Friday, October 19, 2012 10:12 AM > To: [EMAIL PROTECTED] > Cc: Ivan Brondino; Ricardo Vilaça > Subject: Re: High IPC Latency > > Hi Lars, > > We are following your suggestion and testing against a single region > server. We just ran a test against a remote region server and soon we will > test against a local one as well. We will get back to you soon with the > results. > > It will take us a couple of days to port to and test our code with 0.94.2. > Once we have it working, we will run some experiments and update this > thread. > > Unfortunately, the nature of our project is such that we cannot easily > translate the benchmark's workload into a program executing the equivalent > HBase operations directly. For this reason, I attempted to roughly > translate the workload in terms of HBase operations in my first email and I > attached a portion of the logs to be a bit more concrete. > > Your assistance is very much appreciated! Thank you! We'll keep you > updated. > > Best regards, > Yousuf > > > On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > Can you reproduce this against a single, local region server? > > Any chance that you can try with the just released 0.94.2? > > > > > > I would love to debug this. If would be a tremendous help if you had a > > little test program that reproduces this against a single server, so > > that I can see what is going on. > > > > Thanks. > > > > -- Lars > > > > > > > > ----- Original Message ----- > > From: Yousuf Ahmad <[EMAIL PROTECTED]> > > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > > [EMAIL PROTECTED]> > > Sent: Thursday, October 18, 2012 12:59 PM > > Subject: Re: High IPC Latency > > > > Hi, > > > > Thank you for your questions guys. > > > > We are using HBase 0.92 with HDFS 1.0.1. > > > > The experiment lasts 15 minutes. The measurements stabilize in the > > first two minutes of the run. > > > > The data is distributed almost evenly across the regionservers so each > > client hits most of them over the course of the experiment. However, > > for the data we have, any given multi-get or scan should touch only > > one or at most two regions. > > > > The client caches the locations of the regionservers, so after a > > couple of minutes of the experiment running, it wouldn't need to > > re-visit ZooKeeper, I believe. Correct me if I am wrong please. > > > > Regards, > > Yousuf > > > > > > On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> > > wrote: > > > > > Also, what version of HBase/HDFS is this using? > > > > > > > > > > > > > > > ----- Original Message ----- > > > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> > > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > > > [EMAIL PROTECTED]> > > > Sent: Thursday, October 18, 2012 11:38 AM > > > Subject: RE: High IPC Latency > > > > > > Is it sustained for the same client hitting the same region server > > > OR > > does > > > it get better for the same client-RS combination when run for longer > > > duration? Trying to eliminate Zookeeper from this. > > > > > > Thanks, > > > Abhishek > > > > > > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > > > Sent: Thursday, October 18, 2012 11:26 AM > > > To: [EMAIL PROTECTED] > > > Cc: Ivan Brondino; Ricardo Vilaça > > > Subject: High IPC Latency > > > > > > Hello, > > > > > > We are seeing slow times for read operations in our experiments. We > > > are hoping that you guys can help us figure out what's going wrong. > > > > > > Here are some details: > > > > > > * We are running a read-only benchmark on our HBase cluster. > > > * > > > * There are 10 regionservers, each co-located with a datanode. HDFS +
Yousuf Ahmad 2012-10-19, 18:21
-
Re: High IPC LatencyYousuf Ahmad 2012-10-30, 01:07
Hi guys,
I have an update. Actually a question. @Override public synchronized Object[] batch(final List<Row> actions) throws InterruptedException, IOException { Object[] results = new Object[actions.size()]; connection.processBatch(actions, tableName, pool, results); return results; } This synchronized method in HTable seems to be causing contention among concurrent multigets. I suppose my question is: why must we have this method synchronized? Thanks, Yousuf On Fri, Oct 19, 2012 at 2:21 PM, Yousuf Ahmad <[EMAIL PROTECTED]> wrote: > No coprocessors :-) > On Oct 19, 2012 1:21 PM, "Pamecha, Abhishek" <[EMAIL PROTECTED]> wrote: > >> Also, I hope no coprocessors are in play. >> >> Thanks, >> Abhishek >> >> >> -----Original Message----- >> From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] >> Sent: Friday, October 19, 2012 10:12 AM >> To: [EMAIL PROTECTED] >> Cc: Ivan Brondino; Ricardo Vilaça >> Subject: Re: High IPC Latency >> >> Hi Lars, >> >> We are following your suggestion and testing against a single region >> server. We just ran a test against a remote region server and soon we will >> test against a local one as well. We will get back to you soon with the >> results. >> >> It will take us a couple of days to port to and test our code with 0.94.2. >> Once we have it working, we will run some experiments and update this >> thread. >> >> Unfortunately, the nature of our project is such that we cannot easily >> translate the benchmark's workload into a program executing the equivalent >> HBase operations directly. For this reason, I attempted to roughly >> translate the workload in terms of HBase operations in my first email and I >> attached a portion of the logs to be a bit more concrete. >> >> Your assistance is very much appreciated! Thank you! We'll keep you >> updated. >> >> Best regards, >> Yousuf >> >> >> On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> >> wrote: >> >> > Can you reproduce this against a single, local region server? >> > Any chance that you can try with the just released 0.94.2? >> > >> > >> > I would love to debug this. If would be a tremendous help if you had a >> > little test program that reproduces this against a single server, so >> > that I can see what is going on. >> > >> > Thanks. >> > >> > -- Lars >> > >> > >> > >> > ----- Original Message ----- >> > From: Yousuf Ahmad <[EMAIL PROTECTED]> >> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> >> > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < >> > [EMAIL PROTECTED]> >> > Sent: Thursday, October 18, 2012 12:59 PM >> > Subject: Re: High IPC Latency >> > >> > Hi, >> > >> > Thank you for your questions guys. >> > >> > We are using HBase 0.92 with HDFS 1.0.1. >> > >> > The experiment lasts 15 minutes. The measurements stabilize in the >> > first two minutes of the run. >> > >> > The data is distributed almost evenly across the regionservers so each >> > client hits most of them over the course of the experiment. However, >> > for the data we have, any given multi-get or scan should touch only >> > one or at most two regions. >> > >> > The client caches the locations of the regionservers, so after a >> > couple of minutes of the experiment running, it wouldn't need to >> > re-visit ZooKeeper, I believe. Correct me if I am wrong please. >> > >> > Regards, >> > Yousuf >> > >> > >> > On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> >> > wrote: >> > >> > > Also, what version of HBase/HDFS is this using? >> > > >> > > >> > > >> > > >> > > ----- Original Message ----- >> > > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> >> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> >> > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < >> > > [EMAIL PROTECTED]> >> > > Sent: Thursday, October 18, 2012 11:38 AM >> > > Subject: RE: High IPC Latency >> > > >> > > Is it sustained for the same client hitting the same region server >> > > OR >> > does >> > > it get better for the same client-RS combination when run for longer +
Yousuf Ahmad 2012-10-30, 01:07
-
Re: High IPC Latencylars hofhansl 2012-10-30, 05:14
Whoa... Why indeed? Maybe there was a historic reason.
I filed HBASE-7069 to fix it. -- Lars ________________________________ From: Yousuf Ahmad <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> Sent: Monday, October 29, 2012 6:07 PM Subject: Re: High IPC Latency Hi guys, I have an update. Actually a question. @Override public synchronized Object[] batch(final List<Row> actions) throws InterruptedException, IOException { Object[] results = new Object[actions.size()]; connection.processBatch(actions, tableName, pool, results); return results; } This synchronized method in HTable seems to be causing contention among concurrent multigets. I suppose my question is: why must we have this method synchronized? Thanks, Yousuf On Fri, Oct 19, 2012 at 2:21 PM, Yousuf Ahmad <[EMAIL PROTECTED]> wrote: No coprocessors :-) >On Oct 19, 2012 1:21 PM, "Pamecha, Abhishek" <[EMAIL PROTECTED]> wrote: > >Also, I hope no coprocessors are in play. >> >>Thanks, >>Abhishek >> >> >>-----Original Message----- >>From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] >>Sent: Friday, October 19, 2012 10:12 AM >>To: [EMAIL PROTECTED] >>Cc: Ivan Brondino; Ricardo Vilaça >>Subject: Re: High IPC Latency >> >>Hi Lars, >> >>We are following your suggestion and testing against a single region server. We just ran a test against a remote region server and soon we will test against a local one as well. We will get back to you soon with the results. >> >>It will take us a couple of days to port to and test our code with 0.94.2. >>Once we have it working, we will run some experiments and update this thread. >> >>Unfortunately, the nature of our project is such that we cannot easily translate the benchmark's workload into a program executing the equivalent HBase operations directly. For this reason, I attempted to roughly translate the workload in terms of HBase operations in my first email and I attached a portion of the logs to be a bit more concrete. >> >>Your assistance is very much appreciated! Thank you! We'll keep you updated. >> >>Best regards, >>Yousuf >> >> >>On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: >> >>> Can you reproduce this against a single, local region server? >>> Any chance that you can try with the just released 0.94.2? >>> >>> >>> I would love to debug this. If would be a tremendous help if you had a >>> little test program that reproduces this against a single server, so >>> that I can see what is going on. >>> >>> Thanks. >>> >>> -- Lars >>> >>> >>> >>> ----- Original Message ----- >>> From: Yousuf Ahmad <[EMAIL PROTECTED]> >>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> >>> Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < >>> [EMAIL PROTECTED]> >>> Sent: Thursday, October 18, 2012 12:59 PM >>> Subject: Re: High IPC Latency >>> >>> Hi, >>> >>> Thank you for your questions guys. >>> >>> We are using HBase 0.92 with HDFS 1.0.1. >>> >>> The experiment lasts 15 minutes. The measurements stabilize in the >>> first two minutes of the run. >>> >>> The data is distributed almost evenly across the regionservers so each >>> client hits most of them over the course of the experiment. However, >>> for the data we have, any given multi-get or scan should touch only >>> one or at most two regions. >>> >>> The client caches the locations of the regionservers, so after a >>> couple of minutes of the experiment running, it wouldn't need to >>> re-visit ZooKeeper, I believe. Correct me if I am wrong please. >>> >>> Regards, >>> Yousuf >>> >>> >>> On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> >>> wrote: >>> >>> > Also, what version of HBase/HDFS is this using? >>> > >>> > >>> > >>> > >>> > ----- Original Message ----- >>> > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> >>> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> >>> > Cc +
lars hofhansl 2012-10-30, 05:14
-
Re: High IPC LatencyYousuf Ahmad 2012-10-30, 11:06
Hi Lars,
Thank you very much. With this fixed, I see the CPU at the regionserver become the bottleneck now. Today I will try 0.94 -- it has some optimizations at the regionserver for seeks -- to see if I can improve my benchmark numbers further. Regards, Yousuf On Tue, Oct 30, 2012 at 1:14 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Whoa... Why indeed? Maybe there was a historic reason. > I filed HBASE-7069 to fix it. > -- Lars > > ------------------------------ > *From:* Yousuf Ahmad <[EMAIL PROTECTED]> > *To:* [EMAIL PROTECTED] > *Cc:* Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > [EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> > *Sent:* Monday, October 29, 2012 6:07 PM > > *Subject:* Re: High IPC Latency > > Hi guys, > > I have an update. Actually a question. > > @Override > public synchronized Object[] batch(final List<Row> actions) throws > InterruptedException, IOException { > Object[] results = new Object[actions.size()]; > connection.processBatch(actions, tableName, pool, results); > return results; > } > > This synchronized method in HTable seems to be causing contention among > concurrent multigets. > > I suppose my question is: why must we have this method synchronized? > > Thanks, > Yousuf > > > On Fri, Oct 19, 2012 at 2:21 PM, Yousuf Ahmad <[EMAIL PROTECTED]> wrote: > > No coprocessors :-) > On Oct 19, 2012 1:21 PM, "Pamecha, Abhishek" <[EMAIL PROTECTED]> wrote: > > Also, I hope no coprocessors are in play. > > Thanks, > Abhishek > > > -----Original Message----- > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > Sent: Friday, October 19, 2012 10:12 AM > To: [EMAIL PROTECTED] > Cc: Ivan Brondino; Ricardo Vilaça > Subject: Re: High IPC Latency > > Hi Lars, > > We are following your suggestion and testing against a single region > server. We just ran a test against a remote region server and soon we will > test against a local one as well. We will get back to you soon with the > results. > > It will take us a couple of days to port to and test our code with 0.94.2. > Once we have it working, we will run some experiments and update this > thread. > > Unfortunately, the nature of our project is such that we cannot easily > translate the benchmark's workload into a program executing the equivalent > HBase operations directly. For this reason, I attempted to roughly > translate the workload in terms of HBase operations in my first email and I > attached a portion of the logs to be a bit more concrete. > > Your assistance is very much appreciated! Thank you! We'll keep you > updated. > > Best regards, > Yousuf > > > On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > Can you reproduce this against a single, local region server? > > Any chance that you can try with the just released 0.94.2? > > > > > > I would love to debug this. If would be a tremendous help if you had a > > little test program that reproduces this against a single server, so > > that I can see what is going on. > > > > Thanks. > > > > -- Lars > > > > > > > > ----- Original Message ----- > > From: Yousuf Ahmad <[EMAIL PROTECTED]> > > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > > [EMAIL PROTECTED]> > > Sent: Thursday, October 18, 2012 12:59 PM > > Subject: Re: High IPC Latency > > > > Hi, > > > > Thank you for your questions guys. > > > > We are using HBase 0.92 with HDFS 1.0.1. > > > > The experiment lasts 15 minutes. The measurements stabilize in the > > first two minutes of the run. > > > > The data is distributed almost evenly across the regionservers so each > > client hits most of them over the course of the experiment. However, > > for the data we have, any given multi-get or scan should touch only > > one or at most two regions. > > > > The client caches the locations of the regionservers, so after a > > couple of minutes of the experiment running, it wouldn't need to > > re-visit ZooKeeper, I believe. Correct me if I am wrong please. +
Yousuf Ahmad 2012-10-30, 11:06
-
Re: High IPC LatencyVincent Barat 2012-11-20, 18:57
Hi,
We face a very similar issue since we upgraded from 0.90.3 to 0.92. I'll test the patch very soon, and I hope to see the RS CPU become the bottleneck as you did... I report on this tomorrow. Cheers, Le 30/10/12 12:06, Yousuf Ahmad a �crit : > Hi Lars, > > Thank you very much. > > With this fixed, I see the CPU at the regionserver become the bottleneck > now. Today I will try 0.94 -- it has some optimizations at the regionserver > for seeks -- to see if I can improve my benchmark numbers further. > > Regards, > Yousuf > > > On Tue, Oct 30, 2012 at 1:14 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > >> Whoa... Why indeed? Maybe there was a historic reason. >> I filed HBASE-7069 to fix it. >> -- Lars >> >> ------------------------------ >> *From:* Yousuf Ahmad <[EMAIL PROTECTED]> >> *To:* [EMAIL PROTECTED] >> *Cc:* Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vila�a < >> [EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> >> *Sent:* Monday, October 29, 2012 6:07 PM >> >> *Subject:* Re: High IPC Latency >> >> Hi guys, >> >> I have an update. Actually a question. >> >> @Override >> public synchronized Object[] batch(final List<Row> actions) throws >> InterruptedException, IOException { >> Object[] results = new Object[actions.size()]; >> connection.processBatch(actions, tableName, pool, results); >> return results; >> } >> >> This synchronized method in HTable seems to be causing contention among >> concurrent multigets. >> >> I suppose my question is: why must we have this method synchronized? >> >> Thanks, >> Yousuf >> >> >> On Fri, Oct 19, 2012 at 2:21 PM, Yousuf Ahmad <[EMAIL PROTECTED]> wrote: >> >> No coprocessors :-) >> On Oct 19, 2012 1:21 PM, "Pamecha, Abhishek" <[EMAIL PROTECTED]> wrote: >> >> Also, I hope no coprocessors are in play. >> >> Thanks, >> Abhishek >> >> >> -----Original Message----- >> From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] >> Sent: Friday, October 19, 2012 10:12 AM >> To: [EMAIL PROTECTED] >> Cc: Ivan Brondino; Ricardo Vila�a >> Subject: Re: High IPC Latency >> >> Hi Lars, >> >> We are following your suggestion and testing against a single region >> server. We just ran a test against a remote region server and soon we will >> test against a local one as well. We will get back to you soon with the >> results. >> >> It will take us a couple of days to port to and test our code with 0.94.2. >> Once we have it working, we will run some experiments and update this >> thread. >> >> Unfortunately, the nature of our project is such that we cannot easily >> translate the benchmark's workload into a program executing the equivalent >> HBase operations directly. For this reason, I attempted to roughly >> translate the workload in terms of HBase operations in my first email and I >> attached a portion of the logs to be a bit more concrete. >> >> Your assistance is very much appreciated! Thank you! We'll keep you >> updated. >> >> Best regards, >> Yousuf >> >> >> On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <[EMAIL PROTECTED]> >> wrote: >> >>> Can you reproduce this against a single, local region server? >>> Any chance that you can try with the just released 0.94.2? >>> >>> >>> I would love to debug this. If would be a tremendous help if you had a >>> little test program that reproduces this against a single server, so >>> that I can see what is going on. >>> >>> Thanks. >>> >>> -- Lars >>> >>> >>> >>> ----- Original Message ----- >>> From: Yousuf Ahmad <[EMAIL PROTECTED]> >>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]> >>> Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vila�a < >>> [EMAIL PROTECTED]> >>> Sent: Thursday, October 18, 2012 12:59 PM >>> Subject: Re: High IPC Latency >>> >>> Hi, >>> >>> Thank you for your questions guys. >>> >>> We are using HBase 0.92 with HDFS 1.0.1. >>> >>> The experiment lasts 15 minutes. The measurements stabilize in the >>> first two minutes of the run. >>> >>> The data is distributed almost evenly across the regionservers so each +
Vincent Barat 2012-11-20, 18:57
-
RE: High IPC LatencyRamkrishna.S.Vasudevan 2012-10-19, 04:38
Hi Yousuf
> The client caches the locations of the regionservers, so after a couple > of > minutes of the experiment running, it wouldn't need to re-visit > ZooKeeper, > I believe. Correct me if I am wrong please. Yes you are right. Regards Ram > -----Original Message----- > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > Sent: Friday, October 19, 2012 1:30 AM > To: [EMAIL PROTECTED]; lars hofhansl > Cc: Ivan Brondino; Ricardo Vilaça > Subject: Re: High IPC Latency > > Hi, > > Thank you for your questions guys. > > We are using HBase 0.92 with HDFS 1.0.1. > > The experiment lasts 15 minutes. The measurements stabilize in the > first > two minutes of the run. > > The data is distributed almost evenly across the regionservers so each > client hits most of them over the course of the experiment. However, > for > the data we have, any given multi-get or scan should touch only one or > at > most two regions. > > The client caches the locations of the regionservers, so after a couple > of > minutes of the experiment running, it wouldn't need to re-visit > ZooKeeper, > I believe. Correct me if I am wrong please. > > Regards, > Yousuf > > > On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > Also, what version of HBase/HDFS is this using? > > > > > > > > > > ----- Original Message ----- > > From: "Pamecha, Abhishek" <[EMAIL PROTECTED]> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > > Cc: Ivan Brondino <[EMAIL PROTECTED]>; Ricardo Vilaça < > > [EMAIL PROTECTED]> > > Sent: Thursday, October 18, 2012 11:38 AM > > Subject: RE: High IPC Latency > > > > Is it sustained for the same client hitting the same region server OR > does > > it get better for the same client-RS combination when run for longer > > duration? Trying to eliminate Zookeeper from this. > > > > Thanks, > > Abhishek > > > > From: Yousuf Ahmad [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, October 18, 2012 11:26 AM > > To: [EMAIL PROTECTED] > > Cc: Ivan Brondino; Ricardo Vilaça > > Subject: High IPC Latency > > > > Hello, > > > > We are seeing slow times for read operations in our experiments. We > are > > hoping that you guys can help us figure out what's going wrong. > > > > Here are some details: > > > > * We are running a read-only benchmark on our HBase cluster. > > * > > * There are 10 regionservers, each co-located with a datanode. > HDFS > > replication is 3x. > > * All the data read by the experiment is already in the block > cache > > and the hit ratio is 99%. > > * > > * We have 10 clients, each with around 400 threads making a mix > of > > read-only requests involving multi-gets and scans. > > * > > * We settled on the default client pool type/size (roundrobin/1) > and a > > regionserver handler count of 100 after testing various combinations > to see > > what setting worked best. > > * > > * Our scans are short, fetching around 10 rows on average. > Scanner > > caching is set to 50. > > * An average row in a scan has either around 10 columns (small > row) or > > around 200 columns (big row). > > * > > * Our multi-gets fetch around 200 rows on average. > > * An average row in a multi-get has around 10 columns. > > * Each column holds an integer (encoded into bytes). > > * > > * None of the machines involved reach CPU, memory, or IO > saturation. > > In fact resource utilization stays quite low. > > * > > * Our statistics show that the average time for a scan, measured > > starting from the first scanner.next() call to the last one which > returns a > > null, is around 2-3 seconds. > > * Since we use scanner caching, the major portion of this time > (around > > 2 seconds) is spent on the first call to next(), while the remaining > calls > > take a negligible amount of time. > > * Similarly, we see that a multi-get on average takes around 2 > seconds. > > * A single get on average takes around 1 second. > > We are not sure what the bottleneck is or where it lies. We thought +
Ramkrishna.S.Vasudevan 2012-10-19, 04:38
|