Rule of thumb: Size of data to send per RPC in a scan
David Koch 2013-01-25, 23:59
Hello,
Is there a rule to determine the best batch/caching combination for maximizing scan performance as a function of KV size and (average) number of columns per row key?
I have 0.5 KB per value (constant) and an average of 10 values per row key. The distribution is heavy-tailed, so some outliers have 100k KVs. The table has around 100 million rows. The cluster consists of 30 region servers with 24 GB of RAM each, and the nodes are connected over 1 Gbit links. I am running Map/Reduce jobs on the table, also with 30 task trackers.
I tried:
cache 1, no batching -> 14 min
cache 1000, batch 50 -> 11 min
cache 5000, batch 25 -> crash (timeouts)
cache 2000, batch 25 -> 15 min
Job time can vary quite significantly depending on whatever activity (compactions?) is going on in the background. Also, I cannot probe for the best combination indefinitely, since there are actual production jobs queued. I did expect a larger speed-up compared with no caching/batching at all - is this unjustified?
In short, I am looking for some tips for making scans in a Map/Reduce context faster :-)
Thank you,
/David
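One way to compare the combinations tried above is to estimate how many bytes a single scanner RPC carries. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes caching counts Results per RPC, that batch caps the number of KVs per Result, that 0.5 KB means 512 bytes, and it ignores key and framing overhead. The function names are made up for illustration.

```python
KV_BYTES = 512            # assumption: "0.5 KB per value", overhead ignored
AVG_KVS_PER_ROW = 10      # average row width from the post
OUTLIER_KVS = 100_000     # widest rows mentioned in the post


def kvs_per_result(row_kvs, batch):
    """KVs in one Result: the whole row, or at most `batch` when batching is set."""
    return row_kvs if batch is None else min(row_kvs, batch)


def rpc_bytes(caching, batch, row_kvs, kv_bytes=KV_BYTES):
    """Approximate payload of one scanner RPC holding `caching` equal-sized Results."""
    return caching * kvs_per_result(row_kvs, batch) * kv_bytes


# (caching, batch) pairs from the post; None means no batching.
tried = [(1, None), (1000, 50), (5000, 25), (2000, 25)]

for caching, batch in tried:
    typical = rpc_bytes(caching, batch, AVG_KVS_PER_ROW)
    worst = rpc_bytes(caching, batch, OUTLIER_KVS)
    print(f"caching={caching:>4} batch={batch}: "
          f"typical {typical / 1e6:.2f} MB, worst case {worst / 1e6:.2f} MB")
```

Under these assumptions, the worst-case figure is the one to watch: it is what a run of outlier rows would force into a single RPC, and it grows with the product of caching and batch.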
Re: Rule of thumb: Size of data to send per RPC in a scan
Ted Yu 2013-01-26, 00:14
Looks like HBASE-2214 ('Do HBASE-1996 -- setting size to return in scan rather than count of rows -- properly') may help you, but that is only in 0.96.
Lars H presented some performance numbers in HBASE-7008 ('Set scanner caching to a better default, disable Nagles'), where the default for "hbase.client.scanner.caching" was changed to 100.
Cheers
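On versions without a size-based scan limit, a rough stand-in is to pick a byte budget per RPC and derive a caching value from it. The sketch below is only an approximation under stated assumptions: the 1 MB budget is an arbitrary illustrative figure, 0.5 KB is taken as 512 bytes per KV, overhead is ignored, and `caching_for_budget` is a hypothetical helper, not an HBase API.

```python
def caching_for_budget(target_bytes, kvs_per_result, kv_bytes=512):
    """Number of Results that fit in `target_bytes`, never less than 1."""
    per_result = kvs_per_result * kv_bytes
    return max(1, target_bytes // per_result)


ONE_MB = 1024 * 1024  # illustrative budget, not a recommended value

# Typical rows: 10 KVs each, no Result is cut short by batching.
print(caching_for_budget(ONE_MB, kvs_per_result=10))

# Wide rows with batch=25: every Result is assumed to carry 25 KVs.
print(caching_for_budget(ONE_MB, kvs_per_result=25))
```

For comparison, a caching value of 100 with 10 KVs of 512 bytes per row works out to roughly 512 KB per RPC under the same assumptions.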
Re: Rule of thumb: Size of data to send per RPC in a scan
David Koch 2013-01-27, 22:25
Hello Ted,
Thank you for the link.
/David