HBase >> mail # user >> Rule of thumb: Size of data to send per RPC in a scan


Re: Rule of thumb: Size of data to send per RPC in a scan
Hello Ted,

Thank you for the link.

/David

On Sat, Jan 26, 2013 at 1:14 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Looks like HBASE-2214 'Do HBASE-1996 -- setting size to return in scan
> rather than count of rows -- properly' may help you.
> But that is only in 0.96
>
> Lars H presented some performance numbers in:
>   HBASE-7008 Set scanner caching to a better default, disable Nagles
> where default for "hbase.client.scanner.caching" changed to 100
>
> Cheers
>
> On Fri, Jan 25, 2013 at 3:59 PM, David Koch <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > Is there a rule to determine the best batch/caching combination for
> > maximizing scan performance as a function of KV size and (average) number
> > of columns per row key?
> >
> > I have 0.5 KB per value (constant) and an average of 10 values per row key,
> > heavy-tailed, so some outliers have 100k KVs. The table has around 100 million
> > rows. The cluster consists of 30 region servers with 24 GB of RAM each; nodes
> > are connected by a 1 Gbit link. I am running Map/Reduce jobs on the
> > table, also with 30 task trackers.
> >
> > I tried:
> > cache 1, no batching -> 14 min
> > cache 1000, batch 50 -> 11 min
> > cache 5000, batch 25 -> crash (timeouts)
> > cache 2000, batch 25 -> 15 min
> >
> > Job time can vary quite significantly depending on whatever activity
> > (compactions?) is going on in the background. Also, I cannot probe for the
> > best combination indefinitely since there are actual production jobs queued. I
> > did expect a larger speed-up compared with no caching/batching at all -
> > is this unjustified?
> >
> > In short, I am looking for some tips for making scans in a Map/Reduce
> > context faster :-)
> >
> > Thank you,
> >
> > /David
> >
>
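
A rough back-of-the-envelope sketch of the question in this thread: how much value data a single scanner RPC carries for a given caching/batch combination. It assumes that caching is the number of Results returned per RPC (Scan.setCaching) and that batch caps the number of cells per Result (Scan.setBatch). It also takes 0.5 KB as 512 bytes and ignores row key, family, qualifier and timestamp overhead, so real payloads will be larger. The function name and parameters are illustrative only and are not part of HBase.

```python
def rpc_payload_bytes(caching, batch, cells_per_row, value_bytes):
    """Estimate the value payload (in bytes) of one scanner RPC.

    caching       -- Results fetched per RPC (assumed: Scan.setCaching)
    batch         -- max cells per Result (assumed: Scan.setBatch), None = no batching
    cells_per_row -- number of KVs in the rows being scanned
    value_bytes   -- size of one value; key overhead is ignored
    """
    if batch is None:
        # Without batching, a Result holds the whole row.
        cells_per_result = cells_per_row
    else:
        # With batching, a Result holds at most `batch` cells of a row.
        cells_per_result = min(batch, cells_per_row)
    return caching * cells_per_result * value_bytes


VALUE_BYTES = 512        # assumption: "0.5 KB" taken as 512 bytes
AVERAGE_ROW = 10         # average KVs per row, from the thread
OUTLIER_ROW = 100_000    # heavy-tail outlier, from the thread

# The caching/batch combinations tried in the thread.
settings = [(1, None), (1000, 50), (5000, 25), (2000, 25)]

for caching, batch in settings:
    avg = rpc_payload_bytes(caching, batch, AVERAGE_ROW, VALUE_BYTES)
    worst = rpc_payload_bytes(caching, batch, OUTLIER_ROW, VALUE_BYTES)
    print(f"caching={caching} batch={batch}: "
          f"average rows ~{avg / 1e6:.2f} MB, outlier rows ~{worst / 1e6:.2f} MB")
```

Under these assumptions, a single 100k-KV outlier row scanned without batching already amounts to about 51 MB in one Result, and cache 5000 / batch 25 can reach roughly 64 MB per RPC on wide rows. That may be related to the timeouts reported above, although the thread does not confirm the cause.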