Rule of thumb: Size of data to send per RPC in a scan
Hello,

Is there a rule of thumb for determining the best batch/caching combination
for maximizing scan performance, as a function of KV size and the (average)
number of columns per row key?

I have 0.5 KB per value (constant) and an average of 10 values per row key -
heavy-tailed, so some outliers have 100k KVs - and around 100 million rows in
the table. The cluster consists of 30 region servers with 24 GB of RAM each,
and the nodes are connected over a 1 Gbit link. I am running Map/Reduce jobs
on the table, also with 30 task trackers.

I tried (set up roughly as in the sketch below):
cache 1, no batching -> 14 min
cache 1000, batch 50 -> 11 min
cache 5000, batch 25 -> crash (timeouts)
cache 2000, batch 25 -> 15 min
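
For reference, here is roughly how the caching/batch values get wired into
the scan job - a simplified sketch rather than the real job, with the table
name and mapper as placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanJob {

  // Placeholder mapper - the real job does its per-row work here.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      // process the (possibly partial, when batch is set) row here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-job");
    job.setJarByClass(ScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(1000);      // rows (or row chunks) fetched per RPC
    scan.setBatch(50);          // max KVs per Result, splits very wide rows
    scan.setCacheBlocks(false); // a full scan shouldn't churn the block cache

    TableMapReduceUtil.initTableMapperJob(
        "mytable",              // placeholder table name
        scan,
        MyMapper.class,
        ImmutableBytesWritable.class,
        NullWritable.class,
        job);

    job.setNumReduceTasks(0);   // map-only
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

My back-of-the-envelope reading is that each scanner RPC carries roughly
caching x batch x KV size, so cache 1000 / batch 50 is about
1000 * 50 * 0.5 KB = 25 MB per RPC, and cache 5000 / batch 25 is about
62.5 MB - presumably why that combination ran into timeouts.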

Job time can vary quite significantly depending on whatever activity
(compactions?) is going on in the background. Also, I cannot probe for the
best combination indefinitely, since there are actual production jobs queued.
I did expect a larger speed-up compared to no caching/batching at all -
is that expectation unjustified?

In short, I am looking for some tips for making scans in a Map/Reduce
context faster :-)

Thank you,

/David