HBase >> mail # user >> Rule of thumb: Size of data to send per RPC in a scan

Rule of thumb: Size of data to send per RPC in a scan

Is there a rule to determine the best batch/caching combination for
maximizing scan performance as a function of KV size and (average) number
of columns per row key?

I have 0.5 KB per value (constant) and an average of 10 values per row key.
The distribution is heavy-tailed, so some outlier rows have 100k KVs. There
are around 100 million rows in the table. The cluster consists of 30 region
servers with 24 GB of RAM each, connected over 1 Gbit links. I am running
Map/Reduce jobs on the table, also with 30 task trackers.

I tried:
cache 1, no batching -> 14 min
cache 1000, batch 50 -> 11 min
cache 5000, batch 25 -> crash (timeouts)
cache 2000, batch 25 -> 15 min
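
For comparison, here is a back-of-envelope sketch of how much data one scanner RPC would carry for each of these combinations. It assumes that caching counts Results per RPC, that batch caps the number of KVs in one Result, and that key and protocol overhead can be ignored (only the 0.5 KB values are counted). The function name is made up for illustration; it is not part of any HBase API.

```python
KV_BYTES = 512  # 0.5 KB per value; key and protocol overhead ignored (assumption)


def rpc_payload_bytes(caching, batch, kvs_per_row, kv_bytes=KV_BYTES):
    """Rough payload of one scanner RPC, in bytes."""
    # With batching, one Result holds at most `batch` KVs of a row;
    # without batching (batch=None) one Result holds the whole row.
    kvs_per_result = kvs_per_row if batch is None else min(batch, kvs_per_row)
    # Caching is taken to be the number of Results returned per RPC.
    return caching * kvs_per_result * kv_bytes


# The four combinations tried above: (caching, batch)
combos = [(1, None), (1000, 50), (5000, 25), (2000, 25)]

for caching, batch in combos:
    typical = rpc_payload_bytes(caching, batch, kvs_per_row=10)       # average row
    outlier = rpc_payload_bytes(caching, batch, kvs_per_row=100_000)  # heavy-tail row
    print(f"cache={caching:>5} batch={batch}: "
          f"typical {typical / 1e6:.2f} MB, outlier rows {outlier / 1e6:.2f} MB")
```

Under these assumptions the cache 5000 / batch 25 run would be asking for the largest responses of the four, which may be related to the timeouts, though the estimate alone does not prove that.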

Job time can vary quite significantly depending on whatever activity
(compactions?) is going on in the background. Also, I cannot probe for the
best combination indefinitely since there are actual production jobs queued.
I did expect a larger speed-up compared to no caching/batching at all -
is this unjustified?

In short, I am looking for some tips for making scans in a Map/Reduce
context faster :-)

Thank you,

Ted Yu 2013-01-26, 00:14
David Koch 2013-01-27, 22:25