Let's take a step back.

If you're writing your own code, i.e. your own m/r program, you will get one input split per region.
If your scan doesn’t contain a start or stop row, you will scan every row in the table.

The splits provide parallelism.
So if you launch your job against a table with 10 regions, you'll get 10 splits.

Going from memory: if your scan has a start/stop row, then for any region that holds no data in range (e.g. the region's start row isn't inside the scope of your scan), the mapper created for that region completes quickly; no rows are scanned or returned in the result set.
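To make that concrete, here's a small self-contained Java sketch of the idea. This is illustrative only, not the HBase API: the `Region` class, the boundary values, and the `overlaps` helper are all made up for this example. It just shows why a mapper whose region falls entirely outside the scan's [startRow, stopRow) range has nothing to do:

```java
import java.util.Arrays;
import java.util.List;

public class SplitOverlap {
    // Hypothetical stand-in for a region: covers [start, end).
    // An empty string for end means "through the last row of the table".
    static class Region {
        final String start, end;
        Region(String start, String end) { this.start = start; this.end = end; }
    }

    // Does the region's key range overlap the scan range [scanStart, scanStop)?
    // A mapper for a non-overlapping region would scan and return zero rows.
    static boolean overlaps(Region r, String scanStart, String scanStop) {
        boolean startsBeforeScanStops =
                scanStop.isEmpty() || r.start.compareTo(scanStop) < 0;
        boolean endsAfterScanStarts =
                r.end.isEmpty() || r.end.compareTo(scanStart) > 0;
        return startsBeforeScanStops && endsAfterScanStarts;
    }

    public static void main(String[] args) {
        // A table with 4 regions -> 4 splits -> 4 mappers.
        List<Region> regions = Arrays.asList(
                new Region("",     "row3"),
                new Region("row3", "row6"),
                new Region("row6", "row9"),
                new Region("row9", ""));

        // A scan restricted to [row4, row7): only the middle two
        // regions intersect it, so only those two mappers do real work.
        String scanStart = "row4", scanStop = "row7";
        for (Region r : regions) {
            System.out.printf("region [%s, %s): %s%n",
                    r.start, r.end,
                    overlaps(r, scanStart, scanStop)
                            ? "mapper scans rows"
                            : "mapper completes immediately, zero rows");
        }
    }
}
```

Under those made-up boundaries, the first and last mappers finish right away, which is the behavior described above.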

I think what you’re looking for is already done for you.


The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com