-TableInputFormat vs. a map of table regions (data locality)
Saptarshi Guha 2010-11-18, 04:30
I'm fairly new to HBase and would appreciate your comments.
 One way to compute across an HBase dataset would be to run as many
maps as there are regions: in each map, run a scan across that region's row
limits (inside the map method). This approach does not use TableInputFormat.
In the reduce (if needed), write directly to the table using Put.
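A minimal sketch of what I mean by this first approach, assuming the current (0.20-era) HBase client API and that each map task somehow receives its region's start/stop rows as input (the mapper class, table name, and input key/value wiring here are all made up for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the first approach: the map input carries one
// region's [startRow, stopRow) boundaries, and the map method issues its
// own client-side scan. TableInputFormat is not involved, so nothing ties
// this task to the node hosting the region.
public class RegionScanMapper
    extends Mapper<Text, Text, NullWritable, NullWritable> {

  @Override
  protected void map(Text startRow, Text stopRow, Context context)
      throws IOException, InterruptedException {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Scan scan = new Scan(Bytes.toBytes(startRow.toString()),
                         Bytes.toBytes(stopRow.toString()));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // Process each row here; the scan RPCs may well go to a remote
        // region server, since the scheduler knows nothing about regions.
        context.getCounter("app", "rows").increment(1);
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
```

This only runs against a live HBase cluster, so treat it as a sketch of the idea rather than working code.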
 In the *second* approach I could use TableInputFormat and TableOutputFormat.
H1: As for TableOutputFormat, I think both approaches are equivalent
performance-wise. Correct me if I'm wrong.
H2: As for TableInputFormat vs. the first approach: a quick glance through the
TableSplit source reveals it carries location information. At first blush I can
imagine that in the first approach I scan from row_start to row_end, while all
of that data resides on a machine different from the compute node on which the
split is being run. Since TableInputFormat (the second approach) uses region
information, my guess (not sure at all) is that Hadoop MapReduce will assign
the computation to the node where the region lies, so when the scan is issued
the queries will run against local data, achieving data locality. It therefore
makes sense to take advantage of (at the least) the TableSplit information.
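For comparison, here is how I understand the second approach would be wired up, using TableMapReduceUtil to install TableInputFormat (whose TableSplits expose the region server location to the scheduler). The job and mapper names, table name, and scan tuning values are my own illustrative choices:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class LocalityAwareScanJob {

  // Hypothetical mapper: TableInputFormat feeds it the rows of one region,
  // and MapReduce tries to schedule the task on that region's server.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row,
        Context context) {
      // per-row processing goes here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "locality-aware-scan");
    job.setJarByClass(LocalityAwareScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPC round trips per map
    scan.setCacheBlocks(false);  // don't churn the block cache from MR

    // Installs TableInputFormat; its TableSplit.getLocations() reports the
    // hosting region server, which the scheduler uses for data locality.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Again, this is cluster-dependent job setup rather than something runnable standalone.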
Are my hypotheses correct?