HBase, mail # user - Efficient way to sample from large HBase table.

David Koch 2012-10-12, 15:04

I need to sample 1 million rows from a large HBase table. What is an
efficient way of doing this?

I thought about using a RandomRowFilter on a scan of the source table, in
combination with a Mapper, to get approximately the right number of rows.
However, since MapReduce counters cannot be reliably retrieved while the job
is running, I would need an external counter to keep track of the number of
sampled records and stop the job at 1 million.
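
To make the "approximately the right number of rows" part concrete, here is a
minimal sketch in plain Java (no HBase dependency) of how the chance passed to
a RandomRowFilter might be chosen. The class and method names, the 50 million
row estimate, and the 2% oversampling margin are all assumptions for
illustration. The number of rows kept by an independent per-row coin flip is
binomially distributed, so for a 1 million row target its standard deviation
is roughly 1,000 rows. A small margin should therefore make it very likely
that at least 1 million rows come back, and the surplus can be trimmed
afterwards.

```java
import java.util.Random;

public class SampleChance {

    // Hypothetical helper: fraction of rows a per-row random filter should keep
    // so that about (oversample * targetRows) rows survive the scan.
    static float chanceFor(long targetRows, long estimatedTotalRows, double oversample) {
        if (targetRows <= 0 || estimatedTotalRows <= 0 || oversample <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        double p = oversample * targetRows / (double) estimatedTotalRows;
        return (float) Math.min(1.0, p); // a probability cannot exceed 1
    }

    // Simulates the filter: each of totalRows is kept independently with
    // probability chance. Returns how many rows were kept.
    static long simulateKept(long totalRows, float chance, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < totalRows; i++) {
            if (rng.nextFloat() < chance) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long target = 1_000_000L;
        long estimatedTotal = 50_000_000L; // assumed table size, not from the post
        float chance = chanceFor(target, estimatedTotal, 1.02);
        long kept = simulateKept(estimatedTotal, chance, 42L);
        System.out.printf("chance=%.4f kept=%d surplus=%d%n", chance, kept, kept - target);
    }
}
```

With this approach the job would not need to be stopped at exactly 1 million.
It could run to completion and the output could be cut down to the target size
in a final step.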

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter
on the scan and then open a connection to the source table inside each
mapper to retrieve the values for each sampled row key.
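
The scan for this variation might be set up roughly as follows. This is only
a sketch that assumes the HBase client library is on the classpath. The class
name, the method name, and the caching value of 500 are illustrative choices,
and the chance argument is whatever sampling fraction has been chosen for the
table.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;

public class KeyOnlySampleScan {

    // Builds a scan that keeps each row with probability `chance`
    // and returns only the keys of the rows that are kept.
    static Scan build(float chance) {
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new RandomRowFilter(chance)); // per-row random selection
        filters.addFilter(new KeyOnlyFilter());         // strip values, keep keys

        Scan scan = new Scan();
        scan.setFilter(filters);
        scan.setCacheBlocks(false); // a one-off full scan should not churn the block cache
        scan.setCaching(500);       // rows per RPC; assumed value, tune as needed
        return scan;
    }
}
```

Both filters are evaluated on the region servers, so only the keys of the
selected rows are sent to the mappers. Each mapper would then issue a Get per
sampled key to fetch the values.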

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,