I need to sample 1 million rows from a large HBase table. What is an
efficient way of doing this?
I thought about using a RandomRowFilter on a scan of the source table, in
combination with a Mapper, to get approximately the right number of rows.
However, since MapReduce counters cannot be reliably retrieved while the job
is running, I would need an external counter to keep track of the number of
sampled records and stop the job at 1 million.
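
To make "approximately the right number of rows" concrete, here is a minimal
sketch of how the chance passed to RandomRowFilter could be derived. It
assumes the total row count of the table is roughly known; the class name
SampleChance, the method chanceFor, the oversample parameter and the table
size of 2 billion rows are all made up for illustration and are not part of
HBase.

```java
public final class SampleChance {
    private SampleChance() {}

    /**
     * Returns the per-row acceptance probability for a random row filter.
     *
     * targetRows    number of rows wanted in the sample
     * estimatedRows assumed (approximate) number of rows in the table
     * oversample    factor >= 1.0 that adds headroom, so the scan is
     *               unlikely to return fewer rows than targetRows
     */
    public static float chanceFor(long targetRows, long estimatedRows, double oversample) {
        if (targetRows <= 0 || estimatedRows <= 0 || oversample < 1.0) {
            throw new IllegalArgumentException("invalid sampling parameters");
        }
        double chance = oversample * targetRows / (double) estimatedRows;
        // A probability cannot exceed 1.0 (table smaller than the target).
        return (float) Math.min(1.0, chance);
    }

    public static void main(String[] args) {
        long targetRows = 1_000_000L;
        long estimatedRows = 2_000_000_000L; // hypothetical table size
        float chance = chanceFor(targetRows, estimatedRows, 1.1);
        System.out.println("chance = " + chance);
        // With HBase on the classpath, this value would be used as:
        //   scan.setFilter(new RandomRowFilter(chance));
    }
}
```

With some oversampling, the scan should return slightly more than 1 million
rows, and the surplus could be trimmed after the job finishes. That might
avoid the need for an external counter, at the cost of reading a few extra
rows.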
A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter
on the scan and then open a connection to the source table inside each
mapper to retrieve the values for the row key.
If there is a simpler, more efficient way, I would be glad to hear about it.