HBase >> mail # user >> Efficient way to sample from large HBase table.


I need to sample 1 million rows from a large HBase table. What is an
efficient way of doing this?

I thought about using a RandomRowFilter on a scan of the source table, in
combination with a Mapper, to get approximately the right number of rows.
However, since MapReduce counters cannot be reliably retrieved while a job
is running, I would need an external counter to keep track of the number of
sampled records and stop the job at 1 million.
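To illustrate why the count needs external tracking: RandomRowFilter keeps each row independently with a fixed probability, so the resulting sample size is only approximate. A minimal, self-contained simulation of that behaviour (plain Java, no HBase; the row counts here are placeholders, not the real table size):

```java
import java.util.Random;

public class SamplingSketch {
    // The per-row inclusion probability you would hand to RandomRowFilter:
    // desired sample size divided by the (estimated) table size.
    static float chance(long targetRows, long estimatedTableRows) {
        return (float) targetRows / estimatedTableRows;
    }

    // Bernoulli sampling over rows: each row is kept independently with the
    // given probability, so the number of kept rows varies run to run.
    static long simulate(long totalRows, float chance, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < totalRows; i++) {
            if (rng.nextFloat() < chance) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long total = 1_000_000;   // stand-in for the large table
        long target = 10_000;     // stand-in for the 1M sample target
        float p = chance(target, total);
        long kept = simulate(total, p, 42L);
        System.out.println("chance=" + p + " kept=" + kept);
    }
}
```

The kept count lands near the target but rarely exactly on it, which is why stopping at exactly 1 million requires counting outside the filter.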

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter
on the scan, and then open a connection to the source table inside each
mapper to retrieve the values for each sampled row key.
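For reference, the scan for that variation might be set up along these lines. This is only a sketch against the HBase client API; the table name, mapper class, and chance value are placeholders, and configuration/error handling is omitted:

```java
// Sample row keys with RandomRowFilter + KeyOnlyFilter, then re-fetch
// full rows inside the mapper.
float chance = 0.01f;  // target sample size / estimated table size

Scan scan = new Scan();
scan.setCacheBlocks(false);  // a full-table scan should not pollute the block cache
scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
    new RandomRowFilter(chance),   // keep each row with probability `chance`
    new KeyOnlyFilter()));         // strip values; only row keys reach the mapper

TableMapReduceUtil.initTableMapperJob(
    "source_table", scan, SampleMapper.class,
    ImmutableBytesWritable.class, Result.class, job);

// Inside SampleMapper.map(), re-read the full row for each sampled key,
// e.g. table.get(new Get(rowKey.get())), and write it to the output.
```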

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,