Efficient way to sample from large HBase table.
Hello,

I need to sample 1 million rows from a large HBase table. What is an efficient way of doing this?

I thought about using a RandomRowFilter on a scan of the source table, in combination with a Mapper, to get approximately the right number of rows. However, since MapReduce counters cannot be reliably retrieved while the job is running, I would need an external counter to keep track of the number of sampled records and stop the job at 1 million.
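Roughly what I have in mind, sketched against the 0.94-era API; the table name "big_table", the row-count estimate, and the output handling are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SampleJob {

  // Hypothetical figures; adjust to the real table.
  static final long ROW_COUNT = 1000000000L;  // estimated rows in the source table
  static final long SAMPLE_SIZE = 1000000L;   // rows we want in the sample

  static class SampleMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      // Count sampled rows. As noted above, this counter is only reliable
      // after the job finishes, hence the need for an external counter if
      // the job is to be stopped at exactly 1 million.
      context.getCounter("sampling", "rows").increment(1);
      // ... write the sampled row wherever it needs to go ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "sample-1m-rows");
    job.setJarByClass(SampleJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPC round trips on a full scan
    scan.setCacheBlocks(false);  // don't pollute the block cache
    // RandomRowFilter keeps each row with the given probability, so the
    // chance is the target sample size over the estimated table size.
    scan.setFilter(new RandomRowFilter((float) SAMPLE_SIZE / ROW_COUNT));

    TableMapReduceUtil.initTableMapperJob(
        "big_table", scan, SampleMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}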

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter on the scan, and then open a connection to the source table inside each mapper to retrieve the full rows for the sampled keys.
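For the variation, something like the mapper below, under the same assumptions as above: the scan delivers only sampled keys, each mapper opens one connection to the source table in setup() rather than per row, and the FilterList requires a row to pass both filters:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class KeyOnlySampleMapper extends TableMapper<ImmutableBytesWritable, Result> {

  private HTable sourceTable;

  @Override
  protected void setup(Context context) throws IOException {
    // One connection to the source table per mapper task.
    sourceTable = new HTable(context.getConfiguration(), "big_table");
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    // The scan delivered only the row key; fetch the full row now.
    // copyBytes() avoids dragging along any extra backing-array bytes.
    Result fullRow = sourceTable.get(new Get(key.copyBytes()));
    // ... emit or store fullRow ...
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    sourceTable.close();
  }

  // Scan setup combining both filters; a row must pass both to be returned.
  public static Scan buildScan(float chance) {
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
        Arrays.<Filter>asList(new RandomRowFilter(chance), new KeyOnlyFilter())));
    return scan;
  }
}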

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,

/David