HBase >> mail # user >> Efficient way to sample from large HBase table.


David Koch 2012-10-12, 15:04
RE: Efficient way to sample from large HBase table.
Although I have no idea of your use case, I would be surprised if you needed to stop at exactly the 1M mark while sampling.

Here is one approach you might use:
If you store the total row count separately (say 90M), you can randomly pick 1 in 90 rows in an MR job that does a global scan. If your keys are uniformly distributed, you can use mod-ranges and prefix filters to achieve that. This way, you don't have to instrument your MR jobs to monitor their progress.
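A minimal sketch of the 1-in-90 idea in plain Java, outside of any HBase or MapReduce API. The class and method names are hypothetical. The CRC32 hash step is an assumption added here so that the modulus also works on non-numeric keys; with keys that are already uniformly distributed numbers, the modulus could be taken on the key itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: ModSampler is a hypothetical helper, not part of HBase.
final class ModSampler {
    // Number of buckets such that one bucket holds roughly `target` of `total` rows.
    static long buckets(long total, long target) {
        if (target <= 0 || total < target) {
            throw new IllegalArgumentException("need 0 < target <= total");
        }
        return total / target; // e.g. 90M / 1M = 90
    }

    // Keep a row when the hash of its key lands in bucket 0.
    // Assumption: CRC32 spreads the keys evenly enough for sampling purposes.
    static boolean keep(byte[] rowKey, long buckets) {
        CRC32 crc = new CRC32();
        crc.update(rowKey, 0, rowKey.length);
        return crc.getValue() % buckets == 0;
    }

    public static void main(String[] args) {
        long b = buckets(90_000_000L, 1_000_000L);
        int n = 900_000; // small made-up key set, just to exercise the code
        int kept = 0;
        for (int i = 0; i < n; i++) {
            byte[] key = ("row-" + i).getBytes(StandardCharsets.UTF_8);
            if (keep(key, b)) {
                kept++;
            }
        }
        System.out.println("buckets=" + b + ", kept " + kept + " of " + n);
    }
}
```

Because the decision depends only on the row key, every mapper can apply it independently and the same rows are selected on every run. That is convenient for reproducibility, but it is not a fresh random sample each time.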

A drawback of this approach, though, is that it is a full scan. You can keep the basic idea above and restrict the global scan to a more limited one for efficiency, at the cost of sampling randomness.
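To make the trade-off concrete, here is a small arithmetic sketch (hypothetical names, plain Java, no HBase calls). It assumes you know roughly how many rows the restricted scan will read; the fewer rows scanned, the higher the keep rate has to be to reach the same target.

```java
// Illustrative only: RestrictedScanRate is a hypothetical helper, not part of HBase.
final class RestrictedScanRate {
    // Chance with which each scanned row must be kept to reach about `target` rows.
    static double keepChance(long scannedRows, long target) {
        if (scannedRows <= 0 || target <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // Cap at 1.0: a scan smaller than the target cannot deliver the target.
        return Math.min(1.0, (double) target / scannedRows);
    }

    public static void main(String[] args) {
        long target = 1_000_000L;
        System.out.println("full scan (90M rows): " + keepChance(90_000_000L, target));
        System.out.println("10% scan  (9M rows):  " + keepChance(9_000_000L, target));
        System.out.println("tiny scan (500K rows): " + keepChance(500_000L, target));
    }
}
```

Scanning a tenth of the table means keeping about 1 row in 9 instead of 1 in 90, so the sample is drawn from a narrower slice of the key space. Whether that bias matters depends on how the keys relate to the data.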

hth,
Abhishek

-----Original Message-----
From: David Koch [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 12, 2012 8:05 AM
To: [EMAIL PROTECTED]
Subject: Efficient way to sample from large HBase table.

Hello,

I need to sample 1 million rows from a large HBase table. What is an efficient way of doing this?

I thought about using a RandomRowFilter on a scan of the source table, in combination with a Mapper, to get approximately the right number of rows.
However, since MapReduce counters cannot be reliably retrieved while the job is running, I would need an external counter to keep track of the number of sampled records and stop the job at 1 million.
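As a rough check on how far "approximately" can drift, the sketch below computes the expected sample size and its standard deviation. It assumes each row is kept independently with a fixed chance, which gives a binomial count. The 90 million table size is the hypothetical figure from the reply, not a known property of the table, and the class name is made up.

```java
// Illustrative arithmetic only: SampleSizeSpread is a hypothetical helper.
final class SampleSizeSpread {
    // Expected number of rows kept when each of `total` rows is kept with `chance`.
    static double expected(long total, double chance) {
        return total * chance;
    }

    // Standard deviation of that count under the binomial model: sqrt(n * p * (1 - p)).
    static double stdDev(long total, double chance) {
        return Math.sqrt(total * chance * (1.0 - chance));
    }

    public static void main(String[] args) {
        long total = 90_000_000L;  // assumed table size
        double chance = 1.0 / 90;  // chance needed for about 1M rows
        System.out.printf("expected rows: %.0f%n", expected(total, chance));
        System.out.printf("std deviation: %.1f%n", stdDev(total, chance));
    }
}
```

Under these assumptions the standard deviation is on the order of a thousand rows, about 0.1% of the target, so a purely probabilistic filter may already land close enough to 1 million without an external counter.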

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter on the scan and then open a connection to the source table inside each mapper to retrieve the values for the row key.

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,

/David