David Koch 2012-10-12, 15:04
Although I have no idea of your use case, I would be surprised if you needed the sample to stop at exactly the 1M mark.
Here is one approach you might use:
If you store the total row count separately, say 90M, then your MR job can do a global scan and randomly pick 1 in 90 rows. If your keys are uniformly distributed, you can use mod-ranges and prefix filters to achieve that. This way, you don't have to instrument your MR jobs to monitor their progress.
A drawback of this approach is that it is a full scan. You could use the same basic idea and restrict the global scan to a more limited one for efficiency, at the cost of sampling randomness.
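To make the 1-in-90 idea concrete, here is a rough sketch in plain Java with no HBase dependency. Everything in it is an assumption for illustration: the class SamplingSketch and its methods are hypothetical names, the 90M total is the example figure from above, and the CRC32 hash of the row key is swapped in for the "mod" test so that the selection does not depend on the keys themselves being uniformly distributed. It also estimates how far the sample size is likely to drift from the target when each row is kept independently.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of HBase: illustrates "pick 1 in N rows" sampling.
final class SamplingSketch {

    // Fraction of rows to keep, e.g. 1M out of 90M is 1/90.
    static double sampleFraction(long targetRows, long totalRows) {
        if (targetRows <= 0 || totalRows <= 0) {
            throw new IllegalArgumentException("row counts must be positive");
        }
        return Math.min(1.0, (double) targetRows / totalRows);
    }

    // Standard deviation of the sample size when every row is kept
    // independently with the given probability (binomial distribution).
    static double sampleSizeStdDev(long totalRows, double fraction) {
        return Math.sqrt(totalRows * fraction * (1.0 - fraction));
    }

    // Deterministic "mod" selection: keep the row when hash(key) mod buckets == 0.
    // Hashing the key is an assumption added here; it is not an HBase filter.
    static boolean keepRow(byte[] rowKey, long buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(rowKey, 0, rowKey.length);
        return crc.getValue() % buckets == 0; // getValue() is never negative
    }

    public static void main(String[] args) {
        long total = 90_000_000L;  // assumed total row count, stored separately
        long target = 1_000_000L;  // desired sample size
        double fraction = sampleFraction(target, total);
        long buckets = Math.round(1.0 / fraction);
        System.out.printf("fraction=%.5f buckets=%d stddev=%.0f%n",
                fraction, buckets, sampleSizeStdDev(total, fraction));

        // Try the key test on a small batch of synthetic row keys.
        int kept = 0;
        for (int i = 0; i < 90_000; i++) {
            byte[] key = String.format("row-%08d", i).getBytes(StandardCharsets.UTF_8);
            if (keepRow(key, buckets)) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of 90000 synthetic keys");
    }
}
```

In a mapper you would call something like keepRow on each incoming row key and emit only the rows it accepts. The fraction is also the kind of value you would hand to a RandomRowFilter as its chance. With independent sampling at 1/90 over 90M rows, the spread of the sample size is on the order of a thousand rows, which is why stopping at exactly 1M is probably not worth an external counter.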
From: David Koch [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 12, 2012 8:05 AM
To: [EMAIL PROTECTED]
Subject: Efficient way to sample from large HBase table.
I need to sample 1 million rows from a large HBase table. What is an efficient way of doing this?
I thought about a RandomRowFilter on a scan of the source table to get approximately the right number of rows in combination with a Mapper.
However, since MapReduce counters cannot be reliably retrieved while the job is running, I would need an external counter to keep track of the number of sampled records and stop the job at 1 million.
A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter on the scan and then open a connection to the source table inside each mapper to retrieve the values for the row key.
If there is a simpler, more efficient way, I would be glad to hear about it.