Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> multiple partial scans in the row

Copy link to this message
Re: multiple partial scans in the row

Are your orderIds ordered? You say "a range of orderIds", which implies that (i.e. they're sequential numbers like 001, 002, etc, not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix of the row key that's just the length of the orderid).

Another question: are the time ranges you're scanning a big or small proportion of all the rows for each order id? If you generally expect to return a majority of the rows per each order, then a single scan (starting with the lowest orderId, and proceeding to the highest) is possibly still a good fit. You can also apply timestamp filters (which enables an optimization to exclude storefiles that couldn't possibly contain values in that timestamp range); that only works if the timestamps on your cells match the timestamp in the row key.

Alternately, if you expect to return only a small portion of the records (i.e. you keep a lot of items with a wide range of timestamps in each orderId, but you only want to retrieve a small set of them), you might want to do one scan per orderId. You can choose how much parallelism to put into it by controlling that yourself (i.e. use a thread per scan on the client side); you could theoretically do a thread per order id, but of course, if you have a very large number of them, that could be harmful.

A regular expression doesn't get you past the fundamental requirement, which is that at the server side, it has to look at every row (excepting special optimizations like the timestamp one I mentioned above).

Your best bet is to implement it a couple ways, with real data, and see which ones seem to work the fastest.


On Feb 14, 2012, at 11:45 AM, James Young wrote:

Hi there,

I am pretty new to HBase and i am trying to understand the best
practice to do the scan based on two/multiple partial scans for the
row key.

For example, I have a row key like:  orderId-timeStamp-item. The
orderId has nothing to with the timeStamp and i have a requirement to
scan rows for certain orderIds ( a range of orderIds)  within certain
time period.    I am not sure if it is possible  to perform two
partial scan: one is for orderId and another one is for the timeStamp.

Also, doing regular expression on the row key might work out.  But it
is more expensive. so I am wondering what would be the best practice
for solving such a problem.
Thanks in advance,