Re: multiple partial scans in the row
Ian Varley 2012-02-14, 18:01
Are your orderIds ordered? You say "a range of orderIds", which implies that they are (i.e. they're sequential numbers like 001, 002, etc., not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix of the row key that's just the length of the orderId).
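As a rough sketch of the start/stop-row idea: assuming row keys are plain strings shaped like orderId-timeStamp-item and orderIds are zero-padded to a fixed width (both assumptions, as are the class and method names, which are not part of HBase), the two boundary keys for a contiguous orderId range could be built like this. The resulting byte arrays are what you would give the scan as its start row (inclusive) and stop row (exclusive).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: builds scan boundaries for a
// contiguous range of orderIds in keys shaped like "orderId-timeStamp-item".
class OrderRangeKeys {

    // Inclusive start: the first possible key for the lowest orderId.
    static byte[] startRow(String lowOrderId) {
        return (lowOrderId + "-").getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop: the highest orderId's prefix with its last byte bumped
    // by one ('-' becomes '.'), so every key starting with "highOrderId-"
    // sorts before it.
    static byte[] stopRow(String highOrderId) {
        byte[] stop = (highOrderId + "-").getBytes(StandardCharsets.UTF_8);
        stop[stop.length - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        // Boundaries for orderIds 001 through 005.
        System.out.println(new String(startRow("001"), StandardCharsets.UTF_8)); // 001-
        System.out.println(new String(stopRow("005"), StandardCharsets.UTF_8));  // 005.
    }
}
```

Including the delimiter in the prefix keeps an orderId such as 0051 (if widths were not fixed) from being swept into the range for 005.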
Another question: are the time ranges you're scanning a big or small proportion of all the rows for each orderId? If you generally expect to return a majority of the rows for each order, then a single scan (starting with the lowest orderId, and proceeding to the highest) is probably still a good fit. You can also apply timestamp filters (which enable an optimization to exclude storefiles that couldn't possibly contain values in that timestamp range); that only works if the timestamps on your cells match the timestamp in the row key.
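To make the single-scan trade-off concrete, here is a small in-memory sketch. A sorted set stands in for the table, so this does not use the HBase client API, and all names are made up. It assumes fixed-width orderIds and timestamps so that string order matches numeric order. The scan walks every row between the lowest and highest orderId and keeps only the rows whose timestamp falls in the window, which is a reasonable cost when most rows qualify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// In-memory stand-in for a single range scan; TreeSet keeps keys sorted
// the way row keys are sorted in the table.
class SingleRangeScan {

    // Visits every key from lowOrderId through highOrderId and keeps those
    // whose timestamp component lies in [fromTs, toTs).
    static List<String> scan(TreeSet<String> table, String lowOrderId, String highOrderId,
                             String fromTs, String toTs) {
        List<String> hits = new ArrayList<>();
        // subSet is start-inclusive, stop-exclusive; '.' sorts just after '-'.
        for (String rowKey : table.subSet(lowOrderId + "-", highOrderId + ".")) {
            // Key layout assumed to be orderId-timeStamp-item.
            String ts = rowKey.split("-", 3)[1];
            if (ts.compareTo(fromTs) >= 0 && ts.compareTo(toTs) < 0) {
                hits.add(rowKey);
            }
        }
        return hits;
    }
}
```

Every row in the orderId range is examined even when it is discarded, which is why this approach pays off mainly when the time window covers most of each order's rows.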
Alternatively, if you expect to return only a small portion of the records (i.e. you keep a lot of items with a wide range of timestamps in each orderId, but you only want to retrieve a small set of them), you might want to do one scan per orderId. You can choose how much parallelism to put into it by controlling that yourself (i.e. use a thread per scan on the client side); you could theoretically do a thread per orderId, but of course, if you have a very large number of them, that could be harmful.
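A minimal sketch of the one-scan-per-orderId approach with bounded client-side parallelism follows. Again a sorted set stands in for the table rather than the HBase client API, the names are hypothetical, and fixed-width timestamps are assumed. Each task covers only the keys between orderId-fromTs and orderId-toTs, and a fixed-size pool caps how many run at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerOrderScans {

    // Runs one narrow range read per orderId on at most maxThreads threads.
    // The table is only read here, so sharing it across threads is safe.
    static List<String> scanAll(TreeSet<String> table, List<String> orderIds,
                                String fromTs, String toTs, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String orderId : orderIds) {
                String start = orderId + "-" + fromTs; // inclusive
                String stop = orderId + "-" + toTs;    // exclusive
                Callable<List<String>> task = () -> new ArrayList<>(table.subSet(start, stop));
                pending.add(pool.submit(task));
            }
            // Collect results in the same order the orderIds were given.
            List<String> hits = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                hits.addAll(f.get());
            }
            return hits;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the pool size is fixed, a long list of orderIds queues up behind maxThreads workers instead of creating one thread per order.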
A regular expression doesn't get you past the fundamental requirement, which is that at the server side, it has to look at every row (excepting special optimizations like the timestamp one I mentioned above).
Your best bet is to implement it a couple of ways, with real data, and see which one works the fastest.
On Feb 14, 2012, at 11:45 AM, James Young wrote:
I am pretty new to HBase and I am trying to understand the best
practice to do the scan based on two/multiple partial scans for the
For example, I have a row key like: orderId-timeStamp-item. The
orderId has nothing to do with the timeStamp and I have a requirement to
scan rows for certain orderIds (a range of orderIds) within a certain
time period. I am not sure if it is possible to perform two
partial scans: one for the orderId and another for the timeStamp.
Also, using a regular expression on the row key might work, but it
is more expensive, so I am wondering what would be the best practice
for solving such a problem.
Thanks in advance,