Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - multiple partial scans in the row

Copy link to this message
Re: multiple partial scans in the row
Ian Varley 2012-02-14, 18:01

Are your orderIds ordered? You say "a range of orderIds", which implies that (i.e. they're sequential numbers like 001, 002, etc, not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix of the row key that's just the length of the orderid).

Another question: are the time ranges you're scanning a big or small proportion of all the rows for each order id? If you generally expect to return a majority of the rows per each order, then a single scan (starting with the lowest orderId, and proceeding to the highest) is possibly still a good fit. You can also apply timestamp filters (which enables an optimization to exclude storefiles that couldn't possibly contain values in that timestamp range); that only works if the timestamps on your cells match the timestamp in the row key.

Alternately, if you expect to return only a small portion of the records (i.e. you keep a lot of items with a wide range of timestamps in each orderId, but you only want to retrieve a small set of them), you might want to do one scan per orderId. You can choose how much parallelism to put into it by controlling that yourself (i.e. use a thread per scan on the client side); you could theoretically do a thread per order id, but of course, if you have a very large number of them, that could be harmful.

A regular expression doesn't get you past the fundamental requirement, which is that at the server side, it has to look at every row (excepting special optimizations like the timestamp one I mentioned above).

Your best bet is to implement it a couple ways, with real data, and see which ones seem to work the fastest.


On Feb 14, 2012, at 11:45 AM, James Young wrote:

Hi there,

I am pretty new to HBase and i am trying to understand the best
practice to do the scan based on two/multiple partial scans for the
row key.

For example, I have a row key like:  orderId-timeStamp-item. The
orderId has nothing to with the timeStamp and i have a requirement to
scan rows for certain orderIds ( a range of orderIds)  within certain
time period.    I am not sure if it is possible  to perform two
partial scan: one is for orderId and another one is for the timeStamp.

Also, doing regular expression on the row key might work out.  But it
is more expensive. so I am wondering what would be the best practice
for solving such a problem.
Thanks in advance,