HBase, mail # user - Read access pattern


Re: Read access pattern
Asaf Mesika 2013-04-30, 05:49
A couple of rough implementation thoughts:

1. Change the schema
Move the timestamps inside the row. The rowkey is hash(objectid), and the
column qualifier is Long.MAX_VALUE - changeDate.getTime(). You can even
store it using Bytes.toBytes(ts) to save space: it will always be 8 bytes,
instead of the longer string of digits.
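
To make the qualifier encoding concrete, here is a minimal sketch in plain Java,
with no HBase dependency. It assumes the qualifier is the reversed timestamp
written as 8 big-endian bytes, which is the layout Bytes.toBytes(long) produces.
The class and method names are hypothetical, and Arrays.compareUnsigned (Java 9+)
stands in for the unsigned byte-wise comparison used to order qualifiers.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: not part of HBase, only illustrates the encoding.
final class ReversedTimestampQualifier {

    // Encode Long.MAX_VALUE - ts as 8 big-endian bytes.
    static byte[] encode(long epochMillis) {
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(Long.MAX_VALUE - epochMillis)
                .array();
    }

    // Recover the original timestamp from the 8-byte qualifier.
    static long decode(byte[] qualifier) {
        return Long.MAX_VALUE - ByteBuffer.wrap(qualifier).getLong();
    }

    // Unsigned lexicographic comparison of two qualifiers.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[] older = encode(1_000L);
        byte[] newer = encode(2_000L);
        System.out.println("qualifier length: " + newer.length);
        System.out.println("newer sorts first: " + (compare(newer, older) < 0));
    }
}
```

Because the value is reversed, the newest change sorts first within the row, so
the entry just before a given change is the next column in sort order.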

This lets you "view" all the timestamps related to a single objectid in
one place. The problem with placing the TS in the rowkey is that the
entries are all over the place, spread across regions, so it's harder to
answer "which entry comes just before this one" (indexing) without paying
a penalty on insertion for keeping an index up to date.

I have two ideas: one makes the read expensive, the other makes the write expensive.

Expensive read:
When you write, you write two columns for that row: one named
i_[Rounded-to-the-hour-timestamp] with value of 1 (dummy value), indicating
you have timestamps with this hour, and the other is your original column
named ts_[timestamp].
You can implement a Filter which, upon arriving at the required row, first
reads all the "hour" columns so it can find out where to jump in the
ts_[timestamp] columns. Upon arriving at the hour column matching the
timestamp you are looking for, you know which hour came before it, so you
can jump to it (using the hint method in the Filter interface). The read
is expensive since, in the worst case, you need to read all the
i_[Rounded-to-the-hour-timestamp] columns. You could relax this by only
looking at the 24 hours before the original column's hour, which reduces
the worst case to 24 reads.
The write is cheap, the read is not.
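
The lookup logic can be sketched in plain Java as follows. This is not an HBase
Filter: two in-memory sorted sets stand in for the i_[hour] and ts_[timestamp]
columns of one row, and all names (HourMarkerIndex, maxHoursBack, and so on) are
assumptions made for illustration.

```java
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model of one row: hour marker columns plus ts columns.
final class HourMarkerIndex {
    static final long HOUR_MS = TimeUnit.HOURS.toMillis(1);

    private final NavigableSet<Long> hourMarkers = new TreeSet<>(); // i_[hour]
    private final NavigableSet<Long> timestamps = new TreeSet<>();  // ts_[timestamp]

    static long roundToHour(long ts) {
        return ts - Math.floorMod(ts, HOUR_MS);
    }

    // Cheap write: two blind puts, no read needed.
    void write(long ts) {
        hourMarkers.add(roundToHour(ts));
        timestamps.add(ts);
    }

    // Largest stored timestamp strictly before ts, looking back at most
    // maxHoursBack hours; returns null if none is found in that window.
    Long predecessor(long ts, int maxHoursBack) {
        long targetHour = roundToHour(ts);
        long oldestHour = targetHour - maxHoursBack * HOUR_MS;
        // Walk the hour markers backwards, starting at the target's own hour.
        for (long hour : hourMarkers.headSet(targetHour, true).descendingSet()) {
            if (hour < oldestHour) {
                break; // outside the relaxed window
            }
            // "Jump" to this hour and look only at the ts columns inside it.
            long upper = Math.min(hour + HOUR_MS, ts);
            NavigableSet<Long> inHour = timestamps.subSet(hour, true, upper, false);
            if (!inHour.isEmpty()) {
                return inHour.last();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        HourMarkerIndex row = new HourMarkerIndex();
        row.write(36_300_000L); // 10:05
        row.write(39_000_000L); // 10:50
        row.write(47_400_000L); // 13:10
        System.out.println(row.predecessor(47_400_000L, 24));
    }
}
```

The maxHoursBack parameter corresponds to the 24-hour relaxation: it bounds how
many hour markers are examined before the lookup gives up.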

Expensive write:
You can keep a column named i, which holds an encoded index of the hours.
When you read, you find the correct preceding hour by searching it in
O(log n), and then jump to the ts_[timestamp] column.
The write is expensive, since you need to read-modify-write this column
for each timestamp you write. The read is fairly cheap.
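
One possible encoding for the i column, sketched in plain Java: a sorted array
of hour values packed as 8 bytes each. The encoding, class name and method
names are assumptions, not something prescribed above. In a real table the
read-modify-write would also have to be made atomic, which this sketch ignores.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical encoding of the "i" column: sorted hours, 8 bytes each.
final class EncodedHourIndex {

    static long[] decode(byte[] cell) {
        long[] hours = new long[cell.length / Long.BYTES];
        ByteBuffer.wrap(cell).asLongBuffer().get(hours);
        return hours;
    }

    static byte[] encode(long[] hours) {
        ByteBuffer buf = ByteBuffer.allocate(hours.length * Long.BYTES);
        buf.asLongBuffer().put(hours);
        return buf.array();
    }

    // Expensive write: read the cell, insert the hour in sorted position,
    // and write the whole cell back.
    static byte[] addHour(byte[] cell, long hour) {
        long[] hours = decode(cell);
        int pos = Arrays.binarySearch(hours, hour);
        if (pos >= 0) {
            return cell; // hour already indexed
        }
        int insertAt = -pos - 1;
        long[] updated = new long[hours.length + 1];
        System.arraycopy(hours, 0, updated, 0, insertAt);
        updated[insertAt] = hour;
        System.arraycopy(hours, insertAt, updated, insertAt + 1, hours.length - insertAt);
        return encode(updated);
    }

    // Cheap read: binary search for the largest indexed hour strictly
    // before the given hour; returns -1 if there is none.
    static long hourBefore(byte[] cell, long hour) {
        long[] hours = decode(cell);
        int pos = Arrays.binarySearch(hours, hour);
        int insertAt = pos >= 0 ? pos : -pos - 1;
        return insertAt == 0 ? -1L : hours[insertAt - 1];
    }

    public static void main(String[] args) {
        byte[] cell = new byte[0];
        for (long hour : new long[] {3L, 1L, 2L}) {
            cell = addHour(cell, hour);
        }
        System.out.println(Arrays.toString(decode(cell)));
        System.out.println(hourBefore(cell, 3L));
    }
}
```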

2.
I thought I had another option using a RegionObserver and an
Endpoint coprocessor, but the biggest problem is that the predecessor
timestamp may be on another region server. The first idea is more
implementable :)

On Mon, Apr 29, 2013 at 8:05 PM, <[EMAIL PROTECTED]> wrote:

>
> Thanx for the quick answer.
>
> > For the next key, I think you can simply use your current key as your
> > scanner first key. You will then find the one which is just after.
> > Then you will have to verify the MD5 hash to make sure it's still for
> > the same object.
> Right, this is basically easy.
>
> > First, if you know that you are storing data about every 10 seconds,
> > set the startRow with something like
> > getMD5AsHex(Bytes.toBytes(myObjectId)) + String.format("%19d\n",
> > (Long.MAX_VALUE - (changeDate.getTime() - 60000))) then read the few
> > lines you will have until you find your current line, and keep the
> > last one.
>
> Actually it is impossible to know the timerange for which there will be a
> next entry
>
> >
> > Else, if you don't know, you will have to start with
> > scan.setStartRow(getMD5AsHex(Bytes.toBytes(myObjectId))); but you
> > might have to skip MANY lines before finding the right one. So I don't
> > really recommend that.
>
> Ouch, obviously not very efficient. I assume that holds even with a filter?
> > Message of 29/04/13 18:18
> > From: "Jean-Marc Spaggiari"
> > To: [EMAIL PROTECTED]
> > Cc:
> > Subject: Re: Read access pattern
> >
> > Hum.
> >
> > For the next key, I think you can simply use your current key as your
> > scanner first key. You will then find the one which is just after.
> > Then you will have to verify the MD5 hash to make sure it's still for
> > the same object.
> >
> > scan.setStartRow(getMD5AsHex(Bytes.toBytes(myObjectId)) +
> > String.format("%19d\n", (Long.MAX_VALUE - changeDate.getTime())));
> >
> > If you want to find the one just before, quickly, I see 2 options.
> >
> > First, if you know that you are storing data about every 10 seconds,