HBase >> mail # user >> Help in designing row key


Re: Help in designing row key
When you build the RK and convert the int parts into byte[] (use
org.apache.hadoop.hbase.util.Bytes#toBytes(int)), you will get 4 bytes
for every int. Be careful about the ordering: when you convert a +ve
and a -ve integer into byte[] and do a lexicographical compare (as done in
HBase), you will see the -ve number sorting greater than the +ve one. If you
don't have to deal with -ve numbers, no issues :)
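
To illustrate the ordering point, here is a minimal sketch that uses only the JDK. It assumes that Bytes#toBytes(int) produces the 4-byte big-endian form of the int, and it stands in for that call with java.nio.ByteBuffer so it runs without HBase on the classpath. The class and method names are made up for this example.

```java
import java.nio.ByteBuffer;

class IntKeyOrdering {
    // Stand-in for Bytes.toBytes(int): 4 bytes, big-endian (assumption stated above).
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Byte-by-byte comparison that treats each byte as unsigned,
    // which is how a lexicographical row key comparison behaves.
    static int compareLexicographically(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] plusOne = toBytes(1);    // 00 00 00 01
        byte[] minusOne = toBytes(-1);  // FF FF FF FF
        System.out.println("bytes per int: " + plusOne.length);
        System.out.println("-1 sorts after +1: "
                + (compareLexicographically(minusOne, plusOne) > 0));
    }
}
```

The sign bit is the highest bit of the first byte, so every negative int starts with a byte of 0x80 or above and lands after all the non-negative ones.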

Well, when all the parts of the RK are of fixed width, will you need any
separator?
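
As a rough sketch of why no separator is needed: with the fixed-width layout Mike describes below (4-byte source, 4-byte type, 128-bit hash, 8-byte timestamp), every part sits at a known offset. The class name, the numeric ids and the zeroed hash are placeholders invented for this example, and it uses java.nio.ByteBuffer rather than the HBase Bytes utility so that it runs on its own.

```java
import java.nio.ByteBuffer;

class FixedWidthRowKey {
    static final int SOURCE_LEN = 4;  // 4-byte int
    static final int TYPE_LEN = 4;    // 4-byte int
    static final int HASH_LEN = 16;   // 128-bit hash
    static final int TS_LEN = 8;      // 8-byte timestamp
    static final int KEY_LEN = SOURCE_LEN + TYPE_LEN + HASH_LEN + TS_LEN;

    // Concatenate the parts with no separator; widths alone define the layout.
    static byte[] build(int source, int type, byte[] hash, long timestamp) {
        if (hash.length != HASH_LEN) {
            throw new IllegalArgumentException("hash must be " + HASH_LEN + " bytes");
        }
        return ByteBuffer.allocate(KEY_LEN)
                .putInt(source)
                .putInt(type)
                .put(hash)
                .putLong(timestamp)
                .array();
    }

    // Each part is read back from its fixed offset.
    static int source(byte[] key) {
        return ByteBuffer.wrap(key).getInt(0);
    }

    static int type(byte[] key) {
        return ByteBuffer.wrap(key).getInt(SOURCE_LEN);
    }

    static long timestamp(byte[] key) {
        return ByteBuffer.wrap(key).getLong(SOURCE_LEN + TYPE_LEN + HASH_LEN);
    }

    public static void main(String[] args) {
        byte[] hash = new byte[HASH_LEN];  // placeholder, not a real hash
        byte[] key = build(1, 0x0203, hash, 1372837702753L);
        System.out.println("key length: " + key.length);
        System.out.println("source=" + source(key) + " type=" + type(key)
                + " timestamp=" + timestamp(key));
    }
}
```

The first eight bytes of this key are \x00\x00\x00\x01\x00\x00\x02\x03, the same prefix as in Mike's example.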

-Anoop-

On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:

> Yeah, I was thinking to use a normalization step in order to allow the use
> of FuzzyRowFilter but what is not clear to me is if integers must also be
> normalized or not.
> I will explain myself better. Suppose that I follow your advice and I
> produce keys like:
>  - 1|1|somehash|sometimestamp
>  - 55|555|somehash|sometimestamp
>
> Would they match the same pattern or do I have to normalize them to the
> following?
>  - 001|001|somehash|sometimestamp
>  - 055|555|somehash|sometimestamp
>
> Moreover, I noticed that you used dots ('.') to separate things instead of
> pipes ('|'). Is there a reason for that (maybe performance or whatever), or
> is it just your favourite separator?
>
> Best,
> Flavio
>
>
> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:
>
> > I'm not sure if you're eliding this fact or not, but you'd be much
> > better off if you used a fixed-width format for your keys. So in your
> > example, you'd have:
> >
> > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > hash.8-byte timestamp
> >
> > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> >
> > The advantage of this is not only that it's significantly less data
> > (remember your key is stored on each KeyValue), but also you can now
> > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > disadvantage is that you have to normalize the source-> integer but I
> > find I can either store that in an enum or cache it for a long time so
> > it's not a big issue.
> >
> > -Mike
> >
> > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:
> > > Thank you very much for the great support!
> > > This is how I thought to design my key:
> > >
> > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > EXAMPLE:
> > > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > >
> > > Do you think my key could be good for my scope (my search will be
> > > essentially by source or source|type)?
> > > Another point is that initially I will not have so many sources, so I will
> > > probably have only google|* but in the next phases there could be more
> > > sources.
> > >
> > > Best,
> > > Flavio
> > >
> > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > >> For #1, yes - the client receives less data after filtering.
> > >>
> > >> For #2, please take a look at TestMultiVersions
> > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
> > >> for time range:
> > >>
> > >>     scan = new Scan();
> > >>     scan.setTimeRange(1000L, Long.MAX_VALUE);
> > >>
> > >> For row key selection, you need a filter. Take a look at
> > >> FuzzyRowFilter.java
> > >>
> > >> Cheers
> > >>
> > >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > Thanks for the reply! I thus have two more questions:
> > >> >
> > >> > 1) Is it true that filtering on timestamps doesn't affect performance?
> > >> > 2) Could you send me a little snippet of how you would do such a filter
> > >> > (by row key + timestamps)? For example, get all rows whose key starts
> > >> > with 'someid-' and whose timestamp is greater than some timestamp?
> > >> >
> > >> > Best,
> > >> > Flavio
> > >> >
> > >> >
> > >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >> >
> > >> > > bq. Using timestamp in row-keys is discouraged