HBase >> mail # user >> Help in designing row key


Flavio Pompermaier 2013-07-02, 16:13
Ted Yu 2013-07-02, 16:25
Flavio Pompermaier 2013-07-02, 17:35
Ted Yu 2013-07-02, 17:53
Flavio Pompermaier 2013-07-03, 08:05
Mike Axiak 2013-07-03, 08:12
Flavio Pompermaier 2013-07-03, 09:14
Re: Help in designing row key
When you build the RK and convert the int parts into byte[] (use
org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
for every int. Be careful about the ordering: when you convert a positive
and a negative integer into byte[] and do a lexicographical compare (as done in
HBase), you will see the negative number sorting greater than the positive one.
If you don't have to deal with negative numbers, no issues :)

Well, when all the parts of the RK are of fixed width, will you need any
separator at all?

-Anoop-
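
A minimal sketch of the ordering issue described above, using
org.apache.hadoop.hbase.util.Bytes; the sign-bit flip shown at the end is a
common workaround and an assumption here, not something prescribed in this
thread:

    import org.apache.hadoop.hbase.util.Bytes;

    public class SignedIntOrdering {
        public static void main(String[] args) {
            byte[] pos = Bytes.toBytes(1);   // 0x00 0x00 0x00 0x01
            byte[] neg = Bytes.toBytes(-1);  // 0xFF 0xFF 0xFF 0xFF

            // HBase compares row keys lexicographically on unsigned bytes,
            // so -1 sorts AFTER 1 (its first byte is 0xFF).
            System.out.println(Bytes.compareTo(neg, pos) > 0);  // true

            // Common workaround: flip the sign bit before encoding, so the
            // whole int range sorts correctly byte-by-byte.
            byte[] posFixed = Bytes.toBytes(1 ^ Integer.MIN_VALUE);
            byte[] negFixed = Bytes.toBytes(-1 ^ Integer.MIN_VALUE);
            System.out.println(Bytes.compareTo(negFixed, posFixed) < 0);  // true
        }
    }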

On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:

> Yeah, I was thinking of using a normalization step in order to allow the use
> of FuzzyRowFilter, but what is not clear to me is whether the integers must
> also be normalized or not.
> I will explain myself better. Suppose that I follow your advice and I
> produce keys like:
>  - 1|1|somehash|sometimestamp
>  - 55|555|somehash|sometimestamp
>
> Would they match the same pattern, or do I have to normalize them to the
> following?
>  - 001|001|somehash|sometimestamp
>  - 055|555|somehash|sometimestamp
>
> Moreover, I noticed that you used dots ('.') to separate things instead of
> pipes ('|')... is there a reason for that (maybe performance or whatever), or
> is it just your favourite separator?
>
> Best,
> Flavio
>
>
> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:
>
> > I'm not sure if you're eliding this fact or not, but you'd be much
> > better off if you used a fixed-width format for your keys. So in your
> > example, you'd have:
> >
> > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > hash.8-byte timestamp
> >
> > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> >
> > The advantage of this is not only that it's significantly less data
> > (remember your key is stored on each KeyValue), but also that you can now
> > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > disadvantage is that you have to normalize the source -> integer mapping,
> > but I find I can either store that in an enum or cache it for a long time,
> > so it's not a big issue.
> >
> > -Mike
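
A minimal sketch of the fixed-width layout Mike describes, and of why it makes
the normalization question above go away: every key has the same byte
positions, so a single FuzzyRowFilter mask lines up for all rows. The class and
method names here are hypothetical; the mask convention (0 = byte must match,
1 = don't care) is FuzzyRowFilter's own.

    import java.util.Collections;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Pair;

    public class FixedWidthKeySketch {

        // source(4 bytes) + type(4 bytes) + hash(16 bytes) + timestamp(8 bytes) = 32 bytes.
        static byte[] makeKey(int source, int type, byte[] hash16, long timestamp) {
            return Bytes.add(
                    Bytes.add(Bytes.toBytes(source), Bytes.toBytes(type)),
                    Bytes.add(hash16, Bytes.toBytes(timestamp)));
        }

        // Scan all rows of a given source and type, whatever the hash and timestamp:
        // the first 8 bytes are fixed (mask 0), the remaining 24 are "don't care" (mask 1).
        static Scan scanBySourceAndType(int source, int type) {
            byte[] pattern = makeKey(source, type, new byte[16], 0L);
            byte[] mask = new byte[32];
            for (int i = 8; i < 32; i++) {
                mask[i] = 1;
            }
            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(
                    Collections.singletonList(new Pair<byte[], byte[]>(pattern, mask))));
            return scan;
        }
    }

With the pipe-delimited string keys from the earlier example, "1|1|..." and
"55|555|..." put the hash at different offsets, so they could not share one
fuzzy pattern without padding the integers to a fixed width.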
> >
> > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:
> > > Thank you very much for the great support!
> > > This is how I thought to design my key:
> > >
> > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > EXAMPLE:
> > > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > >
> > > Do you think my key could be good for my scope (my search will be
> > > essentially by source or source|type)?
> > > Another point is that initially I will not have so many sources, so I will
> > > probably have only google|*, but in later phases there could be more
> > > sources.
> > >
> > > Best,
> > > Flavio
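
A minimal sketch of composing the pipe-delimited pattern above, assuming an MD5
digest for the hash(name) part (the 32-hex-character example hash is
MD5-sized); the method names are hypothetical:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class PipeDelimitedKey {

        // source|type|qualifier|hash(name)|timestamp
        static String makeKey(String source, String type, String qualifier,
                              String name, long timestamp) throws NoSuchAlgorithmException {
            return source + "|" + type + "|" + qualifier + "|" + md5Hex(name) + "|" + timestamp;
        }

        static String md5Hex(String value) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }
    }

With keys built this way, a search by source or by source|type becomes a plain
prefix scan on "google|" or "google|appliance|".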
> > >
> > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > >> For #1, yes - the client receives less data after filtering.
> > >>
> > >> For #2, please take a look at TestMultiVersions
> > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
> > >> for time range:
> > >>
> > >>     scan = new Scan();
> > >>
> > >>     scan.setTimeRange(1000L, Long.MAX_VALUE);
> > >> For row key selection, you need a filter. Take a look at
> > >> FuzzyRowFilter.java
> > >>
> > >> Cheers
> > >>
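
A minimal sketch of the scan Ted points at, combined with a row-key prefix, for
question #2 quoted below (all rows whose key starts with 'someid-' and whose
timestamp is greater than some value); the helper name is hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixAndTimeRangeScan {

        // All rows whose key starts with 'someid-' and whose cells were
        // written after minTimestamp.
        static void scanRecent(HTable table, long minTimestamp) throws IOException {
            byte[] prefix = Bytes.toBytes("someid-");
            Scan scan = new Scan();
            scan.setStartRow(prefix);                 // start scanning at the prefix
            scan.setFilter(new PrefixFilter(prefix)); // drop rows once the prefix no longer matches
            // Time range is [min, max), so min = minTimestamp + 1 means "strictly greater than".
            scan.setTimeRange(minTimestamp + 1, Long.MAX_VALUE);
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    // process each matching row here
                }
            } finally {
                scanner.close();
            }
        }
    }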
> > >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > Thanks for the reply! I thus have two more questions:
> > >> >
> > >> > 1) Is it true that filtering on timestamps doesn't affect performance?
> > >> > 2) Could you send me a little snippet of how you would do such a filter
> > >> > (by row key + timestamp)? For example, get all rows whose key starts with
> > >> > 'someid-' and whose timestamp is greater than some timestamp?
> > >> >
> > >> > Best,
> > >> > Flavio
> > >> >
> > >> >
> > >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >> >
> > >> > > bq. Using timestamp in row-keys is discouraged
James Taylor 2013-07-03, 10:33
Flavio Pompermaier 2013-07-03, 11:25
James Taylor 2013-07-03, 11:42
Flavio Pompermaier 2013-07-03, 10:20
Ted Yu 2013-07-03, 11:35
Asaf Mesika 2013-07-03, 21:23
Flavio Pompermaier 2013-07-04, 09:46