HBase user mailing list: Help in designing row key


Thread (collapsed messages):
  Flavio Pompermaier   2013-07-02, 16:13
  Ted Yu               2013-07-02, 16:25
  Flavio Pompermaier   2013-07-02, 17:35
  Ted Yu               2013-07-02, 17:53
  Flavio Pompermaier   2013-07-03, 08:05
  Mike Axiak           2013-07-03, 08:12
  Flavio Pompermaier   2013-07-03, 09:14
  Anoop John           2013-07-03, 09:58
  James Taylor         2013-07-03, 10:33
  Flavio Pompermaier   2013-07-03, 11:25
Re: Help in designing row key
Sure, but FYI Phoenix is not just faster, but much easier as well (as
this email chain shows).

On 07/03/2013 04:25 AM, Flavio Pompermaier wrote:
> No, I've never seen Phoenix, but it looks like a very useful project!
> However, I don't have such strict performance requirements in my use case;
> I just want the regions to be as balanced as possible.
> So I think that in this case I will still use Bytes concatenation, if
> someone can confirm that I'm doing it the right way.
>
>
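A minimal sketch of the fixed-width Bytes concatenation Flavio describes, using the HBase client API. The field widths (4-byte source, 4-byte type, 16-byte hash, 8-byte timestamp) are illustrative assumptions, not something fixed in the thread:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyBuilder {
        // Concatenate fixed-width parts; with fixed widths no separator is
        // needed and every field sits at a known byte offset.
        public static byte[] buildKey(int source, int type, byte[] hash128, long ts) {
            // 4 + 4 + 16 + 8 = 32 bytes total
            return Bytes.add(
                Bytes.add(Bytes.toBytes(source), Bytes.toBytes(type)),
                Bytes.add(hash128, Bytes.toBytes(ts)));
        }
    }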
> On Wed, Jul 3, 2013 at 12:33 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>
>> Hi Flavio,
>> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
>> It will allow you to model your multi-part row key like this:
>>
>> CREATE TABLE flavio.analytics (
>>      source INTEGER,
>>      type INTEGER,
>>      qual VARCHAR,
>>      hash VARCHAR,
>>      ts DATE,
>>      CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) -- defines
>> the columns that make up the row key
>> )
>>
>> Then you can issue SQL queries like this (to query for the last 7 days
>> worth of data):
>> SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN (55,66)
>> AND ts > CURRENT_DATE() - 7
>>
>> This will internally take advantage of our SkipScan (
>> http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
>> to jump through your key space similar to FuzzyRowFilter, but in parallel
>> from the client, taking into account your region boundaries.
>>
>> Or do more complex GROUP BY queries like this (to aggregate over the last
>> 30 days worth of data, bucketized by day):
>> SELECT type,COUNT(*) FROM flavio.analytics WHERE ts > CURRENT_DATE() - 30
>> GROUP BY type,TRUNCATE(ts,'DAY')
>>
>> No need to worry about lexicographical sort order, flipping sign bits,
>> normalizing/padding integer values, and all the other nuances of working
>> with an API that works at the level of bytes. No need to write and manage
>> installation of your own coprocessors to make aggregation efficient,
>> perform topN queries, etc.
>>
>> HTH.
>>
>> Regards,
>> James
>> @JamesPlusPlus
>>
>>
>> On 07/03/2013 02:58 AM, Anoop John wrote:
>>
>>> When you make the RK and convert the int parts into byte[] (use
>>> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
>>> for every int. Be careful about the ordering: when you convert a +ve
>>> and a -ve integer into byte[] and do a lexicographical compare (as done
>>> in HBase), you will see the -ve number sorting greater than the +ve. If
>>> you don't have to deal with -ve numbers, no issues :)
>>>
>>> Also, when all the parts of the RK are of fixed width, will you need any
>>> separator at all?
>>>
>>> -Anoop-
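Anoop's ordering caveat can be worked around by flipping the sign bit before encoding; this is a common trick rather than anything prescribed in the thread, sketched here:

    import org.apache.hadoop.hbase.util.Bytes;

    public class OrderedInt {
        // Bytes.toBytes(int) emits two's-complement big-endian bytes, so
        // negative values start with 0x80-0xFF and sort after positives.
        // XOR-ing the sign bit maps signed int order onto byte order.
        public static byte[] toOrderedBytes(int value) {
            return Bytes.toBytes(value ^ Integer.MIN_VALUE);
        }

        public static int fromOrderedBytes(byte[] bytes) {
            return Bytes.toInt(bytes) ^ Integer.MIN_VALUE;
        }
    }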
>>>
>>> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:
>>>
>>>> Yeah, I was thinking of using a normalization step in order to allow the
>>>> use of FuzzyRowFilter, but what is not clear to me is whether integers
>>>> must also be normalized.
>>>> I will explain myself better. Suppose that I follow your advice and I
>>>> produce keys like:
>>>>    - 1|1|somehash|sometimestamp
>>>>    - 55|555|somehash|sometimestamp
>>>>
>>>> Would they match the same pattern, or do I have to normalize them to the
>>>> following?
>>>>    - 001|001|somehash|sometimestamp
>>>>    - 055|555|somehash|sometimestamp
>>>>
>>>> Moreover, I noticed that you used dots ('.') to separate things instead
>>>> of pipes ('|'). Is there a reason for that (performance, maybe?) or is it
>>>> just your favourite separator?
>>>>
>>>> Best,
>>>> Flavio
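Flavio's padding question is exactly why fixed widths matter for FuzzyRowFilter: its mask is positional (0 = byte must match, 1 = byte may vary), so every key must keep each field at the same offset. A sketch under the same assumed 4+4+16+8 byte layout as above:

    import java.util.Arrays;
    import java.util.Collections;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Pair;

    public class FuzzyScanExample {
        // Scan for one 'type' regardless of source, hash, or timestamp.
        public static Scan scanForType(int type) {
            byte[] key = new byte[32];                         // 4+4+16+8 layout
            Bytes.putBytes(key, 4, Bytes.toBytes(type), 0, 4); // fill bytes 4-7
            byte[] mask = new byte[32];
            Arrays.fill(mask, (byte) 1);                       // 1 = may vary
            Arrays.fill(mask, 4, 8, (byte) 0);                 // 0 = must match
            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(
                Collections.singletonList(Pair.newPair(key, mask))));
            return scan;
        }
    }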
>>>>
>>>>
>>>> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I'm not sure if you're eliding this fact or not, but you'd be much
>>>>> better off if you used a fixed-width format for your keys. So in your
>>>>> example, you'd have:
>>>>>
>>>>> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
Later messages in the thread (collapsed):
  Flavio Pompermaier   2013-07-03, 10:20
  Ted Yu               2013-07-03, 11:35
  Asaf Mesika          2013-07-03, 21:23
  Flavio Pompermaier   2013-07-04, 09:46