HBase >> mail # user >> Help in designing row key


Flavio Pompermaier 2013-07-02, 16:13
Ted Yu 2013-07-02, 16:25
Flavio Pompermaier 2013-07-02, 17:35
Ted Yu 2013-07-02, 17:53
Flavio Pompermaier 2013-07-03, 08:05
Mike Axiak 2013-07-03, 08:12
Flavio Pompermaier 2013-07-03, 09:14
Anoop John 2013-07-03, 09:58
James Taylor 2013-07-03, 10:33
Flavio Pompermaier 2013-07-03, 11:25
Re: Help in designing row key
Sure, but FYI Phoenix is not just faster, but much easier as well (as
this email chain shows).

On 07/03/2013 04:25 AM, Flavio Pompermaier wrote:
> No, I've never seen Phoenix, but it looks like a very useful project!
> However I don't have such strict performance issues in my use case, I just
> want to have balanced regions as much as possible.
> So I think that in this case I will still use Bytes concatenation if
> someone confirms I'm doing it in the right way.
>
>
> On Wed, Jul 3, 2013 at 12:33 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>
>> Hi Flavio,
>> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
>> It will allow you to model your multi-part row key like this:
>>
>> CREATE TABLE flavio.analytics (
>>      source INTEGER,
>>      type INTEGER,
>>      qual VARCHAR,
>>      hash VARCHAR,
>>      ts DATE
>>      // Defines columns that make up the row key
>>      CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts)
>> )
>>
>> Then you can issue SQL queries like this (to query for the last 7 days
>> worth of data):
>> SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN (55,66)
>> AND ts > CURRENT_DATE() - 7
>>
>> This will internally take advantage of our SkipScan
>> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
>> to jump through your key space similar to FuzzyRowFilter, but in parallel
>> from the client taking into account your region boundaries.
>>
>> Or do more complex GROUP BY queries like this (to aggregate over the last
>> 30 days worth of data, bucketized by day):
>> SELECT type,COUNT(*) FROM flavio.analytics WHERE ts > CURRENT_DATE() - 30
>> GROUP BY type,TRUNCATE(ts,'DAY')
>>
>> No need to worry about lexicographical sort order, flipping sign bits,
>> normalizing/padding integer values, and all the other nuances of working
>> with an API that works at the level of bytes. No need to write and manage
>> installation of your own coprocessors to make aggregation efficient,
>> perform topN queries, etc.
>>
>> HTH.
>>
>> Regards,
>> James
>> @JamesPlusPlus
>>
>>
>> On 07/03/2013 02:58 AM, Anoop John wrote:
>>
>>> When you make the RK and convert the int parts into byte[] (use
>>> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
>>> for every int. Be careful about the ordering: when u convert a +ve
>>> and a -ve integer into byte[] and do a lexicographical compare (as done in
>>> HBase), u will see the -ve number being greater than the +ve one. If you
>>> don't have to deal with -ve numbers, no issues :)
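
The ordering pitfall described above can be reproduced without a cluster. The following is a minimal Python sketch, assuming that `Bytes#toBytes(int)` yields the 4-byte big-endian two's-complement form; the helper names `int_to_bytes` and `int_to_sortable_bytes` are hypothetical and are not part of HBase. Flipping the sign bit is one common way to make byte order agree with numeric order.

```python
import struct


def int_to_bytes(value: int) -> bytes:
    """Encode a signed 32-bit int as 4 big-endian bytes (two's complement)."""
    return struct.pack(">i", value)


def int_to_sortable_bytes(value: int) -> bytes:
    """Flip the sign bit so that unsigned byte order matches numeric order."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


values = [5, -3, 0, 70000, -70000]

# Python compares bytes objects lexicographically as unsigned bytes,
# which is the same comparison HBase applies to row keys.
naive_order = sorted(values, key=int_to_bytes)
flipped_order = sorted(values, key=int_to_sortable_bytes)

print(naive_order)    # negatives end up after the positives
print(flipped_order)  # plain numeric order
```

With the naive encoding the negative values sort last because their first byte is 0xFF; with the sign bit flipped the byte order and the numeric order agree.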
>>>
>>> Well, when all the parts of the RK are of fixed width, will u need any
>>> separator??
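
To illustrate why fixed-width parts make a separator unnecessary, here is a small Python sketch. The layout (4-byte source, 4-byte type, 16-byte hash, 8-byte timestamp), the use of MD5 to get a 128-bit hash, and the names `KEY_FORMAT`, `make_row_key` and `parse_row_key` are all assumptions made for illustration; the thread does not specify them.

```python
import hashlib
import struct

# Assumed layout: source (4 bytes), type (4 bytes), hash (16 bytes),
# timestamp in milliseconds (8 bytes), all big-endian, no padding.
KEY_FORMAT = ">ii16sq"
KEY_LENGTH = struct.calcsize(KEY_FORMAT)  # 4 + 4 + 16 + 8 = 32


def make_row_key(source: int, type_id: int, payload: bytes, ts_millis: int) -> bytes:
    """Concatenate the fixed-width parts into one row key, with no separators."""
    digest = hashlib.md5(payload).digest()  # 128-bit hash, an assumption
    return struct.pack(KEY_FORMAT, source, type_id, digest, ts_millis)


def parse_row_key(key: bytes):
    """Split a row key back into its parts purely by byte offset."""
    return struct.unpack(KEY_FORMAT, key)


key = make_row_key(1, 55, b"some-qualifier", 1372838400000)
source, type_id, digest, ts_millis = parse_row_key(key)
print(len(key), source, type_id, ts_millis)
```

Because every part sits at a known offset, the key can be split without scanning for a delimiter, and a hash byte that happens to equal `|` or `.` cannot be mistaken for one. The sketch assumes non-negative integers; negative values would need the sign handling discussed earlier in the thread.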
>>>
>>> -Anoop-
>>>
>>> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <[EMAIL PROTECTED]> wrote:
>>>
>>>> Yeah, I was thinking of using a normalization step in order to allow the use
>>>> of FuzzyRowFilter, but what is not clear to me is whether integers must also be
>>>> normalized or not.
>>>> Let me explain myself better. Suppose that I follow your advice and
>>>> produce keys like:
>>>>    - 1|1|somehash|sometimestamp
>>>>    - 55|555|somehash|sometimestamp
>>>>
>>>> Would they match the same pattern or do I have to normalize them to the
>>>> following?
>>>>    - 001|001|somehash|sometimestamp
>>>>    - 055|555|somehash|sometimestamp
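
The question above can be explored with a simplified stand-in for positional fuzzy matching. This Python sketch is not the HBase FuzzyRowFilter implementation; it only assumes the same mask convention (0 means the byte must equal the pattern, 1 means any byte is accepted), and `fuzzy_match` is a hypothetical helper.

```python
def fuzzy_match(row_key: bytes, pattern: bytes, mask: bytes) -> bool:
    """Return True if row_key agrees with pattern at every position where mask is 0."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    if len(row_key) < len(pattern):
        return False
    return all(m == 1 or k == p for k, p, m in zip(row_key, pattern, mask))


# Padded keys: "any 3-byte source, then type 001" is a single positional pattern.
padded_pattern = b"???|001"
padded_mask = bytes([1, 1, 1, 0, 0, 0, 0])
print(fuzzy_match(b"001|001|somehash|sometimestamp", padded_pattern, padded_mask))
print(fuzzy_match(b"055|555|somehash|sometimestamp", padded_pattern, padded_mask))

# Unpadded keys: the type field starts at a different offset depending on the
# width of the source, so a pattern written for a 1-digit source misses a
# 2-digit source even though the type is the same.
unpadded_pattern = b"?|1"
unpadded_mask = bytes([1, 0, 0])
print(fuzzy_match(b"1|1|somehash|sometimestamp", unpadded_pattern, unpadded_mask))
print(fuzzy_match(b"55|1|somehash|sometimestamp", unpadded_pattern, unpadded_mask))

# Padding also keeps lexicographic order in line with numeric order.
print(sorted([b"9", b"55", b"555"]))
print(sorted([b"009", b"055", b"555"]))
```

Under these assumptions, unpadded decimal strings would need one pattern per possible field width, while zero-padded (or fixed-width binary) fields need only one.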
>>>>
>>>> Moreover, I noticed that you used dots ('.') to separate things instead of
>>>> pipes ('|'). Is there a reason for that (maybe performance or whatever), or
>>>> is it just your favourite separator?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>>
>>>> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I'm not sure if you're eliding this fact or not, but you'd be much
>>>>> better off if you used a fixed-width format for your keys. So in your
>>>>> example, you'd have:
>>>>>
>>>>> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
Flavio Pompermaier 2013-07-03, 10:20
Ted Yu 2013-07-03, 11:35
Asaf Mesika 2013-07-03, 21:23
Flavio Pompermaier 2013-07-04, 09:46