HBase, mail # user - Tables vs CFs vs Cs


Jean-Marc Spaggiari 2013-01-27, 16:51
Re: Tables vs CFs vs Cs
lars hofhansl 2013-01-27, 17:28
I might be missing something. Why not just have a counter per IP and then aggregate at read time?
If you wanted the total of the 058 group you'd start a scanner with "058." as start row and "058/" as stop row ("/" is the byte that follows ".", so the range covers every key starting with "058."). On the client you sum up the counter values.
Similarly for the 109.169 group: start with "109.169." and stop at "109.169/".
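
As a rough illustration of the read-time aggregation described above, here is a minimal Python sketch. It does not talk to HBase: a sorted list of (key, counter) tuples stands in for a table scanned in row-key order, and the counter values are made up. The helper names prefix_stop and sum_group are hypothetical. One assumption worth stating: the stop row has to sort after every key that shares the prefix, and since "." (0x2E) sorts after "\0", the sketch derives the exclusive stop row by incrementing the last byte of the prefix.

```python
from bisect import bisect_left

def prefix_stop(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`."""
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        raise ValueError("prefix has no upper bound")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def sum_group(rows, prefix: bytes) -> int:
    """Sum the counters of all rows whose key starts with `prefix`.

    `rows` is a list of (key, counter) tuples sorted by key, standing in
    for a table scanned in row-key order.
    """
    keys = [k for k, _ in rows]
    lo = bisect_left(keys, prefix)               # start row, inclusive
    hi = bisect_left(keys, prefix_stop(prefix))  # stop row, exclusive
    return sum(count for _, count in rows[lo:hi])

# Illustrative data only: the counter values are invented.
rows = sorted([
    (b"037.113.031.119", 4),
    (b"058.022.018.176", 2),
    (b"058.022.159.151", 5),
    (b"109.169.201.076", 1),
    (b"109.169.201.150", 3),
    (b"109.254.019.140", 7),
])

print(sum_group(rows, b"058."))      # 7
print(sum_group(rows, b"109.169."))  # 4
```

Including the trailing "." in the prefix keeps the range tight: it matches only keys that continue with another octet.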

-- Lars

________________________________
 From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
To: user <[EMAIL PROTECTED]>
Sent: Sunday, January 27, 2013 8:51 AM
Subject: Tables vs CFs vs Cs
 
Hi,

Let's imagine this scenario.

I want to store IPs with counters. And I want to have counters by
groups of IPs. All of that will be calculated with MR jobs and stored
in HBase.

Let's take some IPs and make sure they are ordered by adding some "0"
when required.

037.113.031.119
058.022.018.176
058.022.159.151
109.169.201.076
109.169.201.150
109.254.019.140
122.031.039.016
122.224.005.210
178.137.167.041
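
To show why the padding matters, here is a small Python sketch (the helper name pad_ip is hypothetical, and the unpadded inputs are derived from the list above). Without the leading zeros, a plain string sort puts "109..." before "37..."; with them, lexicographic order matches numeric order.

```python
def pad_ip(ip: str) -> str:
    """Zero-pad each octet of a dotted IPv4 address to three digits."""
    octets = ip.split(".")
    valid = len(octets) == 4 and all(
        o.isascii() and o.isdigit() and int(o) <= 255 for o in octets
    )
    if not valid:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(f"{int(o):03d}" for o in octets)

raw = ["109.169.201.76", "58.22.18.176", "37.113.31.119", "109.254.19.140"]

# Plain string sort: "109..." sorts before "37..." and "58...".
print(sorted(raw))

# Padded keys sort in numeric order, which is what a range scan needs.
print(sorted(pad_ip(ip) for ip in raw))
```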

I want to have counters for all "levels" of those IPs, which means
counters for the following groups.

Group 1:
037
058
109
122
178

Group 2:

037.113
058.022
109.169
109.254
122.031
122.224
178.137

Group 3:

037.113.031
058.022.018
058.022.159
109.169.201
109.254.019
122.031.039
122.224.005
178.137.167

And group 4 is the complete IPs list.

Each time I see an IP, I will increment the corresponding value in each of the 4 groups.
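
The per-IP update can be sketched in a few lines of Python. This is only an in-memory illustration using a Counter, not an HBase or MR implementation, and the helper name group_keys is hypothetical: each padded IP yields one key per group level, and every key is incremented once.

```python
from collections import Counter

def group_keys(padded_ip):
    """Return the keys for groups 1 to 4 of a zero-padded IP."""
    octets = padded_ip.split(".")
    return [".".join(octets[:n]) for n in range(1, len(octets) + 1)]

counters = Counter()
for ip in ["058.022.018.176", "058.022.159.151", "058.022.018.176"]:
    counters.update(group_keys(ip))  # one increment per group level

print(group_keys("058.022.018.176"))
# ['058', '058.022', '058.022.018', '058.022.018.176']
print(counters["058"], counters["058.022.018"], counters["058.022.159.151"])
# 3 2 1
```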

What's the best way to store that, knowing that I want to be able to
easily list all the entries (range-based) from one group?

Option 1 is to have one table per group, each with 1 CF and 1 C.
Pros: Very easy to access, retrieve, etc.
Cons: Will generate 4 tables.

Option 2 is to have one table, but 1 CF per group.
Pros: Only one table, easy access.
Cons: I've heard that we should try to keep the number of CFs under 3.
It might have a bad impact on performance.

Option 3 is to have one table, one CF and one C per group.
Pros: Only one table, only one CF.
Cons: Access is less easy than with options 1 and 2.

I think Option 2 is the worst one. Option 1 is very easy to implement.
And for option 3, I don't see any benefit compared to option 1.

So I'm tempted to go with option 1, but I don't like the idea of
multiplying the tables.

Does anyone have any comments on which option might be the best one,
or want to propose another option?

JM
Jean-Marc Spaggiari 2013-01-27, 17:37
lars hofhansl 2013-01-27, 17:47
Jean-Marc Spaggiari 2013-01-27, 19:41
Andrew Purtell 2013-01-28, 19:49
Asaf Mesika 2013-01-28, 21:54