Re: Tables vs CFs vs Cs
lars hofhansl 2013-01-27, 17:47
Why would the number of distinct IPs you see vary between IPv4 and IPv6? (I'm assuming you're counting accesses or something.)
Do you need the counts for individual IPs? If not, you can pre-aggregate and only store (say) at the x.y.z level (harder for IPv6, obviously).
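As a rough illustration of pre-aggregating at the x.y.z level, here is a minimal Python sketch. The helper names (`prefix_of`, `pre_aggregate`) and the sample addresses are made up for the example, and it only models the counting step, not the MR job or the HBase write.

```python
from collections import Counter

def prefix_of(ip: str, levels: int = 3) -> str:
    """Return the first `levels` octets of a dotted IPv4 string."""
    return ".".join(ip.split(".")[:levels])

def pre_aggregate(ips, levels: int = 3) -> Counter:
    """Count observations per prefix instead of per individual IP."""
    return Counter(prefix_of(ip, levels) for ip in ips)

# Hypothetical sample data, zero-padded so that keys sort lexicographically.
seen = [
    "058.010.001.007",
    "058.010.001.200",
    "058.010.002.001",
    "109.169.000.005",
]
counts = pre_aggregate(seen)
for prefix, count in sorted(counts.items()):
    print(prefix, count)
```

Only one row per x.y.z prefix would then need to be stored, at the cost of losing the per-IP counts.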
You could also store IPs and prefixes (networks) in the same table:
That may or may not have some nice properties based on your access patterns. Otherwise multiple tables seem fine.
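One possible reading of the same-table idea, sketched in Python with a plain dict standing in for the table. The key layout (one row per prefix level plus one row per full IP) is an assumption for illustration and not necessarily the layout meant here; `keys_for` and `record` are hypothetical helpers.

```python
from collections import defaultdict

def keys_for(ip: str):
    """Row keys for every prefix level of a dotted, zero-padded IP, ending with the IP itself."""
    octets = ip.split(".")
    return [".".join(octets[:n]) for n in range(1, len(octets) + 1)]

def record(table, ip: str, amount: int = 1) -> None:
    """Increment the counter of the IP row and of each of its prefix rows."""
    for key in keys_for(ip):
        table[key] += amount

# The dict models a single table with one counter column (assumption).
table = defaultdict(int)
for ip in ["058.010.001.007", "058.010.001.200",
           "058.200.000.001", "109.169.000.005"]:
    record(table, ip)

# A group total becomes a direct lookup instead of a scan.
print(table["058"], table["058.010"], table["109.169"])
```

With this layout the total for a group is a single get, but every observed IP costs one increment per level, and a range scan returns prefix rows and IP rows interleaved.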
From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Sent: Sunday, January 27, 2013 9:37 AM
Subject: Re: Tables vs CFs vs Cs
What I would like is to have a faster (direct?) access to the number
of entries starting with "058".
For IPv4 it's 0 to 255, so it works fine. For IPv6, it can take a
while to scan the full range and aggregate.
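The read-time aggregation discussed in this thread amounts to a range scan over sorted row keys followed by a client-side sum. The sketch below models that with a sorted Python list rather than a real HBase scanner; the rows and the helper names (`prefix_stop_row`, `scan_sum`) are assumptions. Note that with plain string keys such as "058.010.001.007", a stop row of "058\0" sorts before every "058." key, so this sketch builds the stop row by incrementing the last character of the prefix instead.

```python
from bisect import bisect_left

def prefix_stop_row(prefix: str) -> str:
    """Smallest key that sorts after every key starting with `prefix`.

    Increments the last character; assumes it is not the maximum code point.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def scan_sum(rows, start_row: str, stop_row: str) -> int:
    """Sum the counters of all keys in [start_row, stop_row), stop exclusive."""
    keys = [key for key, _ in rows]
    lo = bisect_left(keys, start_row)
    hi = bisect_left(keys, stop_row)
    return sum(count for _, count in rows[lo:hi])

# Hypothetical per-IP counters, sorted by key like rows in a table.
rows = sorted([
    ("058.010.001.007", 3),
    ("058.200.000.001", 2),
    ("059.000.000.001", 7),
    ("109.169.000.005", 4),
    ("109.169.255.001", 1),
    ("109.170.000.001", 9),
])

total_058 = scan_sum(rows, "058", prefix_stop_row("058"))
total_109_169 = scan_sum(rows, "109.169", prefix_stop_row("109.169"))
print(total_058, total_109_169)
```

The work is proportional to the number of rows under the prefix, which is why this is cheap for a narrow IPv4 group and slow for a wide IPv6 range.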
2013/1/27, lars hofhansl <[EMAIL PROTECTED]>:
> I might be missing something. Why not just have a counter per IP and then
> aggregate at read time?
> If you wanted the total of the 058 group you'd start a scanner with "058" as
> start row and "058\0" as stop row. On the client you sum up the counters.
> Similarly for the 109.169 group. Start with "109.169" and stop "109.169\0".
> -- Lars
> From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> To: user <[EMAIL PROTECTED]>
> Sent: Sunday, January 27, 2013 8:51 AM
> Subject: Tables vs CFs vs Cs
> Let's imagine this scenario.
> I want to store IPs with counters. And I want to have counters by
> groups of IPs. All of that will be calculated with MR jobs and stored
> in HBase.
> Let's take some IPs and make sure they are ordered by adding some "0"
> when required.
> I want to have counters for all "levels" of those IPs. Which means for
> those groups.
> Group 1:
> Group 2:
> Group 3:
> And group 4 is the complete IPs list.
> Each time I see an IP, I will increment the required values into the 4
> What's the best way to store that, knowing that I want to be able to
> easily list all the entries (range based) from one group?
> Option 1 is to have one table per group. 1 CF, 1 C.
> Pros: Very easy to access, retrieve, etc.
> Cons: Will generate 4 tables
> Option 2 is to have one table, but 1 CF per group.
> Pros: Only one table, easy access.
> Cons: Heard that we should try to keep CFs under 3. Might have a bad
> performance impact.
> Option 3 is to have one table, one CF and one C per group.
> Pros: Only one table, only one CF.
> Cons: Access is less easy than with options 1 and 2.
> I think Option 2 is the worst one. Option 1 is very easy to implement.
> And for option 3, I don't see any benefit compared to option 1.
> So I'm tempted to go with option 1, but I don't like the idea of
> multiplying the tables.
> Does anyone have any comments on which option might be the best,
> or even propose another option?