Jean-Marc Spaggiari 2013-01-27, 16:51
lars hofhansl 2013-01-27, 17:28
Jean-Marc Spaggiari 2013-01-27, 17:37
lars hofhansl 2013-01-27, 17:47
Jean-Marc Spaggiari 2013-01-27, 19:41
Andrew Purtell 2013-01-28, 19:49
I would go with a single table and encode the group in the row key.
Row Key Structure: <group-depth><A group><B group><C group><D group>
group-depth: 1..4, encoded as 1 byte
A-D groups: each encoded as 1 byte, not as a string
Column Family: "c" - stands for counters
Column Qualifier: "t" - stands for total
When you get a request from 192.168.1.10, you need to increment 4 rows, so
build 4 Increment objects and send them to HBase using HTable.batch. Each
Increment object will increment the "t" column.
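To make the four rows concrete, here is a minimal sketch in plain Java (no HBase dependency) of how the row keys for one IP could be composed. It assumes variable-length keys, i.e. the depth byte followed by only the first `depth` octets; the thread does not say whether unused groups are padded. The class and method names (IpGroupKeys, rowKey, keysFor) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IpGroupKeys {

    /** Parses a dotted-quad IPv4 string into four octets (each 0..255). */
    public static int[] parse(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        int[] octets = new int[4];
        for (int i = 0; i < 4; i++) {
            int value = Integer.parseInt(parts[i]);
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("Octet out of range: " + ip);
            }
            octets[i] = value;
        }
        return octets;
    }

    /**
     * Assumed layout: 1 depth byte, then the first `depth` octets,
     * each stored as one raw byte rather than as a string.
     */
    public static byte[] rowKey(int depth, int[] octets) {
        byte[] key = new byte[1 + depth];
        key[0] = (byte) depth;
        for (int i = 0; i < depth; i++) {
            key[1 + i] = (byte) octets[i];
        }
        return key;
    }

    /** One key per group depth (1..4); each would back one Increment. */
    public static List<byte[]> keysFor(String ip) {
        int[] octets = parse(ip);
        List<byte[]> keys = new ArrayList<>();
        for (int depth = 1; depth <= 4; depth++) {
            keys.add(rowKey(depth, octets));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Print each key with its bytes shown as unsigned numbers.
        for (byte[] key : keysFor("192.168.1.10")) {
            StringBuilder sb = new StringBuilder();
            for (byte b : key) {
                sb.append('<').append(b & 0xFF).append('>');
            }
            System.out.println(sb);
        }
    }
}
```

Each of the four byte arrays would then be used as the row of one Increment on the "t" column, and the four Increments sent together with HTable.batch.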
When you scan, simply scan the row range for the group. For example, the
whole 192.168 group can be fetched by scanning rows with the prefix
<2><192><168> (each number is one byte in the byte array you compose as the
prefix). You'll get back at most 255 rows.
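A prefix scan is usually expressed as a start row (the prefix itself) and an exclusive stop row (the smallest key that no longer carries the prefix). The sketch below computes that stop row in plain Java; the class and method names (PrefixRange, stopRowFor) are hypothetical, and the two arrays would be handed to whatever scan API you use.

```java
import java.util.Arrays;

public class PrefixRange {

    /**
     * Returns the exclusive stop row for a prefix scan: the prefix with
     * its last byte incremented. Trailing 0xFF bytes cannot be incremented,
     * so they are dropped first. An empty result means "no upper bound".
     */
    public static byte[] stopRowFor(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0]; // prefix was empty or all 0xFF
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        // Prefix from the example: depth 2, group 192.168.
        byte[] start = {2, (byte) 192, (byte) 168};
        byte[] stop = stopRowFor(start);
        StringBuilder sb = new StringBuilder();
        for (byte b : stop) {
            sb.append('<').append(b & 0xFF).append('>');
        }
        System.out.println(sb);
    }
}
```

If keys are variable-length (an assumption, the thread does not specify), the same helper with the prefix <3><192><168> would cover the depth-3 groups beneath 192.168.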
In IPv4 you can have, on a popular site, 6-7 million unique IPs in 10
minutes of traffic.
You can enhance it by adding a column qualifier for each hour, built by
converting the epoch of that hour (a long) into a byte array, on top of
the all-hours total counter. This way you can filter the traffic by a
range of dates/hours.
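As a rough illustration of the hourly qualifier, the sketch below truncates a timestamp to the start of its hour and encodes it as an 8-byte big-endian array. It assumes the epoch is in milliseconds (the thread only says "long"), and the names HourQualifier, hourStart and qualifier are invented for this example. Big-endian encoding keeps non-negative timestamps in chronological order when compared byte by byte, which is what makes filtering by a range of hours practical.

```java
import java.nio.ByteBuffer;

public class HourQualifier {

    // Assumption: timestamps are epoch milliseconds.
    static final long HOUR_MS = 3_600_000L;

    /** Rounds a timestamp down to the start of its hour. */
    public static long hourStart(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, HOUR_MS);
    }

    /** Encodes the hour start as 8 bytes; ByteBuffer is big-endian by default. */
    public static byte[] qualifier(long epochMillis) {
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(hourStart(epochMillis))
                .array();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        byte[] q = qualifier(now);
        System.out.println("hour start = " + hourStart(now)
                + ", qualifier length = " + q.length);
    }
}
```

Each request would then increment two columns in every one of the four rows: the "t" total and the qualifier for the current hour.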
On Sun, Jan 27, 2013 at 6:51 PM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:
> Let's imagine this scenario.
> I want to store IPs with counters. And I want to have counters by
> groups of IPs. All of that will be calculated with MR jobs and stored
> in HBase.
> Let's take some IPs and make sure they are ordered by adding some "0"
> when required.
> I want to have counters for all "levels" of those IPs, which means for
> those groups.
> Group 1:
> Group 2:
> Group 3:
> And group 4 is the complete IPs list.
> Each time I see an IP, I will increment the required values into the 4
> What's the best way to store that, knowing that I want to be able to
> easily list all the entries (range-based) from one group?
> Option 1 is to have one table per group. 1CF, 1C
> Pros: Very easy to access, retrieve, etc.
> Cons: Will generate 4 tables
> Option 2 is to have one table, but 1 CF per group.
> Pros: Only one table, easy access.
> Cons: Heard that we should try to keep CFs under 3. Might have a bad
> performance impact.
> Option 3 is to have one table, one CF and one C per group.
> Pros: Only one table, only one CF.
> Cons: Access is less easy than option 1 and 2.
> I think Option 2 is the worst one. Option 1 is very easy to implement.
> And for option 3, I don't see any benefit compared to option 1.
> So I'm tempted to go with option 1, but I don't like the idea of
> multiplying the table.
> Does anyone have any comments on which option might be the best one,
> or want to propose another option?