Jean-Marc Spaggiari 2013-01-27, 16:51
Hi,
Let's imagine this scenario.
I want to store IPs with counters. And I want to have counters by groups of IPs. All of that will be calculated with MR jobs and stored in HBase.
Let's take some IPs and make sure they sort correctly by padding with "0" where required.

037.113.031.119
058.022.018.176
058.022.159.151
109.169.201.076
109.169.201.150
109.254.019.140
122.031.039.016
122.224.005.210
178.137.167.041

I want to have counters for all "levels" of those IPs, which means for these groups.

Group 1:
037
058
109
122
178

Group 2:
037.113
058.022
109.169
109.254
122.031
122.224
178.137

Group 3:
037.113.031
058.022.018
058.022.159
109.169.201
109.254.019
122.031.039
122.224.005
178.137.167
And group 4 is the complete IPs list.
Each time I see an IP, I will increment the required values into the 4 groups.
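As an illustration only, the four keys to increment for one IP could be derived like this. The helper names `pad_ip` and `group_keys` are hypothetical and not part of any HBase API; this is a minimal sketch of the padding and grouping described above.

```python
def pad_ip(ip: str) -> str:
    """Zero-pad each octet to three digits so string order matches numeric order."""
    return ".".join(f"{int(octet):03d}" for octet in ip.split("."))


def group_keys(ip: str) -> list:
    """Return the key for each of the four groups, from widest to narrowest."""
    octets = pad_ip(ip).split(".")
    return [".".join(octets[:depth]) for depth in range(1, len(octets) + 1)]


# One observed IP yields one key per group (groups 1 to 4).
for key in group_keys("58.22.18.176"):
    print(key)
```

Each of the printed keys is a counter to increment, whichever storage layout is chosen.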
What's the best way to store that, knowing that I want to be able to easily list all the entries (range based) from one group?

Option 1 is to have one table per group. 1CF, 1C
Pros: Very easy to access, retrieve, etc.
Cons: Will generate 4 tables

Option 2 is to have one table, but 1 CF per group.
Pros: Only one table, easy access.
Cons: Heard that we should try to keep CFs under 3. Might have a bad performance impact.

Option 3 is to have one table, one CF and one C per group.
Pros: Only one table, only one CF.
Cons: Access is less easy than with options 1 and 2.

I think Option 2 is the worst one. Option 1 is very easy to implement. And for option 3, I don't see any benefit compared to option 1.

So I'm tempted to go with option 1, but I don't like the idea of multiplying the tables.

Does anyone have any comment on which option might be the best one, or can even propose another option?
JM
-
Re: Tables vs CFs vs Cs
lars hofhansl 2013-01-27, 17:28
I might be missing something. Why not just have a counter per IP and then aggregate at read time? If you wanted the total of the 058 group you'd start a scanner with "058" as start row and "058\0" as stop row. On the client you sum up the counter values. Similarly for the 109.169 group: start with "109.169" and stop at "109.169\0".
-- Lars
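The read-time aggregation described above can be sketched in plain Python, as an illustration only. A sorted list stands in for HBase's ordered row keys, the counter values are made up, and the helper names `prefix_stop` and `scan_sum` are hypothetical. The sketch takes the exclusive stop key to be the prefix with its last character incremented, which is an assumption of this example rather than the stop row quoted in the message.

```python
from bisect import bisect_left

# Hypothetical per-IP counters; sorting mimics the ordering of row keys.
rows = sorted({
    "037.113.031.119": 4,
    "058.022.018.176": 2,
    "058.022.159.151": 5,
    "109.169.201.076": 1,
    "109.169.201.150": 3,
    "109.254.019.140": 7,
}.items())
keys = [key for key, _ in rows]


def prefix_stop(prefix: str) -> str:
    """Exclusive stop key: the prefix with its last character incremented."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def scan_sum(prefix: str) -> int:
    """Sum the counters of every row whose key starts with the prefix."""
    lo = bisect_left(keys, prefix)
    hi = bisect_left(keys, prefix_stop(prefix))
    return sum(count for _, count in rows[lo:hi])


print(scan_sum("058"))      # 2 + 5 = 7
print(scan_sum("109.169"))  # 1 + 3 = 4
```

The cost of this approach grows with the number of rows under the prefix, which is the concern raised in the next message.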
-
Re: Tables vs CFs vs Cs
Jean-Marc Spaggiari 2013-01-27, 17:37
What I would like is to have a faster (direct?) access to the number of entries starting with "058".
For IPv4 it's 0 to 255, so that works fine. For IPv6, though, it can take a while to scan the full range and aggregate.
JM
-
Re: Tables vs CFs vs Cs
lars hofhansl 2013-01-27, 17:47
I see.
Why would the number of distinct IPs you see vary between IPv4 and IPv6? (I'm assuming you're counting accesses or something.) Do you need the counts for individual IPs? If not, you can pre-aggregate and only store (say) at the x.y.z level (harder for IPv6, obviously).

You could also store IPs and prefixes (networks) in the same table:

109
109.169
109.169.201
109.169.201.150
109.254
109.254.019
109.254.019.140
etc.
That may or may not have some nice properties based on your access patterns. Otherwise multiple tables seem fine.
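One property of the mixed layout can be checked with a quick sketch: when networks and full IPs share one key space, plain lexicographic sorting places each network row immediately before its own descendants. The snippet below uses only Python's built-in `sorted` as a stand-in for row-key ordering; it is an illustration, not HBase code.

```python
# Keys inserted in arbitrary order, mixing networks and full addresses.
mixed_keys = [
    "109.254.019.140",
    "109",
    "109.169.201.150",
    "109.254",
    "109.169",
    "109.169.201",
    "109.254.019",
]

# Lexicographic order groups every prefix with the rows beneath it.
ordered = sorted(mixed_keys)
for key in ordered:
    print(key)
```

A scan of one level only (for example, just the x.y networks) would still have to skip over the deeper rows interleaved between them.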
-- Lars
-
Re: Tables vs CFs vs Cs
Jean-Marc Spaggiari 2013-01-27, 19:41
The numbers for the IPs will still be the same, but for IPv6 the range will be from 0 to 2^128.
So to get an idea of the Group 1 total for one value, I will have to scan from 109 to 109\00 and count all the rows, but there might be 2^(3x128) rows in the range (worst case because of IPv6), which will take a while to scan.
So having all the values in the same table seems to be very difficult.
What about having one column per group? All the columns from the same CF are in the same file, right? So to get all the values from Group 1, I might have to do MANY skips, which might impact performance negatively?
I want to be able to display the first x group 1 entries, then pick one, and list the first x group 2 entries starting with the picked value, and so on (walking down the tree).
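The "walking down the tree" access pattern can be sketched as follows, assuming one sorted key list per group (standing in for one table per group). The function name `first_children` and the sample data are hypothetical.

```python
from bisect import bisect_left


def first_children(table_keys, parent, limit):
    """Return up to `limit` keys from a sorted list that sit under `parent`."""
    prefix = parent + "." if parent else ""
    start = bisect_left(table_keys, prefix)
    children = []
    for key in table_keys[start:]:
        if not key.startswith(prefix) or len(children) == limit:
            break
        children.append(key)
    return children


group1 = ["037", "058", "109", "122", "178"]
group2 = ["037.113", "058.022", "109.169", "109.254",
          "122.031", "122.224", "178.137"]

top = first_children(group1, "", 3)          # first 3 entries of group 1
below = first_children(group2, "109", 10)    # group 2 entries under 109
print(top, below)
```

With one table per group, each step is a short range read that never touches rows from other levels.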
It really sounds like the best way is to go with different tables...

Also, regarding the aggregation, I have 2 options.
I use the regular MR process. Map will remove the extra digits and reduce will count/aggregate them. But I will need to run 3 MRs, one per group.
Or.
I can do one single MR job, with no reduce, and in the map I do increments based on G1, G2 and G3.
Option 1's pro is that it uses the MR process completely. Option 2's pro is that it scans the table only once.
It sounds like Option 2 is better, but maybe there is something I'm missing?
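The single-pass variant can be sketched in plain Python, with `collections.Counter` standing in for the increments a map task would issue against HBase. The function name `count_groups` is hypothetical, and the input is assumed to be already zero-padded.

```python
from collections import Counter


def count_groups(ips):
    """One pass over the input: bump the G1, G2 and G3 counters for each IP."""
    counters = {depth: Counter() for depth in (1, 2, 3)}
    for ip in ips:
        octets = ip.split(".")
        for depth, counter in counters.items():
            counter[".".join(octets[:depth])] += 1
    return counters


seen = [
    "058.022.018.176",
    "058.022.159.151",
    "109.169.201.076",
    "109.169.201.150",
]
totals = count_groups(seen)
print(totals[1]["058"], totals[2]["058.022"], totals[3]["109.169.201"])
```

The trade-off is three increments per input row instead of three separate scans of the source table.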
JM
-
Re: Tables vs CFs vs Cs
Andrew Purtell 2013-01-28, 19:49
IPv6 can support up to 281,474,976,710,656 networks. Assuming you only want to group by networks, that is already a potentially very large keyspace. The *minimum* number of distinct addresses a V6 network can contain (the smallest advertisable prefix is /48) is 1,208,925,819,614,629,174,706,176. This is a bigger problem, because if you are also counting distinct addresses, then let's hope the observations you are counting within this space are very, very sparse, or yeah, it may take a while to calculate that aggregate.

I don't have a good answer for adjusting to the scale of IPv6, but old V4 notions of counting distinct addresses by address may no longer be useful. Consider a device on a /48. It could use a unique address for every packet and not exhaust its network space for 383,093,657,352 years at the rate of 100Kpps. This is a pathological case (we assume malicious actors), but the question remains: in V6, is it useful to use an address as a proxy for the identity of a unique endpoint?

Counting by a product GUID instead would bring the size of the keyspace down into the millions of rows only. This seems a good alternate strategy. If you don't control the endpoint and still want to count unique conversations, I would determine the physical path between endpoints and construct an identifier based on that. Our planet is very small compared to the astronomical scale of V6.
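The figures above can be reproduced with a few lines of arithmetic. The number of /48 networks is 2^48 and each holds 2^(128-48) addresses. The length of a year used below (365.2425 days) is an assumption of this sketch, so the year count agrees with the quoted value only approximately.

```python
# Number of /48 networks and addresses inside a single /48.
networks = 2 ** 48
addresses_per_network = 2 ** (128 - 48)

# Time to exhaust one /48 at 100,000 packets per second,
# using one fresh address per packet.
packets_per_second = 100_000
seconds_per_year = 365.2425 * 24 * 3600  # assumed year length
years = addresses_per_network / packets_per_second / seconds_per_year

print(f"{networks:,}")
print(f"{addresses_per_network:,}")
print(f"{years:.3e}")
```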
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: Tables vs CFs vs Cs
Asaf Mesika 2013-01-28, 21:54
I would go on using the row-key, on one table.
= Row Key Structure
<group-depth><A group><B group><C group><D group>

group-depth: 1..4, encoded as 1 byte
A-D group: each encoded as 1 byte and not as a string

Examples:
<1><192>
<2><192><168>
<3><192><168><1>
<4><192><168><1><10>

Column Qualifier: "c" - stands for counters
Column Qualifier: "t" - stands for total
When you get a request for 192.168.1.10, you need to increase 4 rows, so build 4 Increment objects and send them to HBase using HTable.batch. Each Increment object will increase the "t" column.
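A minimal sketch of the key layout follows, in plain Python rather than the HBase client. The function name `row_keys` is hypothetical, and a `Counter` stands in for the batch of four increments on the "t" column.

```python
from collections import Counter


def row_keys(ip: str) -> list:
    """Build the four binary row keys <depth><A>[<B>[<C>[<D>]]] for one IPv4 address."""
    octets = [int(part) for part in ip.split(".")]
    return [bytes([depth] + octets[:depth]) for depth in range(1, len(octets) + 1)]


# Stand-in for the "t" column: one counter per row key.
totals = Counter()
for key in row_keys("192.168.1.10"):
    totals[key] += 1  # in HBase this would be one Increment per row
    print(key.hex())
```

Because every component is a single byte, the keys sort by depth first and then numerically by octet, with no zero-padding needed.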
When you scan, simply scan the range for the group. For example, the entries under 192.168 can be fetched by scanning rows with the prefix <3><192><168> (each number is one byte in the byte array you compose as the prefix). You'll get back at most 256 rows.
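The range scan can be sketched over a sorted list of binary keys. Under the key layout described above, the rows one level below 192.168 carry depth byte 3, so this sketch scans the prefix <3><192><168>. The helper names `prefix_stop` and `scan_prefix` are hypothetical, and the stop-key rule (increment the last byte that is not 0xff) is an assumption of this example.

```python
from bisect import bisect_left


def prefix_stop(prefix: bytes) -> bytes:
    """Exclusive stop key: drop trailing 0xff bytes, then increment the last byte."""
    trimmed = prefix.rstrip(b"\xff")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan_prefix(sorted_keys, prefix: bytes) -> list:
    """Return every key in the sorted list that starts with the prefix."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, prefix_stop(prefix))
    return sorted_keys[lo:hi]


table = sorted([
    bytes([2, 192, 168]),
    bytes([3, 192, 168, 1]),
    bytes([3, 192, 168, 7]),
    bytes([3, 192, 169, 1]),
    bytes([4, 192, 168, 1, 10]),
])
children = scan_prefix(table, bytes([3, 192, 168]))
print([key.hex() for key in children])
```

Since the last key component is one byte, such a scan is bounded by the 256 possible values of that byte.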
In IPv4 you can have, on a popular site, 6-7 million unique IPs in 10 minutes of traffic.
You can enhance it by having a column qualifier for each hour, built by converting the epoch of that hour (a long) into a byte array, on top of the all-hours total counter. This way you can filter the traffic by a range of dates/hours.
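The hourly qualifier could be built as below. This is a sketch under assumptions: the epoch is taken in seconds, it is encoded as an 8-byte big-endian signed long, and the name `hour_qualifier` is hypothetical.

```python
import struct


def hour_qualifier(epoch_seconds: int) -> bytes:
    """Truncate a timestamp to the start of its hour and pack it as 8 big-endian bytes."""
    hour_start = epoch_seconds - (epoch_seconds % 3600)
    return struct.pack(">q", hour_start)


example = 1359410040  # an arbitrary example timestamp, in seconds
this_hour = hour_qualifier(example)
next_hour = hour_qualifier(example + 3600)
print(this_hour.hex(), next_hour.hex())
```

For non-negative timestamps the big-endian encoding sorts in the same order as the numeric values, which is what makes a range of hours expressible as a range of qualifiers.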