Re: Tables vs CFs vs Cs
Jean-Marc Spaggiari 2013-01-27, 19:41
The numbers for the IPs will still be the same, but for IPv6 the range
will be from 0 to 2^128.
So to get an idea of the Group1 total for one value, I will have to
scan from 109 to 109\00 and count all the lines, but there might be
2^(3x128) rows in the range (worst case because of IPv6), which will
take a while to scan.
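
To make the cost of that prefix scan concrete, here is a minimal sketch in plain Python. It is not HBase client code: a sorted list of (key, counter) tuples stands in for the table, the sample IPs are made up, and the way the stop key is built (incrementing the last character of the prefix) is an assumption of this sketch, not something taken from the thread.

```python
from bisect import bisect_left

def prefix_stop(prefix):
    """Return the smallest string greater than every key starting with prefix.

    Assumption of this sketch: incrementing the last character is enough,
    which holds for the zero-padded ASCII keys used here.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def sum_prefix(sorted_rows, prefix):
    """Sum the counters of all rows whose key starts with prefix.

    sorted_rows is a list of (key, counter) tuples sorted by key,
    standing in for a table scanned in row-key order.
    """
    keys = [key for key, _ in sorted_rows]
    start = bisect_left(keys, prefix)               # start row, inclusive
    stop = bisect_left(keys, prefix_stop(prefix))   # stop row, exclusive
    return sum(count for _, count in sorted_rows[start:stop])

# Hypothetical sample data: zero-padded IPv4 addresses with a counter each.
rows = sorted([
    ("058.001.002.003", 4),
    ("058.200.010.001", 1),
    ("109.169.001.001", 7),
    ("109.169.001.002", 2),
    ("109.200.000.001", 5),
    ("110.000.000.001", 9),
])

group1_total = sum_prefix(rows, "109")
group2_total = sum_prefix(rows, "109.169")
```

The sum still has to touch every row in the range, which is the part that gets expensive when the range can hold an IPv6-sized number of rows.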
So having all the values in the same table seems to be very difficult.
What about having one column per group? All the columns from the
same CF are in the same file, right? So to get all the values from
Group 1, I might have to do MANY skips, which might impact the
I want to be able to display the first x group1 entries, then pick
one, and list the first x group2 entries starting with the picked
value, and so on. (walking down the tree)
It really sounds like the best way is to go with different tables...
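
As a rough illustration of that walk with one table per group, here is a small Python sketch. The sorted lists stand in for the row keys of two separate tables, and the names first_entries, group1 and group2 as well as the sample keys are hypothetical.

```python
from bisect import bisect_left

def first_entries(sorted_keys, prefix, limit):
    """Return up to limit keys that start with prefix, in key order."""
    start = bisect_left(sorted_keys, prefix)
    page = []
    for key in sorted_keys[start:]:
        # Stop at the first key outside the prefix or once the page is full.
        if not key.startswith(prefix) or len(page) == limit:
            break
        page.append(key)
    return page

# Hypothetical row keys of the group 1 and group 2 tables.
group1 = ["058", "109", "110"]
group2 = ["058.001", "109.169", "109.200", "110.000"]

top = first_entries(group1, "", 2)                  # first 2 group1 entries
picked = top[1]                                     # the user picks one
children = first_entries(group2, picked + ".", 10)  # its group2 entries
```

Each step only reads the page it displays, so the cost does not depend on how many full IPs sit below the picked entry.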
Also, regarding the aggregation, I have 2 options.
Option 1: I use the regular MR process. Map will remove the extra digits and
reduce will count/aggregate them. But I will need to run 3 MRs. One per
Option 2: I can do one single MR job, with no reduce, and in the map I do
increments based on G1, G2 and G3.
Option 1 pro is to use the MR process completely.
Option 2 pro is that it scans the table only once.
Sounds like Option 2 is better, but maybe there is something I'm missing?
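
To show what I mean by Option 2, here is a minimal single-pass sketch in plain Python. It is not MR or HBase code: Counter objects stand in for the four tables, the increments are plain in-memory additions, the function names are hypothetical, and only IPv4 is handled.

```python
from collections import Counter

def pad_ip(ip):
    """Zero-pad each IPv4 octet to 3 digits so keys sort correctly."""
    return ".".join(part.zfill(3) for part in ip.split("."))

def group_keys(ip):
    """Return the group 1, 2, 3 and full-IP keys for one address."""
    parts = pad_ip(ip).split(".")
    return [".".join(parts[:n]) for n in range(1, 5)]

def aggregate(ips):
    """Read the input once and increment all four levels per address."""
    tables = [Counter() for _ in range(4)]  # G1, G2, G3 and full IPs
    for ip in ips:
        for table, key in zip(tables, group_keys(ip)):
            table[key] += 1
    return tables

# Hypothetical sample input.
g1, g2, g3, g4 = aggregate([
    "58.1.2.3",
    "109.169.1.1",
    "109.169.1.2",
    "109.200.0.1",
])
```

The input is read only once, at the price of four increments per address instead of one.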
2013/1/27, lars hofhansl <[EMAIL PROTECTED]>:
> I see.
> Why would the number of distinct IPs you see vary between IPv4 and IPv6?
> (I'm assuming you're counting access or something)
> Do you need the counts for individual IPs? If not you can pre-aggregate and
> only store (say) at the x.y.z level (harder for IPv6 obviously).
> You could also store IPs and prefixes (networks) in the same table:
> That may or may not have some nice properties based on your access patterns.
> Otherwise multiple tables seem fine.
> -- Lars
> From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Sent: Sunday, January 27, 2013 9:37 AM
> Subject: Re: Tables vs CFs vs Cs
> What I would like is to have a faster (direct?) access to the number
> of entries starting with "058".
> For IPv4 it's 0 to 255, so working fine. For IPv6, it can take a
> while to scan the full range and aggregate.
> 2013/1/27, lars hofhansl <[EMAIL PROTECTED]>:
>> I might be missing something. Why not just have a counter per IP and
>> aggregate at read time?
>> If you wanted the total of the 058 group you'd start a scanner with "058"
>> start row and "058\0" as stop row. On the client you sum up the counter
>> Similarly for the 109.169 group. Start with "109.169" and stop
>> -- Lars
>> From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>> To: user <[EMAIL PROTECTED]>
>> Sent: Sunday, January 27, 2013 8:51 AM
>> Subject: Tables vs CFs vs Cs
>> Let's imagine this scenario.
>> I want to store IPs with counters. And I want to have counters by
>> groups of IPs. All of that will be calculated with MR jobs and stored
>> in HBase.
>> Let's take some IPs and make sure they are ordered by adding some "0"
>> when required.
>> I want to have counters for all "levels" of those IPs. Which means for
>> those groups.
>> Group 1:
>> Group 2:
>> Group 3:
>> And group 4 is the complete IPs list.
>> Each time I see an IP, I will increment the required values into the 4