HBase >> mail # user >> Tables vs CFs vs Cs


Re: Tables vs CFs vs Cs
IPv6 can support up to 281,474,976,710,656 (2^48) networks. Assuming you only
want to group by networks, that is already a potentially very large keyspace.
The *minimum* number of distinct addresses a V6 network can contain (the
smallest advertisable network is a /48) is
1,208,925,819,614,629,174,706,176 (2^80). This is a bigger problem, because
if you are also counting distinct addresses, then let's hope the
observations you are counting within this space are very sparse, or yes, it
may take a while to calculate that aggregate.

I don't have a good answer for adjusting to the scale of IPv6, but old V4
notions of counting distinct addresses by address may no longer be useful.
Consider a device on a /48. It could use a unique address for every packet
and not exhaust its network space for 383,093,657,352 years at a rate of
100Kpps. This is a pathological case (we assume malicious actors), but the
question remains: in V6, is an address a useful proxy for the identity of a
unique endpoint?

Counting by a product GUID instead would bring the size of the keyspace down
into the millions of rows only. This seems a good alternative strategy. If
you don't control the endpoint and still want to count unique conversations,
I would determine the physical path between endpoints and construct an
identifier based on that. Our planet is very small compared to the
astronomical scale of V6.

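For anyone who wants to check the figures above, here is a small sketch of the arithmetic in Python. The constant names are made up for illustration, and the length of a year is an assumption (a tropical year of about 365.2422 days, which appears to reproduce the quoted number of years; a 365-day year gives a slightly different result).

```python
# Back-of-the-envelope check of the IPv6 figures in the message.
ADDRESS_BITS = 128             # total bits in an IPv6 address
PREFIX_BITS = 48               # /48, the network size assumed in the message
PACKETS_PER_SECOND = 100_000   # "100Kpps"
SECONDS_PER_YEAR = 365.2422 * 24 * 3600  # assumption: tropical year

# Number of distinct /48 networks.
networks = 2 ** PREFIX_BITS

# Number of addresses inside a single /48.
addresses_per_network = 2 ** (ADDRESS_BITS - PREFIX_BITS)

# Time to use every address in one /48 once, at one new address per packet.
years_to_exhaust = addresses_per_network / PACKETS_PER_SECOND / SECONDS_PER_YEAR

print(f"{networks:,}")               # 281,474,976,710,656
print(f"{addresses_per_network:,}")  # 1,208,925,819,614,629,174,706,176
print(f"{years_to_exhaust:.3e}")     # on the order of 3.8e+11 years
```

The point of the exercise is only the order of magnitude: a row per observed address is feasible when observations are sparse, but the address space itself cannot be enumerated.
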
On Sun, Jan 27, 2013 at 9:37 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> What I would like is to have a faster (direct?) access to the number
> of entries starting with "058".
>
> For IPv4 it's 0 to 255, so that works fine. But for IPv6, it can take a
> while to scan the full range and aggregate.
>
> JM
>
> 2013/1/27, lars hofhansl <[EMAIL PROTECTED]>:
> > I might be missing something. Why not just have a counter per IP and
> > then aggregate at read time?
> > If you wanted the total of the 058 group you'd start a scanner with
> > "058" as start row and "058\0" as stop row. On the client you sum up
> > the counter values.
> > Similarly for the 109.169 group. Start with "109.169" and stop
> > "109.169\0".
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> > To: user <[EMAIL PROTECTED]>
> > Sent: Sunday, January 27, 2013 8:51 AM
> > Subject: Tables vs CFs vs Cs
> >
> > Hi,
> >
> > Let's imagine this scenario.
> >
> > I want to store IPs with counters. And I want to have counters by
> > groups of IPs. All of that will be calculated with MR jobs and stored
> > in HBase.
> >
> > Let's take some IPs and make sure they are ordered by adding some "0"
> > when required.
> >
> > 037.113.031.119
> > 058.022.018.176
> > 058.022.159.151
> > 109.169.201.076
> > 109.169.201.150
> > 109.254.019.140
> > 122.031.039.016
> > 122.224.005.210
> > 178.137.167.041
> >
> > I want to have counters for all "levels" of those IPs, which means for
> > these groups.
> >
> > Group 1:
> > 037
> > 058
> > 109
> > 122
> > 178
> >
> > Group 2:
> >
> > 037.113
> > 058.022
> > 109.169
> > 109.254
> > 122.031
> > 122.224
> > 178.137
> >
> > Group 3:
> >
> > 037.113.031
> > 058.022.018
> > 058.022.159
> > 109.169.201
> > 109.254.019
> > 122.031.039
> > 122.224.005
> > 178.137.167
> >
> > And group 4 is the complete IPs list.
> >
> > Each time I see an IP, I will increment the required values into the 4
> > groups.
> >
> > What's the best way to store that, knowing that I want to be able to
> > easily list all the entries (range-based) from one group?
> >
> > Option 1 is to have one table per group. 1CF, 1C
> > Pros: Very easy to access, retrieve, etc.
> > Cons: Will generate 4 tables
> >
> > Option 2 is to have one table, but 1 CF per group.
> > Pros: Only one table, easy access.
> > Cons: Heard that we should try to keep CFs under 3. Might have a bad
> > performance impact.
> >
> > Option 3 is to have one table, one CF and one C per group.
> > Pros: Only one table, only one CF.
> > Cons: Access is less easy than option 1 and 2.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)