HBase >> mail # user >> Key formats and very low cardinality leading fields


Eric Czech 2012-09-03, 17:31
Jean-Marc Spaggiari 2012-09-03, 18:31
Eric Czech 2012-09-03, 19:06
Jean-Marc Spaggiari 2012-09-03, 19:20
Eric Czech 2012-09-03, 19:58
Michael Segel 2012-09-04, 17:34
Eric Czech 2012-09-04, 17:56
Michael Segel 2012-09-04, 18:03
Eric Czech 2012-09-04, 18:51

Re: Key formats and very low cardinality leading fields
Uhm...

This isn't very good.
In terms of inserting, you will hit a single region or a small subset of regions.

This may not be that bad if you have enough data and the rows are not all being inserted into the same region.

Since you're hitting an index to pull rows one at a time, you could do this... If you know the exact record you want, you could hash the key, and then you wouldn't have a hot-spotting problem.
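
To make the hashing suggestion concrete, here is a minimal sketch of one way it could look. Everything in it is an assumption for illustration, not something taken from this thread: the function name `hashed_row_key`, the `source:entity` layout of the natural key, the choice of SHA-256, and the 4-byte prefix length. It only builds the key bytes and does not talk to HBase.

```python
import hashlib


def hashed_row_key(source_id: int, entity_id: str, prefix_len: int = 4) -> bytes:
    """Build a row key whose leading bytes come from a hash of the natural key.

    Hypothetical layout (an assumption, not the poster's schema):
        <first prefix_len bytes of SHA-256(natural key)> + <natural key>
    """
    # Natural key: low-cardinality source id followed by the entity id.
    natural = f"{source_id}:{entity_id}".encode("utf-8")
    # The hash prefix spreads keys for one source across the whole key space.
    digest = hashlib.sha256(natural).digest()
    return digest[:prefix_len] + natural


if __name__ == "__main__":
    # The same inputs always give the same key, so a point lookup can
    # recompute the key from the source id and entity id alone.
    print(hashed_row_key(1, "artist-42"))
```

The trade-off is the one implied by "if you know the exact record you want": the key can be recomputed for a single-row read, but rows for one source are no longer stored contiguously, so a range scan over a source prefix stops being a single contiguous scan.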
On Sep 4, 2012, at 1:51 PM, Eric Czech <[EMAIL PROTECTED]> wrote:

> How does the data flow in to the system? One source at a time?
> Generally, it will be one source at a time where these rows are index entries built from MapReduce jobs
>
> The second field. Is it sequential?
> No. The index writes from the MapReduce jobs should dump a relatively small number of rows into HBase for each combination of first field and second field, then move on to another combination whose second field is not ordered in any way relative to the previous one.
>
> How are you using the data when you pull it from the database?
> I'm not totally sure which specific use cases you're asking about, but in a more general sense, the indexed data will power our web platform (we aggregate and manage data for the music industry) and serve as input to offline analytics processes. I'm placing the design priority on the interaction with the web platform, though, and the full row structure I'm intending to use is:
>
>
>
> This is similar to OpenTSDB and the service we provide is similar to what OpenTSDB was designed for, if that gives you a better sense of what I'd like to do with the data.
>
> On Tue, Sep 4, 2012 at 2:03 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Eric,
>
> So here's the larger question...
> How does the data flow in to the system? One source at a time?
>
> The second field. Is it sequential? If not sequential, is it going to be some sort of increment larger than the previous value? (Are you always inserting to the left side of the queue?)
>
> How are you using the data when you pull it from the database?
>
> 'Hot spotting' may be unavoidable and depending on other factors, it may be a moot point.
>
>
> On Sep 4, 2012, at 12:56 PM, Eric Czech <[EMAIL PROTECTED]> wrote:
>
> > Longer term .. what's really going to happen is more like I'll have a first
> > field value of 1, 2, and maybe 3.  I won't know 4 - 10 for a while and
> > the *second* value after each initial value will be, although highly unique, relatively
> > exclusive for a given first value.  This means that even if I didn't use
> > the leading prefix, I'd have more or less the same problem where all the
> > writes are going to the same region when I introduce a new set of second
> > values.
> >
> > In case the generalities are confusing, the prefix value is a data source
> > identifier and the second value is an identifier for entities within that
> > source.  The entity identifiers for a given source are likely to span
> > different numeric or alpha-numeric ranges, but they probably won't be the
> > same ranges across sources.  Also, I won't know all those ranges (or
> > sources for that matter) upfront.
> >
> > I'm concerned about the introduction of a new data source (= leading prefix
> > value) since the first writes will be to the same region and ideally I'd be
> > able to get a sense of how the second values are split for the new leading
> > prefix and split an HBase region to reflect that.  If that's not possible
> > or just turns out to be a pain, then I can live with the introduction of
> > the new prefix being a little slow until the regions split and distribute
> > effectively.
> >
> > That make sense?
> >
> > On Tue, Sep 4, 2012 at 1:34 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> >
> >> I think you have to understand what happens as a table splits.
> >> If you have a composite key where the first field has the value between
> >> 0-9 and you pre-split your table, you will have all of your 1's going to

Eric Czech 2012-09-05, 05:37
Tom Brown 2012-09-07, 03:05
Jean-Marc Spaggiari 2012-09-03, 20:11
Mohit Anchlia 2012-09-03, 21:19
Eric Czech 2012-09-04, 17:15
Jean-Marc Spaggiari 2012-09-04, 17:22
Eric Czech 2012-09-04, 17:31