Let me explain our processing model:
1. We decided that our work in HBase should have day granularity (e.g.
scan the rows between 20110301 and 20110304).
2. Once we persist a day's data in HBase, numerous scans work on this date, so
we want scans to be efficient: the key should start with the date to
allow start/end key scanning.
3. We use map/reduce to aggregate data from our logs. Each
aggregation is persisted as a column family in HBase (one map/reduce job
produces a few aggregations/families). We add the date as a key prefix at the
reduce stage before inserting the row to hbase (we don't need it in the
4. We could use bulk loading (hence persisting the reduce result to a sequence
file), but our HBase version (0.90.1) didn't support it.
5. Our main problem was that a single region was created for each day and it
took about an hour to write 10 million rows to this region.
6. The solution was to open empty regions according to the key distribution.
7. It worked: 10 million rows are now inserted in about 15 minutes (5
machines), which is good for us.
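
A minimal Java sketch of points 2 and 6, for illustration only (it is not the code used here). It assumes row keys of the form yyyyMMdd followed by the original key, and that a sorted sample of the un-prefixed keys is available to estimate the distribution. The class and method names (DayKeys, scanRange, splitKeys) are made up for the example. The resulting strings would still have to be converted to bytes and handed to HBase, for instance as the splitKeys argument of HBaseAdmin.createTable(HTableDescriptor, byte[][]) when creating a new table; how the empty regions were actually opened is not shown.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DayKeys {
    // yyyyMMdd (e.g. 20110301): lexicographic order equals date order.
    static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    /** Start row (inclusive) and stop row (exclusive) covering firstDay..lastDay. */
    static String[] scanRange(LocalDate firstDay, LocalDate lastDay) {
        // The stop row is the prefix of the day after lastDay, so every key
        // that starts with lastDay's prefix still falls inside the range.
        return new String[] { DAY.format(firstDay), DAY.format(lastDay.plusDays(1)) };
    }

    /**
     * Picks up to (regions - 1) boundaries at evenly spaced positions of a
     * sorted key sample and prefixes each one with the day.
     */
    static List<String> splitKeys(LocalDate day, List<String> sortedSample, int regions) {
        List<String> splits = new ArrayList<>();
        if (sortedSample.isEmpty()) {
            return splits; // nothing to estimate the distribution from
        }
        String prefix = DAY.format(day);
        for (int i = 1; i < regions; i++) {
            int idx = (int) ((long) i * sortedSample.size() / regions);
            String candidate = prefix + sortedSample.get(idx);
            // Skip repeated boundaries caused by duplicate keys in the sample.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] range = scanRange(LocalDate.of(2011, 3, 1), LocalDate.of(2011, 3, 4));
        System.out.println("scan [" + range[0] + ", " + range[1] + ")");

        List<String> sample = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        System.out.println(splitKeys(LocalDate.of(2011, 3, 1), sample, 4));
    }
}
```

With an exclusive stop row, scanning 20110301 through 20110304 means stopping at 20110305, which is why scanRange adds one day to the last date.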
Hope this clarifies things,
On Mon, Mar 28, 2011 at 2:11 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Sorry I'm a bit confused.
> First your title says using date as the key, yet what you're really doing
> is using date as part of the key.
> Second, you mention that you're adding date as part of the key in a reducer
> What exactly is your use case?
> There are very few use cases where you need a reduce phase when writing to
> There have been a couple of discussions on reducing the potential for hot
> spots. One person chastised me for saying that you really couldn't reduce a
> potential 'hot spot' for time series data. HBase does cache rows, and in our
> testing, when we had enough readers, we saw the cache getting used because
> our test data set wasn't large enough. Once the number of simulated users
> randomly fetching rows reached a certain point, you could see that the
> rows were being returned from the cache and not from a physical I/O fetch.
> When optimizing HBase, you *must* look at a specific use case.
> Here's an example...
> In one system, our only fetch use case for the data was a simple get(). No
> start/stop scans. So we hashed our key to gain even distribution. No hot
> But this doesn't work well when you have start/stop key scans. Or when we
> wanted to fetch records for processing that were orthogonal to our row keys.
> There we had to do full table scans.
> One architect wanted to change schema design because it impacted our batch
> processing. Tried to tell him that the batch processing didn't matter and
> that getting a consistent get() time was more important.
> Adding 15-20 mins to a 2-3 hour batch job doesn't matter when you are
> designing a system that is supposed to deliver data in real time.
> My point is that by looking at the use case, we will be less efficient on
> inserts, but more efficient on fetches where we avoid hot spots.
> > Date: Mon, 28 Mar 2011 09:41:22 +0200
> > Subject: Re: using date as key
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > CC: [EMAIL PROTECTED]
> > Hi,
> > We insert a single day (about 10 million rows), but also support
> > consecutive days.
> > We actually add the date to the key only in the reducer phase (the date
> > comes from the configuration), so our mappers emit the key only.
> > I wonder if using the TotalOrderPartitioner will give us some more
> > improvement. Will test it soon....
> > Lior
> > On Mon, Mar 28, 2011 at 9:04 AM, Cosmin Lehene <[EMAIL PROTECTED]>
> > > Lior,
> > >
> > > If you already know the key distribution you can create all the regions
> > > in advance.
> > > Are you inserting a single day or multiple days?
> > >
> > > 5X is a good improvement. Here are some more hints:
> > >
> > > Hadoop does a sort of the reduce keys before the actual reduce phase.