HBase, mail # user - Re: Get on a row with multiple columns


Re: Get on a row with multiple columns
lars hofhansl 2013-02-09, 07:41
Only somewhat related: I'm seeing the magic 40ms random read time there. Did you disable Nagle's?
(Set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to true in hbase-site.xml.)
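
At the socket level, disabling Nagle's algorithm means turning the TCP_NODELAY option on. A minimal sketch using only the JDK (the class and method names are made up for illustration, and the assumption is that the two tcpnodelay properties end up as this flag on the RPC sockets):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

// Socket-level illustration only; this is not HBase code.
final class NoDelayDemo {

    // Enables TCP_NODELAY on a fresh, unconnected socket and reads the flag back.
    static boolean enableNoDelay() {
        try (Socket socket = new Socket()) {
            socket.setTcpNoDelay(true); // true = Nagle's algorithm is off
            return socket.getTcpNoDelay();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the flag on, small writes go out immediately instead of waiting to be coalesced, which is the usual suspect when latencies cluster around a fixed figure like 40ms.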

________________________________
From: Varun Sharma <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Sent: Friday, February 8, 2013 10:45 PM
Subject: Re: Get on a row with multiple columns

The use case is like your twitter feed. Tweets from people you follow. When
someone unfollows, you need to delete a bunch of his tweets from the
following feed. So, it's frequent, and we are essentially running into some
extreme corner cases like the one above. We need high write throughput for
this, since when someone tweets, we need to fan out the tweet to all the
followers. We need the ability to do fast deletes (unfollow) and fast adds
(follow) and also be able to do fast random gets - when a real user loads
the feed. I doubt we will be able to play much with the schema here since we
need to support a bunch of use cases.

@lars: It does not take 15 seconds to place 300 delete markers. It takes 15
seconds to first find which of those 300 pins are in the set of columns
present - this invokes 300 gets and then places the appropriate delete
markers. Note that we can have tens of thousands of columns in a single row
so a single get is not cheap.
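
A minimal sketch of the cheaper shape of that work, in plain Java with no HBase classes (TargetedDeletePlan and plan are hypothetical names): read the row's columns once, intersect them with the candidates, and place markers only for the columns that exist, reusing the timestamps from that single read. In client code this would presumably map to one Get carrying all the candidate columns, followed by version deletes with explicit timestamps; that mapping is an assumption and has not been verified in this thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the idea; no HBase classes are involved.
final class TargetedDeletePlan {

    // rowColumns models the result of ONE multi-column read of the row:
    //   qualifier -> timestamp of the latest version that is present.
    // candidates is the list of qualifiers we were asked to delete.
    // Returns qualifier -> timestamp for the candidates that actually exist,
    // i.e. the only cells that need a delete marker.
    static Map<String, Long> plan(Map<String, Long> rowColumns, List<String> candidates) {
        Map<String, Long> markers = new LinkedHashMap<>();
        for (String qualifier : candidates) {
            Long latest = rowColumns.get(qualifier);
            if (latest != null) {
                markers.put(qualifier, latest); // present: delete exactly this version
            }
            // absent: no marker at all, so reads have nothing extra to skip
        }
        return markers;
    }
}
```

One caveat with this shape: a version written between the read and the delete would survive, so it only fits where that is acceptable.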

If we were to just place delete markers, that is very fast. But when we
started doing that, our random read performance suffered because of too
many delete markers. The 90th percentile on random reads shot up from 40
milliseconds to 150 milliseconds, which is not acceptable for our use case.

Thanks
Varun

On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Can you organize your columns and then delete by column family?
>
> deleteColumn without specifying a TS is expensive, since HBase first has
> to figure out what the latest TS is.
>
> Should be better in 0.94.1 or later since deletes are batched like Puts
> (still need to retrieve the latest version, though).
>
> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which basically
> lets you specify a scan condition and then places specific delete markers
> for all KVs encountered.
>
>
> If you wanted to get really fancy, you could hook up a coprocessor to
> the compaction process and simply filter all KVs you no longer want
> (without ever placing any delete markers).
>
>
> Are you saying it takes 15 seconds to place 300 version delete markers?!
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, February 8, 2013 10:05 PM
> Subject: Re: Get on a row with multiple columns
>
> We are given a set of 300 columns to delete. I tested two cases:
>
> 1) deleteColumns() - with the 's'
>
> This function simply adds delete markers for 300 columns. In our case,
> typically only a fraction of these columns (10) are actually present. After
> starting to use deleteColumns, we started seeing a drop in cluster-wide
> random read performance - 90th percentile latency worsened, and so did 99th,
> probably because of having to traverse delete markers. I attribute this to
> the profusion of delete markers in the cluster. Major compactions slowed down
> by almost 50 percent, probably because of having to clean out significantly
> more delete markers.
>
> 2) deleteColumn()
>
> Ended up with intolerable 15 second calls, which clogged all the handlers,
> making the cluster pretty much unresponsive.
>
> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > For the 300 column deletes, can you show us how the Delete(s) are
> > constructed ?
> >
> > Do you use this method ?
> >
> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > Thanks
> >
> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> >
> > > So a Get call with multiple columns on a single row should be much faster
> > > than independent Get(s) on each of those columns for that row. I am