HBase, mail # user - [ANN]: HBaseWD: Distribute Sequential Writes in HBase


Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase
Weishung Chung 2011-04-21, 14:39
Awesome, I need to try it out :) Thank you !

On Thu, Apr 21, 2011 at 9:23 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Aha, so you want to "count" it as single scan (or just differently) when
> determining the load?
>
> The current code looks like this:
>
> class DistributedScanner:
>  public static DistributedScanner create(HTable hTable, Scan original,
>      AbstractRowKeyDistributor keyDistributor) throws IOException {
>    byte[][] startKeys = keyDistributor.getAllDistributedKeys(original.getStartRow());
>    byte[][] stopKeys = keyDistributor.getAllDistributedKeys(original.getStopRow());
>    Scan[] scans = new Scan[startKeys.length];
>    for (byte i = 0; i < startKeys.length; i++) {
>      scans[i] = new Scan(original);
>      scans[i].setStartRow(startKeys[i]);
>      scans[i].setStopRow(stopKeys[i]);
>    }
>
>    ResultScanner[] rss = new ResultScanner[startKeys.length];
>    for (byte i = 0; i < scans.length; i++) {
>      rss[i] = hTable.getScanner(scans[i]);
>    }
>
>    return new DistributedScanner(rss);
>  }
>
> This is client code. To make these scans "identifiable" we need to either
> use some different class (derived from Scan) or add some attribute to them.
> There's no API for doing the latter. We can do the former, but I don't
> really like the idea of creating an extra class (with no extra functionality)
> just to distinguish it from the base one.
>
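For readers following the quoted create() code, here is a minimal, self-contained sketch of how one scan's start and stop rows could fan out into one key pair per bucket. It assumes a one-byte bucket prefix scheme; the class name, the allDistributedKeys helper, and its bucketsCount parameter are hypothetical stand-ins, not the HBaseWD implementation, and plain byte arrays replace the HBase classes so the sketch runs on its own.

```java
import java.util.Arrays;

final class BucketPrefixSketch {
    // Hypothetical stand-in for keyDistributor.getAllDistributedKeys(...):
    // returns one copy of the key per bucket, each prefixed with the bucket byte.
    static byte[][] allDistributedKeys(byte[] originalKey, int bucketsCount) {
        if (bucketsCount < 1 || bucketsCount > 256) {
            throw new IllegalArgumentException("bucketsCount must be in 1..256");
        }
        byte[][] keys = new byte[bucketsCount][];
        for (int bucket = 0; bucket < bucketsCount; bucket++) {
            byte[] key = new byte[originalKey.length + 1];
            key[0] = (byte) bucket; // assumed layout: [bucket byte][original key]
            System.arraycopy(originalKey, 0, key, 1, originalKey.length);
            keys[bucket] = key;
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[] startRow = {10, 20};
        byte[] stopRow = {10, 30};
        // One [start, stop) pair per bucket, mirroring the loop in create().
        byte[][] startKeys = allDistributedKeys(startRow, 4);
        byte[][] stopKeys = allDistributedKeys(stopRow, 4);
        for (int i = 0; i < startKeys.length; i++) {
            System.out.println("scan " + i + ": [" + Arrays.toString(startKeys[i])
                    + ", " + Arrays.toString(stopKeys[i]) + ")");
        }
    }
}
```

Under this assumed layout every per-bucket scan covers the same logical range, just shifted into its own bucket's key space, which is why the server sees N ordinary scans.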
> If you can share why/how you want to treat them differently on the server
> side, that would be helpful.
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>
> On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > My request would be to make the distributed scan identifiable from server
> > side.
> > :-)
> >
> > On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > > Basically bucketsCount may not equal number of regions for the
> > > > underlying table.
> > >
> > > True: e.g. when there's only one region that holds data for the whole
> > > table (not many records in the table yet), a distributed scan will fire N
> > > scans against the same region.
> > > On the other hand, when there is a huge number of regions for a single
> > > table, each scan can span multiple regions.
> > >
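The two cases above (one region versus many regions) can be illustrated with a small sketch that counts how many regions a single [startRow, stopRow) scan touches. It is only an illustration under stated assumptions: regions are modelled as a sorted set of start keys, the class and method names are hypothetical, no HBase API is used, and stopRow is assumed to be non-empty and greater than startRow.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;

final class RegionOverlapSketch {
    // Row keys are compared as unsigned bytes, lexicographically.
    static final Comparator<byte[]> ROW_ORDER = Arrays::compareUnsigned;

    // Builds a hypothetical region layout from region start keys.
    // The empty key (start of the first region) is always included.
    static NavigableSet<byte[]> regions(byte[]... startKeys) {
        NavigableSet<byte[]> set = new TreeSet<>(ROW_ORDER);
        set.add(new byte[0]);
        set.addAll(Arrays.asList(startKeys));
        return set;
    }

    // Counts the regions a scan over [startRow, stopRow) touches.
    static int regionsTouched(NavigableSet<byte[]> regionStartKeys,
                              byte[] startRow, byte[] stopRow) {
        byte[] first = regionStartKeys.floor(startRow); // region holding startRow
        return regionStartKeys.subSet(first, true, stopRow, false).size();
    }

    public static void main(String[] args) {
        NavigableSet<byte[]> oneRegion = regions();
        NavigableSet<byte[]> manyRegions = regions(
                new byte[]{1}, new byte[]{2}, new byte[]{2, 50}, new byte[]{3});
        for (byte bucket = 0; bucket < 4; bucket++) {
            byte[] start = {bucket, 10};
            byte[] stop = {bucket, 90};
            System.out.println("bucket " + bucket
                    + ": one-region table -> " + regionsTouched(oneRegion, start, stop)
                    + ", many-region table -> " + regionsTouched(manyRegions, start, stop));
        }
    }
}
```

In the one-region layout all four bucket scans land on the same region; in the many-region layout each lands on a different region, and the bucket 2 scan spans two of them.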
> > > > I need to deal with normal scan and "distributed scan" at server side.
> > >
> > > With current implementation "distributed" scan won't be recognized as
> > > something special on the server side. It will be an ordinary scan. Though
> > > the number of scans will increase, given that the typical situation is "many
> > > regions for single table", the scans of the same "distributed scan" are
> > > likely not to hit the same region.
> > >
> > > Not sure if I answered your questions here. Feel free to ask more ;)
> > >
> > > Alex Baranau
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> > >
> > > On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Alex:
> > > > If you read this, you would know why I asked:
> > > > https://issues.apache.org/jira/browse/HBASE-3679
> > > >
> > > > I need to deal with normal scan and "distributed scan" at server side.
> > > > Basically bucketsCount may not equal number of regions for the underlying
> > > > table.
> > > >
> > > > Cheers
> > > >
> > > > On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > We currently use this tool in the scenario where data is consumed by
> > > > > MapReduce jobs, so we haven't tested the performance of pure "distributed
> > > > > scan" (i.e. N scans instead of 1) a lot. I expect it to be close to simple
> > > > > scan performance, or maybe sometimes even faster depending on your data
> > > > > access patterns. E.g. in case you write timeseries data (sequential) which
> > > > > is written into the single region at a time, then e.g. if you