[ANN]: HBaseWD: Distribute Sequential Writes in HBase (HBase user mailing list)

Alex Baranau 2011-04-19, 17:25
Stack 2011-04-19, 17:29
Ted Yu 2011-04-20, 05:17
Alex Baranau 2011-04-20, 06:11
Ted Yu 2011-04-20, 11:10
Alex Baranau 2011-04-21, 12:45
Ted Yu 2011-04-21, 13:58
Alex Baranau 2011-04-21, 14:23

Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase
Awesome, I need to try it out :) Thank you!

On Thu, Apr 21, 2011 at 9:23 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Aha, so you want to "count" it as a single scan (or just differently) when
> determining the load?
>
> The current code looks like this:
>
> class DistributedScanner:
>  public static DistributedScanner create(HTable hTable, Scan original,
> AbstractRowKeyDistributor keyDistributor) throws IOException {
>    byte[][] startKeys = keyDistributor.getAllDistributedKeys(original.getStartRow());
>    byte[][] stopKeys = keyDistributor.getAllDistributedKeys(original.getStopRow());
>    Scan[] scans = new Scan[startKeys.length];
>    for (byte i = 0; i < startKeys.length; i++) {
>      scans[i] = new Scan(original);
>      scans[i].setStartRow(startKeys[i]);
>      scans[i].setStopRow(stopKeys[i]);
>    }
>
>    ResultScanner[] rss = new ResultScanner[startKeys.length];
>    for (byte i = 0; i < scans.length; i++) {
>      rss[i] = hTable.getScanner(scans[i]);
>    }
>
>    return new DistributedScanner(rss);
>  }
>
> This is client code. To make these scans "identifiable" we need to either
> use some different (derived from Scan) class or add some attribute to them.
> There's no API for doing the latter. We can do the former, but I don't
> really like the idea of creating an extra class (with no extra functionality)
> just to distinguish it from the base one.
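
For concreteness, the "different (derived from Scan) class" option mentioned above could be as small as a marker type. The sketch below is hypothetical (no such class exists in HBaseWD): it adds no behaviour and only lets code that can see the concrete type tell a bucket scan apart from a plain Scan. (As a side note, later HBase versions gained Scan.setAttribute(String, byte[]), which would make the second option viable as well.)

import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;

// Hypothetical marker class: no extra functionality, just a distinct type.
public class DistributedBucketScan extends Scan {
  public DistributedBucketScan(Scan original) throws IOException {
    super(original); // copies start/stop row, families, filters, caching, etc.
  }
}

With such a class, create() above would build scans[i] = new DistributedBucketScan(original) instead of new Scan(original).
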
>
> If you can share why/how you want to treat them differently on the server
> side, that would be helpful.
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>
> On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > My request would be to make the distributed scan identifiable from the
> > server side.
> > :-)
> >
> > On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > > Basically bucketsCount may not equal number of regions for the
> > > > underlying table.
> > >
> > > True: e.g. when there's only one region that holds data for the whole
> > > table (not many records in the table yet), a distributed scan will fire
> > > N scans against the same region. On the other hand, in case there is a
> > > huge number of regions for a single table, each scan can span multiple
> > > regions.
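
For readers unfamiliar with HBaseWD, the mismatch discussed here comes from the salting idea behind the tool: the row key distributor prepends one of N bucket prefixes to each original (sequential) key, so writes spread over N key ranges regardless of how many regions currently exist. A simplified sketch of that idea follows; it is illustrative only (the class name, hashing scheme and method bodies are not the actual HBaseWD code), though the method names mirror the ones used in this thread.

import java.util.Arrays;

// Simplified one-byte-prefix distributor: NOT the real HBaseWD class.
public class SimpleOneBytePrefixDistributor {
  private final int bucketsCount;

  public SimpleOneBytePrefixDistributor(int bucketsCount) {
    this.bucketsCount = bucketsCount;
  }

  // Key used when writing: bucket byte + original key.
  public byte[] getDistributedKey(byte[] originalKey) {
    byte bucket = (byte) ((Arrays.hashCode(originalKey) & 0x7fffffff) % bucketsCount);
    return prefix(bucket, originalKey);
  }

  // Strip the bucket byte to recover the original key.
  public byte[] getOriginalKey(byte[] distributedKey) {
    return Arrays.copyOfRange(distributedKey, 1, distributedKey.length);
  }

  // One candidate key per bucket; this is what the N scans are built from.
  public byte[][] getAllDistributedKeys(byte[] originalKey) {
    byte[][] keys = new byte[bucketsCount][];
    for (int b = 0; b < bucketsCount; b++) {
      keys[b] = prefix((byte) b, originalKey);
    }
    return keys;
  }

  private static byte[] prefix(byte bucket, byte[] key) {
    byte[] result = new byte[key.length + 1];
    result[0] = bucket;
    System.arraycopy(key, 0, result, 1, key.length);
    return result;
  }
}

Because bucketsCount is fixed by the writer, it has no inherent relation to how many regions the table is split into at read time, which is exactly the mismatch pointed out above.
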
> > >
> > > > I need to deal with normal scan and "distributed scan" at server side.
> > >
> > > With the current implementation a "distributed" scan won't be recognized
> > > as something special on the server side. It will be an ordinary scan.
> > > Though the number of scans will increase, given that the typical
> > > situation is "many regions for a single table", the scans of the same
> > > "distributed scan" are likely not to hit the same region.
> > >
> > > Not sure if I answered your questions here. Feel free to ask more ;)
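
To make the client/server split concrete, here is a hedged usage sketch. The package names, the RowKeyDistributorByOneBytePrefix constructor, the table name and key range are assumptions for illustration, and it assumes DistributedScanner implements ResultScanner so it can be iterated and closed like a plain scanner. The server only ever receives the N ordinary Scans built inside create() above.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// HBaseWD classes referenced in the thread; exact package assumed.
import com.sematext.hbase.wd.AbstractRowKeyDistributor;
import com.sematext.hbase.wd.DistributedScanner;
import com.sematext.hbase.wd.RowKeyDistributorByOneBytePrefix;

public class DistributedScanExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical table and key range, just for illustration.
    HTable table = new HTable(HBaseConfiguration.create(), "events");

    // One logical scan over the *original* (unprefixed) key range.
    Scan scan = new Scan(Bytes.toBytes("20110401"), Bytes.toBytes("20110430"));

    // Must be the same distributor (and bucket count) that was used for writes.
    AbstractRowKeyDistributor distributor =
        new RowKeyDistributorByOneBytePrefix((byte) 8);

    // Client-side fan-out: N ordinary scans, merged back into one scanner.
    DistributedScanner scanner = DistributedScanner.create(table, scan, distributor);
    for (Result result : scanner) {
      // process result...
    }
    scanner.close();
    table.close();
  }
}
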
> > >
> > > Alex Baranau
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> > >
> > > On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Alex:
> > > > If you read this, you would know why I asked:
> > > > https://issues.apache.org/jira/browse/HBASE-3679
> > > >
> > > > I need to deal with normal scan and "distributed scan" at server side.
> > > > Basically bucketsCount may not equal number of regions for the
> > > > underlying table.
> > > >
> > > > Cheers
> > > >
> > > > On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > We currently use this tool in the scenario where data is consumed by
> > > > > MapReduce jobs, so we haven't tested the performance of a pure
> > > > > "distributed scan" (i.e. N scans instead of 1) a lot. I expect it to be
> > > > > close to simple scan performance, or maybe sometimes even faster
> > > > > depending on your data access patterns. E.g. in case you write
> > > > > timeseries (sequential) data, which is written into a single region at
> > > > > a time, then e.g. if you
Ted Yu 2011-04-21, 14:57
Alex Baranau 2011-04-21, 15:32