HBase >> mail # user >> HBase parallel scanner performance


Re: HBase parallel scanner performance
So in your step 2 you have the following:
FOREACH row IN TABLE alpha:
     SELECT something
     FROM TABLE alpha
     WHERE alpha.url = row.url

Right?
And you are wondering why you are getting timeouts?
...
...
And how long does it take to do a full table scan? ;-)
(there's more, but that's the first thing you should see...)
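To put rough numbers on it (the row counts here are assumptions; the mail below only says "millions of rows"): a full scan of the table for every outer row multiplies the work. A minimal sketch of the arithmetic:

```java
public class ScanCost {
    // Rows the region servers must read when each of `outerRows` rows
    // from alpha triggers its own full scan of a `tableRows`-row table.
    static long nestedScanRows(long outerRows, long tableRows) {
        return outerRows * tableRows;
    }

    public static void main(String[] args) {
        long outer = 10_000L;     // the "first 10k rows" from the mail below
        long table = 5_000_000L;  // assumed table size; the mail says "millions"
        System.out.println(nestedScanRows(outer, table)); // 50000000000
    }
}
```

Tens of billions of row reads for a 10k-row batch is why the scanners hit their timeouts long before finishing.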

Try creating a second table that inverts the URL/key pair, so that for each URL you have the set of your alpha table's keys.
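The write path for that second table would be a one-time pass over alpha. Here's a sketch against the 0.94-era HBase client API; the "tweets" table and the linkcolfamily:urlvalue column come from the original mail, while the index table's name and schema are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BuildUrlIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable alpha = new HTable(conf, "tweets");    // the existing tweets table
        HTable beta  = new HTable(conf, "url_index"); // assumed name for the index table

        Scan scan = new Scan();
        scan.setCaching(128); // same caching level the mail below found safe
        scan.addColumn(Bytes.toBytes("linkcolfamily"), Bytes.toBytes("urlvalue"));

        ResultScanner rs = alpha.getScanner(scan);
        try {
            for (Result r : rs) {
                byte[] url = r.getValue(Bytes.toBytes("linkcolfamily"),
                                        Bytes.toBytes("urlvalue"));
                if (url == null) continue;
                // Row key = URL; one cell per alpha row key that mentions it,
                // so all keys for a URL land in a single beta row.
                Put p = new Put(url);
                p.add(Bytes.toBytes("keys"), r.getRow(), new byte[0]);
                beta.put(p);
            }
        } finally {
            rs.close();
            alpha.close();
            beta.close();
        }
    }
}
```

(Requires a running cluster, so this is a sketch rather than something you can run standalone.)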

Then you have the following...
FOREACH row IN TABLE alpha:
   FETCH key-set FROM beta
   WHERE beta.rowkey = alpha.url

Note I use FETCH to signify that you should get a single row in response.
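The FETCH above maps to an HBase Get, a point read on beta's row key, instead of another scan. As a self-contained illustration of the access pattern, with a plain Map standing in for the beta table (the names and data are made up):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UrlIndexDemo {
    // Stand-in for the proposed beta table:
    // row key = URL, stored cells = the alpha row keys containing that URL.
    static final Map<String, Set<String>> beta = new HashMap<>();

    // Write path: in HBase this would be a Put against beta.
    static void index(String url, String alphaKey) {
        Set<String> keys = beta.get(url);
        if (keys == null) {
            keys = new TreeSet<String>();
            beta.put(url, keys);
        }
        keys.add(alphaKey);
    }

    // Read path: in HBase this would be a single Get, not a Scan.
    static Set<String> fetch(String url) {
        Set<String> keys = beta.get(url);
        return keys == null ? Collections.<String>emptySet() : keys;
    }

    public static void main(String[] args) {
        index("http://example.com/a", "tweet-0001");
        index("http://example.com/a", "tweet-0042");
        index("http://example.com/b", "tweet-0042");
        System.out.println(fetch("http://example.com/a")); // [tweet-0001, tweet-0042]
    }
}
```

One point read per alpha row replaces one full table scan per alpha row, which is the whole win.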

Does this make sense?
(Your second table is actually an index of the URL column in your first table.)

HTH

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 5:43 AM, Narendra yadala <[EMAIL PROTECTED]> wrote:

> I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32
> GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
> for maintaining our cluster. I have a single tweets table in which we store
> the tweets, one tweet per row (it has millions of rows currently).
>
> Now I run a Java batch job (not a MapReduce job) which does the following:
>
>   1. Open a scanner over the tweet table and read the tweets one after
>   another. I set scanner caching to 128 rows, as higher scanner caching
>   leads to ScannerTimeoutExceptions. I scan over the first 10k rows only.
>   2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
>   in that tweet and open another scanner over the tweets table to see who
>   else shared that link. This involves getting rows having that URL from the
>   entire table (not first 10k rows).
>   3. Do similar stuff as in step 2 for hashtags
>   (hashtagcolfamily:hashtagvalue).
>   4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
>   can be higher (thousands also) later.
>
>
> When I run this batch, I hit the GC issue described here:
> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> Then I tried to turn on the MSLAB feature and changed the GC settings by
> specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
> Even after doing this, I am running into all kinds of IOExceptions
> and SocketTimeoutExceptions.
>
> This Java batch keeps approximately 7*2 (14) scanners open at a point in
> time, and still I am running into all kinds of trouble. I am wondering
> whether I can have thousands of parallel scanners with HBase when I need to
> scale.
>
> It would be great to know whether I can open thousands/millions of scanners
> in parallel with HBase efficiently.
>
> Thanks
> Narendra