

Re: Hbase Scan - number of columns make the query performance way different

Hi there, I don't know the specifics of your environment, but ...

http://hbase.apache.org/book.html#perf.reading
11.8.2. Scan Attribute Selection
... describes paying attention to the number of columns you are returning,
particularly when using HBase as a MapReduce source.  In short, returning
only the columns you need reduces both the data transferred between the
RegionServer (RS) and the client and the number of KeyValues (KVs)
evaluated in the RS.
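
To make the point above concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model and not HBase code: the names `FIELDS`, `kv_size`, and `scan_cost` are hypothetical, the 20-byte value size and 8-byte timestamp are assumptions, and it only counts the KeyValues a scan would hand back to the client. It does not model disk I/O, block caching, or RPC batching.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical layout: 15 qualifiers in one column family, as in the question.
FIELDS = ["userid", "username", "email"] + [f"field{i}" for i in range(4, 16)]


def kv_size(row: str, family: str, qualifier: str, value: str) -> int:
    """Approximate bytes for one KeyValue (assumed: key parts + 8-byte timestamp + value)."""
    return len(row) + len(family) + len(qualifier) + 8 + len(value)


def scan_cost(n_rows: int,
              columns: Optional[Iterable[str]] = None,
              family: str = "cf") -> Tuple[int, int]:
    """Return (KeyValue count, approximate bytes) sent to the client by a scan."""
    wanted = set(columns) if columns is not None else set(FIELDS)
    kvs = 0
    size = 0
    for i in range(n_rows):
        row = f"user{i:07d}"
        for qualifier in FIELDS:
            if qualifier in wanted:
                kvs += 1
                # Every cell repeats the row key, family, and qualifier.
                size += kv_size(row, family, qualifier, "x" * 20)  # assumed 20-byte values
    return kvs, size


full = scan_cost(1000)                            # like: scan 'users'
narrow = scan_cost(1000, ["userid", "username"])  # like: COLUMNS => two qualifiers
print("all columns:", full)
print("two columns:", narrow)
```

Under these assumptions the full scan returns 15 KeyValues per row and the narrow scan returns 2, and because each KeyValue carries its own copy of the row key, family, and qualifier, the byte count grows with every extra column. Real timings depend on much more than this count, so treat the numbers as an illustration only.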
On 9/13/12 10:12 AM, "Shengjie Min" <[EMAIL PROTECTED]> wrote:

>Hi,
>
>I found an interesting difference between two HBase scan queries.
>
>I have an HBase table with a lot of columns in a single column family.
>E.g., say I have a users table; then userid, username, email, etc.,
>15 fields altogether, are all in the single column family.
>
>if you are familiar with RDBMS,
>
>query 1: select * from users
>vs
>query 2: select userid, username from users
>
>In MySQL these two differ: query 2 will obviously be faster, but the
>two queries won't show a huge difference from a performance
>perspective.
>
>In Hbase, I noticed that:
>
>query 3: scan 'users'    // this returns all 15 fields
>vs
>query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']}
>// this returns only two fields: userid, username
>
>Query 3 takes far longer than query 4 on a big data set. In my test I
>have around 1,000,000 user records, and I see roughly 100 secs for
>query 3 vs. a few secs for query 4.
>
>
>Can anybody explain to me why the width of the result set in HBase
>impacts performance that much?
>
>
>Shengjie Min