Re: Hbase Scan - number of columns make the query performance way different

Hi there, I don't know the specifics of your environment, but ...

http://hbase.apache.org/book.html#perf.reading
11.8.2. Scan Attribute Selection
... describes paying attention to the number of columns you are returning,
particularly when using HBase as a MR source.  In short, returning only
the columns you need means you are reducing the data transferred between
the RS and the client and the number of KVs evaluated in the RS, etc.
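
To make the transfer cost concrete, here is a rough toy model in Python. It is not the HBase client API. It only assumes that every cell travels as its own KeyValue repeating the row key, family, and qualifier, plus a timestamp and a type byte. The names Cell, make_qualifiers, make_table, and scan_transfer, the field4..field15 qualifiers, and the 20-byte values are all made up for illustration.

```python
# Toy model only: this does NOT use or imitate the real HBase client API.
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class Cell(NamedTuple):
    """One stored cell; each cell carries its full coordinates."""
    row: bytes
    family: bytes
    qualifier: bytes
    value: bytes

    def wire_size(self) -> int:
        # Simplified size: coordinates + 8-byte timestamp + 1-byte type + value.
        # The real KeyValue layout also has length prefixes, ignored here.
        return (len(self.row) + len(self.family) + len(self.qualifier)
                + 8 + 1 + len(self.value))


def make_qualifiers() -> List[bytes]:
    # userid, username, email come from the thread; the rest are placeholders.
    return [b"userid", b"username", b"email"] + [
        b"field%d" % i for i in range(4, 16)
    ]


def make_table(num_rows: int, qualifiers: List[bytes]) -> List[Cell]:
    # One cell per (row, qualifier) pair, all in a single family "cf".
    return [
        Cell(b"user%07d" % r, b"cf", q, b"x" * 20)
        for r in range(num_rows)
        for q in qualifiers
    ]


def scan_transfer(table: Iterable[Cell],
                  columns: Optional[Set[bytes]] = None) -> Tuple[int, int]:
    """Return (cells, bytes) that would be sent back to the client.

    columns=None models a full scan; a set of qualifiers models
    COLUMNS => [...] in the shell.
    """
    cells = 0
    size = 0
    for cell in table:
        if columns is not None and cell.qualifier not in columns:
            continue  # filtered out on the server side, never transferred
        cells += 1
        size += cell.wire_size()
    return cells, size


if __name__ == "__main__":
    table = make_table(1000, make_qualifiers())
    print("all columns:", scan_transfer(table))
    print("two columns:", scan_transfer(table, {b"userid", b"username"}))
```

With 15 qualifiers per row, the full scan returns 7.5 times as many cells as the two-column scan, and each of those cells repeats the row key. The model covers only the RS-to-client transfer; it says nothing about disk reads, block cache, or scanner caching, which also affect real timings.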

On 9/13/12 10:12 AM, "Shengjie Min" <[EMAIL PROTECTED]> wrote:

>Hi,
>
>I found an interesting difference between two HBase scan queries.
>
>I have an HBase table which has a lot of columns in a single column family.
>E.g., let's say I have a users table; then userid, username, email, etc.,
>15 fields altogether, are all in the single column family.
>
>if you are familiar with RDBMS,
>
>query 1: select * from users
>vs
>query 2: select userid, username from users
>
>In MySQL, these two differ: query 2 will obviously be faster, but the two
>queries won't show a huge difference from a performance perspective.
>
>In HBase, I noticed that:
>
>query 3: scan 'users'   # returns all 15 fields
>vs
>query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']}   # returns only userid and username
>
>Query 3 takes far longer than query 4 on a big data set. In my test, I
>have around 1,000,000 user records, and the timings are: query 3, 100 secs;
>query 4, a few secs.
>
>
>Can anybody explain to me why the width of the result set in HBase can
>impact the performance that much?
>
>
>Shengjie Min