HBase >> mail # user >> Hbase Scan - number of columns make the query performance way different


Shengjie Min 2012-09-13, 14:12
Doug Meil 2012-09-13, 14:29
Shengjie Min 2012-09-13, 14:35
Re: Hbase Scan - number of columns make the query performance way different
Not sure of your schema...

Each column family is stored in its own set of StoreFiles. A scan of
everything reads all of these files, whereas your second scan reads only the
StoreFiles for column family cf (this makes a difference if you have multiple
column families). Additionally, pushing a large amount of data from the
region servers to wherever you're running the shell will slow things down.
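
To put rough numbers on the data-volume point, here is a minimal Python sketch that estimates how many bytes a scan ships back when every row carries 15 columns versus 2. The 20-byte fixed overhead per cell follows the HBase KeyValue layout; the row-key length, value length, and the `field00`-style qualifier names are hypothetical stand-ins, not the poster's real schema, and `keyvalue_size` and `scan_bytes` are helper names made up for this illustration.

```python
# Rough model of how much cell data an HBase scan returns to the client.
# Only the fixed per-cell overhead is taken from the KeyValue layout;
# every other size here is a hypothetical assumption.

# 4 (key length) + 4 (value length) + 2 (row length)
# + 1 (family length) + 8 (timestamp) + 1 (key type)
KEYVALUE_FIXED_OVERHEAD = 20


def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate serialized size in bytes of one cell (KeyValue)."""
    return (KEYVALUE_FIXED_OVERHEAD + len(row) + len(family)
            + len(qualifier) + len(value))


def scan_bytes(num_rows, qualifiers, row_key_len=16, family=b"cf", value_len=50):
    """Estimate bytes returned when each row carries the given qualifiers."""
    row = b"x" * row_key_len      # assumed 16-byte row key
    value = b"v" * value_len      # assumed 50-byte value per cell
    per_row = sum(keyvalue_size(row, family, q.encode(), value)
                  for q in qualifiers)
    return num_rows * per_row


# Stand-ins for the 15 user fields; the real qualifier names will differ.
all_columns = [f"field{i:02d}" for i in range(15)]
two_columns = all_columns[:2]

full = scan_bytes(1_000_000, all_columns)
narrow = scan_bytes(1_000_000, two_columns)

print(f"all 15 columns: {full / 1e6:.0f} MB")
print(f"2 columns:      {narrow / 1e6:.0f} MB")
print(f"ratio:          {full / narrow:.1f}x")
```

Note that the row key and column family name are repeated in every cell, so wide rows cost more than the values alone suggest. This model covers only the bytes shipped; it says nothing about per-cell work on the region server or the time the shell spends printing each cell, so treat it as a rough guide rather than an explanation of the measured timings.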

It is difficult to respond to this unless you reveal your entire data
structure and nature as well as your deployment scenario.

Jacques

On Thu, Sep 13, 2012 at 7:35 AM, Shengjie Min <[EMAIL PROTECTED]> wrote:

> In my case, I am not feeding HBase results to MapReduce; it's just a pure
> HBase scan, and returning all columns vs. two columns makes a huge
> difference.
>
> On 13 September 2012 15:29, Doug Meil <[EMAIL PROTECTED]>
> wrote:
>
> >
> > Hi there, I don't know the specifics of your environment, but ...
> >
> > http://hbase.apache.org/book.html#perf.reading
> > 11.8.2. Scan Attribute Selection
> >
> >
> > ... describes paying attention to the number of columns you are returning,
> > particularly when using HBase as a MR source.  In short, returning only
> > the columns you need means you are reducing the data transferred between
> > the RS and the client and the number of KV's evaluated in the RS, etc.
> >
> >
> >
> >
> > On 9/13/12 10:12 AM, "Shengjie Min" <[EMAIL PROTECTED]> wrote:
> >
> > >Hi,
> > >
> > >I found an interesting difference between HBase scan queries.
> > >
> > >I have an HBase table with a lot of columns in a single column family,
> > >e.g. a users table where userid, username, email .... etc etc, 15 fields
> > >altogether, are in the single column family.
> > >
> > >if you are familiar with RDBMS,
> > >
> > >query 1: select * from users
> > >vs
> > >query 2: select userid, username from users
> > >
> > >In MySQL these two differ: query 2 will obviously be faster, but the two
> > >queries won't differ hugely in performance.
> > >
> > >In HBase, I noticed that:
> > >
> > >query 3: scan 'users'    // this basically returns all 15 fields
> > >vs
> > >query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']}    // this
> > >returns only two fields: userid, username
> > >
> > >Query 3 takes far longer than query 4 on a big data set. In my test, with
> > >around 1,000,000 user records, query 3 takes roughly 100 secs vs. a few
> > >secs for query 4.
> > >
> > >
> > >Can anybody explain to me why the width of the result set in HBase can
> > >impact performance that much?
> > >
> > >
> > >Shengjie Min
> >
> >
> >
>
>
> --
> All the best,
> Shengjie Min
>
Alex Baranau 2012-09-17, 17:21