HBase user list: larger HFile block size for very wide row?


Re: larger HFile block size for very wide row?
Hi Ted and Vladimir, thanks!

I was wondering whether using an index is a good idea. My scan/get criterion
is something like "get all rows I inserted since the end of yesterday". I may
have to use MapReduce plus a timeRange filter.
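
A minimal sketch of that kind of time-range scan, assuming the HBase client
API of that era; the table name and the "end of yesterday" arithmetic are
placeholders, and note that setTimeRange prunes by cell timestamp on the
server side but is not an index, so the scan still walks the key range:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class SinceYesterdayScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");        // hypothetical table
    // "End of yesterday" = most recent UTC midnight (placeholder arithmetic).
    long now = System.currentTimeMillis();
    long endOfYesterday = now - (now % 86400000L);
    Scan scan = new Scan();
    scan.setTimeRange(endOfYesterday, Long.MAX_VALUE); // [min, max) on cell timestamps
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process each matching row here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}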

Lars and all, I will try to report back some performance data later.
Thanks to you all for the help.

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From:   Ted Yu <[EMAIL PROTECTED]>
To:     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>,
Date:   01/29/2014 04:37 PM
Subject:        Re: larger HFile block size for very wide row?

bq. table:family2 holds only row keys (no data) from table:family1.

Wei:
You can designate family2 as an essential column family so that family1 is
brought into the heap only when needed.
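
A hedged sketch of what that could look like, assuming the lazy
column-family loading from HBASE-5416 (Scan.setLoadColumnFamiliesOnDemand
plus Filter.isFamilyEssential); the filter class is hypothetical, and the
serialization hooks and region-server deployment a custom filter needs are
omitted:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical filter: marks only family2 as essential, so family1
// blocks are loaded lazily, only for rows that pass the filter.
public class Family2EssentialFilter extends FilterBase {
  @Override
  public boolean isFamilyEssential(byte[] name) {
    return Bytes.equals(name, Bytes.toBytes("family2"));
  }

  public static Scan buildScan() {
    Scan scan = new Scan();
    scan.setFilter(new Family2EssentialFilter());
    scan.setLoadColumnFamiliesOnDemand(true);  // enable lazy CF loading
    return scan;
  }
}
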
On Wed, Jan 29, 2014 at 1:33 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:

> Yes, your row will be split by KV boundaries - no need to increase the
> default block size, except, perhaps, for performance.
> You will need to try different sizes to find the optimal performance in
> your use case.
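
For reference, a sketch of setting a non-default block size on the wide
family at table creation, assuming the 0.96-era admin API; the 256K figure
and all names are illustrative, not a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateWideRowTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
    HColumnDescriptor wide = new HColumnDescriptor("family1");
    wide.setBlocksize(256 * 1024);  // e.g. 256K instead of the default 64K
    desc.addFamily(wide);
    admin.createTable(desc);
    admin.close();
  }
}
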
> I would not use a combination of scan & get on the same table:family with
> very large rows.
> Either some kind of secondary indexing is needed, or do the scan on a
> different family (which has the same row keys):
>
> table:family1 holds the original data
> table:family2 holds only the row keys (no data) from table:family1.
> Your scan will be MUCH faster in this case.
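
A minimal sketch of the scan-then-get pattern described above, assuming the
standard HBase client API; the table handle, family names, and key bounds
are hypothetical:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanThenGet {
  // Scan the narrow key-only family for matching rows, then fetch the
  // wide family row by row.
  static void scanThenGet(HTable table, byte[] startKey, byte[] stopKey)
      throws IOException {
    Scan keyScan = new Scan(startKey, stopKey);
    keyScan.addFamily(Bytes.toBytes("family2"));    // narrow, row keys only
    ResultScanner scanner = table.getScanner(keyScan);
    try {
      for (Result r : scanner) {
        Get get = new Get(r.getRow());
        get.addFamily(Bytes.toBytes("family1"));    // wide data family
        Result full = table.get(get);
        // process the full row here
      }
    } finally {
      scanner.close();
    }
  }
}
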
>
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [EMAIL PROTECTED]
>
> ________________________________________
> From: Wei Tan [[EMAIL PROTECTED]]
> Sent: Wednesday, January 29, 2014 12:52 PM
> To: [EMAIL PROTECTED]
> Subject: Re: larger HFile block size for very wide row?
>
> Sorry, 1000 columns, each 2K, so each row is 2M. I guess HBase will keep a
> single KV (i.e., a column rather than a row) in a block, so a row will
> span multiple blocks?
>
> My scan pattern is: I will do a range scan, find the matching row keys,
> and fetch the whole row for each row that matches my criteria.
>
> Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>
>
>
> From:   lars hofhansl <[EMAIL PROTECTED]>
> To:     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>,
> Date:   01/29/2014 03:49 PM
> Subject:        Re: larger HFile block size for very wide row?
>
>
>
> You have 1000 columns? Not 1000k = 1M columns, I assume.
> So you'll have 2MB KVs. That's a bit on the large side.
>
> HBase will "grow" the block to fit the KV into it. It means you have
> basically one block per KV.
> I guess you address these rows via point gets (GET), and do not typically
> scan through them, right?
>
> Do you see any performance issues?
>
> -- Lars
>
>
>
> ________________________________
>  From: Wei Tan <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, January 29, 2014 12:35 PM
> Subject: larger HFile block size for very wide row?
>
>
> Hi, I have an HBase table where each row has ~1000k columns, ~2K each. My
> table scan pattern is to use a row key filter, but I need to fetch the
> whole row (~1000k columns) back.
>
> Shall I set the HFile block size to be larger than the default 64K?
> Thanks,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>