Stuti Awasthi 2011-11-16, 04:39
Re: Query on analyze big data with Hbase
Cosmin Lehene 2011-11-16, 10:22
You should consider looking over the available HBase resources
There's an online book http://hbase.apache.org/book.html
And there's Lars George's book from O'Reilly
On 11/16/11 6:39 AM, "Stuti Awasthi" <[EMAIL PROTECTED]> wrote:
>I have a scenario in which my Hbase tables will be fed with data size
>more than 250GB every day. I have to do analysis on that data using MR
>jobs and save the output in Hbase table itself.
>1. My concern is will Hbase be able to handle such data as it is
>built to handle big data?
Yes, it should be able to handle this amount of data. However, you need to
determine the number of simultaneous requests and the size of each request
so you can determine the minimum number of region servers and their
configuration. You could do some testing on a small cluster once you decide
on the hardware you're going to use.
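As a starting point for that sizing exercise, here is a back-of-the-envelope sketch. All of the concrete figures (retention period, disk per node, fill target) are assumptions for illustration, not recommendations; it only estimates the nodes needed for raw capacity, before request load is considered.

```python
import math

def min_region_servers(daily_ingest_gb, retention_days, replication,
                       usable_disk_tb_per_node, fill_target=0.7):
    """Minimum nodes needed to hold the data (capacity only).

    fill_target < 1.0 leaves headroom for compactions and growth.
    """
    total_gb = daily_ingest_gb * retention_days * replication
    usable_gb_per_node = usable_disk_tb_per_node * 1024 * fill_target
    # Round up: a partially filled node still counts as a whole node.
    return math.ceil(total_gb / usable_gb_per_node)

# 250 GB/day (from the question), 30-day retention and 12 TB/node assumed:
print(min_region_servers(250, 30, 3, 12))  # 3
```

Request load (random reads, MR scan throughput) will usually push the real number higher than this floor.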
>2. What hardware /hbase configuration points I must keep in mind to
>create a cluster for such requirement.?
It depends on the data access patterns: e.g. running a map-reduce job
incrementally on the new data, or having the data available for lots of
random reads. It also depends on the desired duration of the map-reduce job
or the average latency you want for the random reads.
Generally, depending on what you need, you'll have to tune a core x spindle
x RAM formula.
If you have too few disks you'll end up with an IO bottleneck; if you
add too many you'll either saturate the CPU or the NIC and have some disks
sitting idle.
I'm not sure a golden rule is what you should be relying on, but 1 core
x 1 spindle x 4 GB RAM is common, so you can use this as a baseline and
adjust from there.
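The baseline ratio above can be written down as a tiny helper (a sketch of the rule of thumb from this thread, not a hardware specification):

```python
def baseline_node(cores, gb_ram_per_core=4):
    """Derive spindle and RAM figures from the 1 core x 1 spindle x 4 GB ratio."""
    return {"cores": cores,
            "spindles": cores,               # 1 spindle per core
            "ram_gb": cores * gb_ram_per_core}

print(baseline_node(8))  # {'cores': 8, 'spindles': 8, 'ram_gb': 32}
```

From there you would skew the ratio toward disks for scan-heavy workloads or toward RAM for random-read-heavy ones.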
Optimizing the map-reduce code will generally change things dramatically.
You need to take bandwidth utilization into account as well: all data
written through the HBase API will (unless you disable the WAL) first go to
a Write-Ahead Log (WAL) in HDFS that is replicated on 3 machines, and is
also kept in the HRegionServer cache (RAM) - the memstores are flushed to
HDFS as well (3 replicas).
One of the replicas will always be on the local machine (given that you
run the DataNode and HRegionServer on the same machines), but the other two
will go out to different machines.
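Putting numbers on that write amplification for the 250 GB/day figure from the question (assuming the WAL stays enabled and HDFS replication of 3, with one local replica per write):

```python
def daily_network_gb(ingest_gb, hdfs_replication=3):
    """Cross-node traffic generated by writes: WAL copies plus flush copies."""
    remote_copies_per_write = hdfs_replication - 1  # the local replica is free
    wal_traffic = ingest_gb * remote_copies_per_write
    flush_traffic = ingest_gb * remote_copies_per_write
    return wal_traffic + flush_traffic

gb = daily_network_gb(250)
print(gb)                           # 1000 GB of cross-node traffic per day
print(round(gb * 1024 / 86400, 1))  # ~11.9 MB/s averaged over the day
```

The average looks modest, but compactions and read traffic come on top, and ingest is rarely spread evenly over 24 hours.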
>a. How many region server?
This depends a lot on the data access pattern and on the hardware that the
HRegionServer runs on (how much RAM, how many cores, how many spindles).
Normally, if you don't access the old data much, then it's OK to keep it on
fewer region servers with more space, as it won't take up resources.
>b. How many regions per region server ?
There are some points on this in the books. By default the region size is
256MB and it's configurable to larger sizes. Facebook has some interesting
points on this as well.
There's a balance between avoiding region splits (if a region grows larger
than the defined size it will be split in two) and having a good data
distribution across the cluster (e.g. if you have one huge region and all
the writes go to it, you'll end up using a single region server for all
writes) - so you need to choose a good key distribution.
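One common way to get such a key distribution is salting - prefixing the row key with a hash-derived bucket so that sequential keys (e.g. timestamps) spread across regions. This is a general technique, not something prescribed in this thread, and the bucket count below is an assumption:

```python
import hashlib

NUM_BUCKETS = 16  # roughly match the number of region servers (assumption)

def salted_key(row_key: str) -> str:
    """Prefix the key with a deterministic bucket, e.g. '07-20111116-event-001'."""
    bucket = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}-{row_key}"

# Sequential keys no longer all land in the same region's key range:
print(salted_key("20111116-event-001"))
print(salted_key("20111116-event-002"))
```

The trade-off is that range scans over the original key order now require one scan per bucket.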
>3. My schema is such that in one table with one cf , there will be
>millions of column qualifier. What can be the consequences of such design.
It means that you need to make sure you're not exceeding the region size
with a single row.
You also have to consider that getting an entire row will be an expensive
operation. You should look at the batching options for Scans
(incrementally retrieving batches of columns from a row).
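To see why batching matters with millions of qualifiers, here is plain Python mimicking the idea: instead of materializing an entire wide row, pull its columns in fixed-size chunks. This is a sketch of the concept, not the HBase API itself:

```python
def batched(columns, batch_size):
    """Yield the columns of one wide row in chunks of batch_size."""
    for i in range(0, len(columns), batch_size):
        yield columns[i:i + batch_size]

row = [f"cf:q{i}" for i in range(10)]   # stand-in for a row with many qualifiers
for chunk in batched(row, 4):
    print(chunk)   # 3 chunks: 4, 4, then 2 columns
```

With batching, client memory is bounded by the batch size rather than by the row width.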
Again, testing is key :)
You could consider writing the MR job output to HFiles that you can bulk
load into HBase afterwards, instead of going through the HBase API -
especially if the resulting data is large.
Stuti Awasthi 2011-11-17, 05:54