Keeping a separate Hadoop cluster dedicated to analysis is the better way to
go; the HBase cluster then does nothing but collect data. You can use distcp
to copy data between the two clusters, which is faster than scanning through
HBase, but your Hadoop tasks then have to parse the HFile format to read the
data. That can be done but needs some coding; I'm wondering if there is
already some code you can reuse to parse HFiles.
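FWIW, HBase itself ships the HFile reader classes, so you may not have to
write the parsing from scratch. A minimal sketch (untested; it assumes the
0.92/0.94-era org.apache.hadoop.hbase.io.hfile API, and the input path is
just an example):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.io.hfile.CacheConfig;
  import org.apache.hadoop.hbase.io.hfile.HFile;
  import org.apache.hadoop.hbase.io.hfile.HFileScanner;

  public class HFileDump {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path hfile = new Path(args[0]);  // e.g. an HFile copied over with distcp

      HFile.Reader reader = HFile.createReader(fs, hfile, new CacheConfig(conf));
      reader.loadFileInfo();
      HFileScanner scanner = reader.getScanner(false, false); // no block cache, no pread
      if (scanner.seekTo()) {          // position on the first KeyValue
        do {
          System.out.println(scanner.getKeyValue()); // row/family/qualifier/ts
        } while (scanner.next());
      }
      reader.close();
    }
  }

There is also a built-in command-line printer, if I remember right:
hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f <hfile>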
On Wed, Nov 13, 2013 at 3:11 PM, Vincent Barat <[EMAIL PROTECTED]> wrote:
> We have done this kind of thing using HBase 0.92.1 + Pig, but we finally
> had to limit the size of the tables and move the biggest data to HDFS:
> loading data directly from HBase is much slower than from HDFS, and doing
> it with M/R overloads the HBase region servers, since several map jobs
> scan table regions at the same time. So the bigger your tables are, the
> higher the load is (usually Pig creates one map per region; I don't know
> about Hive). This may not be an issue if your HBase cluster is dedicated
> to this kind of job, but if you also have to ensure good random read
> latency at the same time, forget it.
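> If you do run M/R against the table anyway, the knobs that helped us most
> are on the Scan itself; a fragment (untested, against the 0.92 client API):
>
>   Scan scan = new Scan();
>   scan.setCaching(500);        // rows fetched per RPC; the 0.92 default is 1
>   scan.setCacheBlocks(false);  // a full scan shouldn't churn the block cache
>
> Pig's HBaseStorage has a -caching option for the same setting, if I
> remember correctly.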
> On 11/11/2013 13:10, JC wrote:
>> We are looking to use HBase as a transformation engine. In other words,
>> with data already loaded into HBase, run some large calculation/aggregation
>> on that data and then load it back into an RDBMS for our BI analytics
>> tools to use. I was curious what the community's experience with this is
>> and whether there are some best practices. Some thoughts we are kicking
>> around are MapReduce 2 and YARN, writing files to HDFS to be loaded into
>> the RDBMS, but we are not sure what all the pieces needed for the complete
>> application are.
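>> Roughly, the shape we picture is a plain M/R job over the table that
>> writes text files to HDFS for a later bulk load into the RDBMS. A sketch
>> (untested; the table name, AggregateMapper and AggregateReducer are made
>> up for illustration):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.client.Scan;
>>   import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>>   import org.apache.hadoop.io.LongWritable;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.Job;
>>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>
>>   public class HBaseToRdbms {
>>     public static void main(String[] args) throws Exception {
>>       Configuration conf = HBaseConfiguration.create();
>>       Job job = new Job(conf, "hbase-to-rdbms");
>>       job.setJarByClass(HBaseToRdbms.class);
>>
>>       Scan scan = new Scan();                      // full-table scan source
>>       TableMapReduceUtil.initTableMapperJob(
>>           "events", scan, AggregateMapper.class,   // hypothetical mapper
>>           Text.class, LongWritable.class, job);
>>
>>       job.setReducerClass(AggregateReducer.class); // hypothetical reducer;
>>       // the default TextOutputFormat writes key<TAB>value lines
>>       FileOutputFormat.setOutputPath(job, new Path("/staging/agg"));
>>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>>     }
>>   }
>>
>> and then something like a Sqoop export to push /staging/agg into the
>> RDBMS.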
>> Thanks in advance for your help,