Re: Bulk Load question.
> Is there a way to split this
> across regions in the beginning?

Since you didn't mention your HBase version, I'm assuming you are
using 0.90.1 or later.
If so, yes, there is a way to pre-split the regions. See this:
http://hbase.apache.org/book/important_configurations.html#d0e1975
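For example, with the 0.90-era Java client you can pass explicit split keys
at table-creation time. A rough sketch (the table name, column family, and
split points below are just placeholders; pick split keys that match your
actual row-key distribution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Placeholder table name and column family.
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("d"));

    // Creating the table with explicit split keys gives you
    // (splits.length + 1) regions up front, so writes spread
    // across region servers from the start instead of all
    // hitting the single initial region.
    byte[][] splits = new byte[][] {
      Bytes.toBytes("row-250000"),
      Bytes.toBytes("row-500000"),
      Bytes.toBytes("row-750000"),
    };
    admin.createTable(desc, splits);
  }
}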

Also - as Harsh mentioned, the bulkload tool might be even better, so
take a look at that as well:
http://hbase.apache.org/bulk-loads.html
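The usual flow there is two steps: first generate HFiles (importtsv can write
HFiles instead of doing puts when you give it the importtsv.bulk.output
option), then hand those files to the cluster, which adopts them into the
regions and bypasses the normal write path entirely. A minimal driver sketch
for the second step, assuming the HFiles were already written to /tmp/bulkout
(a placeholder path) and the target table already exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Placeholder path: the directory of HFiles written earlier,
    // e.g. by importtsv with -Dimporttsv.bulk.output=/tmp/bulkout.
    Path hfiles = new Path("/tmp/bulkout");

    // The target table must already exist (ideally pre-split).
    HTable table = new HTable(conf, "mytable");

    // Moves the HFiles directly into the table's regions,
    // bypassing the normal write path (WAL + memstore).
    new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
  }
}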

--Suraj

On Sat, Mar 19, 2011 at 8:48 AM, Vivek Krishna <[EMAIL PROTECTED]> wrote:
> I have around 20 GB of data to be dumped into an HBase table.
>
> Initially, I had a simple Java program to put the values in batches of
> 5000-10000 records.  I tried concurrent inserts, and each insert took about
> 15 seconds to write, which is very slow and was taking ages.
>
> My next approach was to use importtsv. This started off with a set of map
> tasks, but after a few minutes I started getting RetriesException and it
> errored out a while later.
>
> In both experiments, I noticed that the master node was handling all the
> traffic.  I understand that initially it dumps data into one node and then
> splits across multiple nodes as data comes in.  Is there a way to split this
> across regions in the beginning?
>
> Or any other thoughts on how to handle inserts of large amounts of data?
> Viv
>