Re: Optimizing bulk load performance
Hi Harry,
I'm currently working on a MapReduce job that also does an incremental bulk
load using HFileOutputFormat, and I see similar performance in the reduce
phase. I believe this is the reason: the KeyValues have to be sorted before
being written to the HFile. The job uses a
TotalOrderPartitioner<http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>
to route each region's keys to a single reducer, and the framework then sorts
each reducer's input; depending on how much data there is to sort and how much
memory is allocated, that sort can become a performance bottleneck. The number
of reducers equals the number of regions, and that cannot be overridden in the
job config. I suspect this is related to your issue.
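
Roughly, the job setup looks like this. This is only a minimal sketch,
assuming the Hadoop 1.x / HBase 0.94-era API; the mapper, the single
family/qualifier "f"/"q", and the table name "mytable" are placeholders, not
your actual job. configureIncrementalLoad() is the piece that installs the
TotalOrderPartitioner and pins the reducer count to the region count:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {

  // Placeholder mapper: parses "rowkey<TAB>value" lines into KeyValues
  // for a single family "f" and qualifier "q".
  public static class KeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("f"), Bytes.toBytes("q"),
          Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-bulk-load");
    job.setJarByClass(BulkLoadJob.class);

    job.setMapperClass(KeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The important part: sets the HFile output format, installs the
    // TotalOrderPartitioner with one partition per region of the table, and
    // fixes the number of reducers to the table's region count.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the finished HFiles into the running table.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}

Because the partition boundaries come from the table's region start keys, a
table with 2500 regions gives 2500 reduce tasks no matter what is set in the
job configuration.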

Hope this helps.

On Wed, Oct 23, 2013 at 7:57 AM, Harry Waye <[EMAIL PROTECTED]> wrote:

> I'm trying to load data into hbase using HFileOutputFormat and incremental
> bulk load but am getting rather lackluster performance, 10h for ~0.5TB
> data, ~50000 blocks.  This is being loaded into a table that has 2
> families, 9 columns, 2500 regions and is ~10TB in size.  Keys are md5
> hashes and regions are pretty evenly spread.  The majority of time appears
> to be spent in the reduce phase, with the map phase completing very
> quickly.  The network doesn't appear to be saturated, but the load is
> consistently at 6, which is the number of reduce tasks per node.
>
> 12 hosts (6 cores, 2 disks as RAID0, 1Gb Ethernet, no one else on the rack).
>
> MR conf: 6 mappers, 6 reducers per node.
>
> I spoke to someone on IRC and they recommended reducing the job output
> replication to 1 and reducing the number of mappers, which I dropped to 2.
> Reducing replication appeared not to make any difference, and reducing
> reducers appeared just to slow the job down.  I'm going to have a look at
> running the benchmarks mentioned on Michael Noll's blog and see what that
> turns up.  I guess some questions I have are:
>
> How does the global number/size of blocks affect perf.?  (I have a lot of
> 10MB files, which are the input files.)
>
> How does the job local number/size of input blocks affect perf.?
>
> What is actually happening in the reduce phase that requires so much CPU?
>  I assume the actual construction of HFiles isn't intensive.
>
> Ultimately, how can I improve performance?
> Thanks
>
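
On the tuning mentioned above: dfs.replication is a client-side setting, so
lowering it in the job configuration does apply to the HFiles the reducers
write, but the per-node slot counts (mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum) live in mapred-site.xml on the
TaskTrackers and cannot be changed from the job config. A small sketch of the
job-side knobs, assuming Hadoop 1.x / MRv1 property names, applied to the same
Job object as in the sketch above (the io.sort.factor value is only an
example):

// Set before the job is submitted.
Configuration conf = job.getConfiguration();

// Write the job's output HFiles with a single replica. The loaded files may
// keep this replication after the bulk load until compaction rewrites them.
conf.setInt("dfs.replication", 1);

// Merge more spill segments per pass during the map- and reduce-side sorts
// (the default is 10), which can help when there is a lot of data to sort.
conf.setInt("io.sort.factor", 50);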

--
Regards,
Premal Shah.