HBase user mailing list: Optimizing bulk load performance


Re: Optimizing bulk load performance
Can you try vmstat 2? The 2 is the interval in seconds at which it will
display the disk usage. In the extract here, almost nothing is running: only
8% is used (1% disk IO, 6% user, 1% sys).

Run it on 2 or 3 different nodes while you are putting the load on the
cluster, take a look at the last 4 numbers, and see what the value of the
last one (the wa column) is.
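
Something along these lines (a rough sketch, assuming a stock Linux vmstat;
only the last four columns matter here):

    vmstat 2
    # the last four columns of each line are the CPU breakdown:
    #   us = user CPU, sy = system CPU, id = idle, wa = waiting on disk I/O
    # if wa stays high while the load job is running, the disks are the
    # bottleneck rather than the CPU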

On the usercpu0 graph, which node is the gray line that is showing high?

JM

2013/10/24 Harry Waye <[EMAIL PROTECTED]>

> Ok, I'm running a load job atm. I've added some possibly incomprehensible
> coloured lines to the graph: http://goo.gl/cUGCGG
>
> This is actually with one fewer node, due to decommissioning to replace a
> disk, which I guess is the reason one squiggly line shows no disk
> activity.  I've included only the cpu stats for CPU0 from each node.  The
> last graph should read "Memory Used".  vmstat from one of the nodes:
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  6  0      0 392448 524668 43823900    0    0   501  1044    0    0  6  1 91  1
>
> To me the wait doesn't seem that high.  Job stats are
> http://goo.gl/ZYdUKp,  the job setup is
> https://gist.github.com/hazzadous/ac57a384f2ab685f07f6
>
> Does anything jump out at you?
>
> Cheers
> H
>
>
> On 24 October 2013 16:16, Harry Waye <[EMAIL PROTECTED]> wrote:
>
> > Hi JM
> >
> > I took a snapshot on the initial run, before the changes:
> >
> https://www.evernote.com/shard/s95/sh/b8e1516d-7c49-43f0-8b5f-d16bbdd3fe13/00d7c6cd6dd9fba92d6f00f90fb54fc1/res/4f0e20a2-1ecb-4085-8bc8-b3263c23afb5/screenshot.png
> >
> > Good timing: disks appear to be exploding (ATA errors) atm, thus I'm
> > decommissioning and reprovisioning with new disks.  I'll be reprovisioning
> > without RAID (it's software RAID, just to compound the issue), although I'm
> > not sure how I'll go about migrating all nodes.  I guess I'd need to put
> > more correctly specced nodes in the rack and decommission the existing.
> >  Makes diff. to
> >
> > We're using Hetzner at the moment, which may not have been a good choice.
> >  Has anyone had any experience with them wrt. Hadoop?  They offer 7 and 15
> > disk options, but are low on the cpu front (quad core).  Our workload will,
> > I assume, be on the high side.  There's also an 8 disk Dell PowerEdge which
> > is a little more powerful.  What hosting providers would people recommend?
> >  (And what would be the strategy for migrating?)
> >
> > Anyhow, when I have things more stable I'll have a look at checking out
> > what's using the cpu.  In the meantime, can anything be gleaned from the
> > above snap?
> >
> > Cheers
> > H
> >
> >
> > On 24 October 2013 15:14, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Harry,
> >>
> >> Do you have more details on the exact load? Can you run vmstat and see
> >> what kind of load it is? Is it user? cpu? wio?
> >>
> >> I suspect your disks to be the issue. There are 2 things here.
> >>
> >> First, we don't recommend RAID for the HDFS/HBase disks. The best is to
> >> simply mount the disks on 2 mount points and give them both to HDFS.
> >> Second, 2 disks per node is very low. Even on a dev cluster that is not
> >> recommended. In production, you should go with 12 or more.
> >>
> >> So with only 2 disks in RAID, I suspect your WIO to be high, which is
> >> what might slow your process.
> >>
> >> Can you take a look in that direction? If it's not that, we will
> >> continue to investigate ;)
> >>
> >> Thanks,
> >>
> >> JM
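
As a rough illustration of the JBOD layout JM suggests above (device names,
paths, and the exact property name are assumptions and depend on the Hadoop
version):

    # mount each disk on its own mount point instead of a software RAID volume
    mount /dev/sda1 /data/1
    mount /dev/sdb1 /data/2
    # then list both directories in hdfs-site.xml so the DataNode uses both:
    #   dfs.datanode.data.dir (dfs.data.dir on Hadoop 1.x) =
    #     /data/1/dfs/dn,/data/2/dfs/dn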
> >>
> >>
> >> 2013/10/23 Harry Waye <[EMAIL PROTECTED]>
> >>
> >> > I'm trying to load data into hbase using HFileOutputFormat and
> >> > incremental bulk load but am getting rather lackluster performance, 10h
> >> > for ~0.5TB data, ~50000 blocks.  This is being loaded into a table that
> >> > has 2 families, 9 columns, 2500 regions and is ~10TB in size.  Keys are
> >> > md5 hashes and regions are pretty evenly spread.  The majority of time
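
For context, the bulk load flow described in the truncated original message
typically looks something like the sketch below (the job jar, class name, and
paths are hypothetical placeholders):

    # 1) MapReduce job writes HFiles with HFileOutputFormat;
    #    configureIncrementalLoad sets up TotalOrderPartitioner so there is
    #    one reducer per region (2500 here)
    hadoop jar bulk-load-job.jar com.example.BulkLoad /input /tmp/hfiles
    # 2) hand the finished HFiles over to the region servers
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable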