Re: One Region Server fails - all M/R jobs crash.
David Koch 2013-11-25, 08:36
Yes, rows can get very big, that's why we filter them. The filter lets KVs
pass as long as the KV count is < MAX_LIMIT and skips the row entirely once
the count exceeds this limit. KV size is about constant. Alternatively, we
could use batching, you are right.
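To make the two options concrete, here is a minimal sketch in plain Java. It does not use the HBase API and is not our actual filter: the class name RowLimitSketch, the methods filterRow and batch, and the limit value 3 are all made up for illustration, and KVs are stood in for by strings. It also assumes a row with exactly MAX_LIMIT KVs still passes. filterRow mimics the filter described above (drop the whole row once the KV count exceeds the limit), while batch mimics what a batch size does on a scan (return a wide row in several smaller chunks instead of dropping it).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RowLimitSketch {

    // Hypothetical value; the real MAX_LIMIT is not given in this thread.
    static final int MAX_LIMIT = 3;

    /**
     * Mimics the row filter: KVs are counted one by one, and the whole row
     * is dropped as soon as the count exceeds maxLimit.
     * Assumption: a row with exactly maxLimit KVs is still returned.
     */
    static List<String> filterRow(List<String> kvs, int maxLimit) {
        int count = 0;
        for (String kv : kvs) {
            count++;
            if (count > maxLimit) {
                return Collections.emptyList(); // skip the row entirely
            }
        }
        return new ArrayList<>(kvs);
    }

    /**
     * Mimics batching: a wide row is split into chunks of at most
     * batchSize KVs, so no single response has to hold the full row.
     */
    static List<List<String>> batch(List<String> kvs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < kvs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, kvs.size());
            chunks.add(new ArrayList<>(kvs.subList(i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> wideRow = Arrays.asList("kv1", "kv2", "kv3", "kv4");
        System.out.println("filtered: " + filterRow(wideRow, MAX_LIMIT));
        System.out.println("batched:  " + batch(wideRow, MAX_LIMIT));
    }
}
```

The trade-off is that the filter loses wide rows altogether, whereas batching keeps them but the job then has to handle one row arriving in several pieces.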
Also, with regard to the Java version used: Cloudera 4 installs its own JVM,
which happens to be Java 7, so it's not a choice we made.
I always thought the principle of Hadoop/HBase was to do big data on
commodity hardware. You suggest we get 1 disk per CPU? I am by no means an
expert in setting up this kind of system.
Thanks again for your response,
On Fri, Nov 22, 2013 at 9:06 PM, Dhaval Shah <[EMAIL PROTECTED]> wrote:
> How big can your rows get? If you have a million columns on a row, you
> might run your region server out of memory. Can you try setBatch to a
> smaller number and test if that works?
> 10k regions is too many. Can you try increasing your max file size and
> see if that helps?
> 8 cores / 1 disk is a bad combination. Can you look at disk IO during the
> time of the crash and see if you find anything there?
> You might also be swapping. Can you look at your GC logs?
> You are running dangerously close to the fence with the kind of hardware
> you have.
> From: David Koch <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, 22 November 2013 2:43 PM
> Subject: Re: One Region Server fails - all M/R jobs crash.
> Thank you for your replies.
> Not that it matters but, cache is 1, batch is -1 on the scan, i.e. each RPC
> call returns one row. The jobs don't write any data back to HBase,
> compaction is de-activated and done manually. At the time of the crash all
> datanodes were fine, hbck showed no inconsistencies. Table size is about
> 10k regions/3 billion records on the largest tables and we do a lot of
> server side filtering to limit what's sent across the network.
> Our machines may not be the most powerful, 32GB RAM, 8 cores, 1 disk. It's
> also true that when we took a closer look in the past it turned out that
> most of the issues we had were somehow rooted in the fact that CPUs were
> overloaded, not enough memory available - hardware stuff.
> What I don't get is why HBase always crashes. I mean if it's slow ok - the
> hardware is a bottleneck but at least you'd expect it to pull through
> eventually. Some days all jobs work fine, some days they don't and there is
> no telling why. HBase's erratic behavior has been causing us a lot of
> headache and we have been spending way too much time fiddling with HBase
> configuration settings over the past 18 months.
> On Fri, Nov 22, 2013 at 7:05 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > Thanks Dhaval for the analysis.
> > bq. The HBase version is 0.94.6
> > David:
> > Please upgrade to 0.94.13, if possible. There have been several JIRAs
> > backporting patches from trunk where jdk 1.7 is supported.
> > Please also check your DataNode log to see whether there was problem
> > (likely there was).
> > Cheers
> > On Sat, Nov 23, 2013 at 2:00 AM, Dhaval Shah <[EMAIL PROTECTED]> wrote:
> > > Your logs suggest that you are overloading resources
> > > (servers/network/memory). How much data are you scanning with your MR
> > > job, and how much are you writing back to HBase? What values are you
> > > setting for setBatch, setCaching, setCacheBlocks? How much memory do you
> > > have on region servers? 1 server crashing should not cause a job to fail;
> > > it will move on to the next one (given that the right params for retries
> > > and retry interval are set). Your region server logs suggest that it's
> > > way more complicated than that.
> > >
> > > 2013-11-17 09:58:37,513 WARN
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > > region we are already opening or closing;