RE: EXTERNAL: Re: Failing Tablet Servers
My guess would be that you are building an object several gigabytes in size
and Accumulo is copying it. Do you need all of those entries to be applied
atomically (in which case you should look into bulk loading), or can you
break them up into multiple mutations? I would say you should keep your
mutations under ten megabytes or so for performance. Bigger mutations won't
speed things up past that point.

Adam
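
A minimal sketch of the batching Adam suggests, assuming a mapper that writes to AccumuloOutputFormat; the table name, family/qualifier strings, and the ~10 MB threshold below are illustrative, not taken from this thread:

// Sketch only: cap each Mutation at roughly 10 MB and hand it to the
// MapReduce context instead of building one Mutation holding ~1,000,000 puts.
import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchedIngestMapper extends Mapper<Text, Text, Text, Mutation> {

  private static final long MAX_MUTATION_BYTES = 10L * 1024 * 1024; // ~10 MB
  private final Text table = new Text("ingest_table"); // hypothetical table name

  @Override
  protected void map(Text row, Text value, Context context)
      throws IOException, InterruptedException {
    Mutation m = new Mutation(row);
    for (int i = 0; i < 1_000_000; i++) {       // stand-in for the real column loop
      m.put(new Text("fam" + (i % 160_000)),    // illustrative family
            new Text("qual" + i),               // illustrative qualifier
            new Value(value.copyBytes()));
      if (m.numBytes() > MAX_MUTATION_BYTES) {  // serialized size so far
        context.write(table, m);                // ship this chunk of the row
        m = new Mutation(row);                  // continue the same row in a new Mutation
      }
    }
    if (m.size() > 0) {
      context.write(table, m);                  // ship whatever is left
    }
  }
}

If every entry in the row really does have to appear atomically, bulk loading (as Adam mentions) is the alternative to splitting the row across mutations.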
On Sep 20, 2012 6:51 PM, "Cardon, Tejay E" <[EMAIL PROTECTED]> wrote:

>  Sorry, yes it’s the AccumuloOutputFormat.  I do about 1,000,000
> mutation.puts before I do a context.write.  Any idea how many is safe?
>
> Thanks,
>
> Tejay
>
> From: Jim Klucar [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, September 20, 2012 4:44 PM
> To: [EMAIL PROTECTED]
> Subject: Re: EXTERNAL: Re: Failing Tablet Servers
>
> Do you mean AccumuloOutputFormat? Is the map failing or the reduce
> failing? How many Mutation.put calls are you doing before a context.write?
> Too many puts will crash the Mutation object. You need to periodically call
> context.write and create a new Mutation object. At some point I wrote a
> ContextFlushingMutation that handled this problem for you, but I'd have to
> dig around for it or rewrite it.
>
> Sent from my iPhone
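
Jim's ContextFlushingMutation isn't posted in the thread; the sketch below is only a guess at the shape of such a wrapper (the class body, names, and flush threshold here are made up): count the puts and hand the partial row to the context every N of them.

// Hypothetical reconstruction, not the original ContextFlushingMutation:
// wraps a Mutation and flushes it through the MapReduce context after a
// fixed number of puts, then keeps appending to the same row.
import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;

public class ContextFlushingMutation {

  private final Text table;
  private final Text row;
  private final int flushEvery;
  private final TaskInputOutputContext<?, ?, Text, Mutation> context;
  private Mutation current;
  private int puts = 0;

  public ContextFlushingMutation(Text table, Text row, int flushEvery,
      TaskInputOutputContext<?, ?, Text, Mutation> context) {
    this.table = table;
    this.row = row;
    this.flushEvery = flushEvery;
    this.context = context;
    this.current = new Mutation(row);
  }

  public void put(Text cf, Text cq, Value v)
      throws IOException, InterruptedException {
    current.put(cf, cq, v);
    if (++puts % flushEvery == 0) {
      context.write(table, current);  // emit the partial row
      current = new Mutation(row);    // start a fresh Mutation for the same row
    }
  }

  public void close() throws IOException, InterruptedException {
    if (current.size() > 0) {
      context.write(table, current);  // emit whatever is left
    }
  }
}

A map() would create one of these per row, call put() in its column loop, and call close() when the row is finished.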
>
>
> On Sep 20, 2012, at 5:29 PM, "Cardon, Tejay E" <[EMAIL PROTECTED]>
> wrote:
>
>  John,
>
> Thanks for the quick response.  I’m not seeing any errors in the logger
> logs.  I am using native maps, and I left the memory map size at 1GB.  I
> assume that’s plenty large if I’m using native maps, right?
>
> Thanks,
>
> Tejay
>
> From: John Vines [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, September 20, 2012 3:20 PM
> To: [EMAIL PROTECTED]
> Subject: EXTERNAL: Re: Failing Tablet Servers
>
> Okay, so we know that you're killing servers. We know when you drop the
> amount of data down, you have no issues. There are two immediate issues
> that come to mind-
> 1. You modified the tserver opts to give them 10G of memory. Did you up the
> memory map size in accumulo-site.xml to make it larger, or did you leave
> it alone? Or did you up it to match the 10G? If you upped it and aren't
> using the native maps, that would be problematic, as you need heap space
> for other purposes as well.
>
> 2. You seem to be making giant rows. Depending on your Key/Value size,
> it's possible to write a row that you cannot send (especially if you are
> using a WholeRowIterator), which can cause a cascading error during log
> recovery. Are you seeing any sort of errors in your loggers' logs?
>
> John
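
For reference, the in-memory map settings John is asking about live in accumulo-site.xml; a sketch of the relevant fragment (the values shown are examples only, not a recommendation from this thread):

<!-- accumulo-site.xml fragment; example values only. -->
<property>
  <name>tserver.memory.maps.max</name>
  <!-- In-memory map size. With the Java maps this has to fit inside the
       tserver heap alongside everything else; with native maps it is
       allocated outside the heap. -->
  <value>1G</value>
</property>
<property>
  <name>tserver.memory.maps.native.enabled</name>
  <value>true</value>
</property>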
>
> On Thu, Sep 20, 2012 at 5:05 PM, Cardon, Tejay E <[EMAIL PROTECTED]>
> wrote:
>
> I’m seeing some strange behavior on a moderate (30 node) cluster.  I’ve
> got 27 tablet servers on large Dell servers with 30GB of memory each.  I’ve
> set the TServer_OPTS to give them each 10G of memory.  I’m running an
> ingest process that uses AccumuloInputFormat in a MapReduce job to write
> 1,000 rows with each row containing ~1,000,000 columns in 160,000
> families.  The MapReduce initially runs quite quickly and I can see the
> ingest rate peak on the monitor page.  However, after about 30 seconds of
> high ingest, the ingest falls to 0.  It then stalls out and my map tasks are
> eventually killed.  In the end, the MapReduce fails and I usually end up
> with between 3 and 7 of my tservers dead.
>
> Inspecting the tserver.err logs shows nothing, even on the nodes that
> fail.  The tserver.out log shows a java OutOfMemoryError, and nothing
> else.  I’ve included a zip with the logs from one of the failed tservers
> and a second one with the logs from the master.  Other than the out of
> memory, I’m not seeing anything that stands out to me.
>
> If I reduce the data size to only 100,000 columns, rather than 1,000,000,
> the process takes about 4 minutes and completes without incident.