Accumulo v1.4.1 - ran out of memory and lost data (RESOLVED - Data was restored)


Re: Accumulo v1.4.1 - ran out of memory and lost data
What version of Accumulo was this?

So, you have evidence (such as a message in a log) that the tablet server
ran out of memory?  Can you post that information?

The ingested data should have been captured in the write-ahead log, and
recovered when the server was restarted.  There should never be any data
loss.

You should be able to ingest like this without a problem.  It is a basic
test.  "Hold time" is the mechanism by which ingest is pushed back so that
the tserver can get the data written to disk.  You should not have to
manually back off.  Also, the tserver dynamically changes the point at
which it flushes data from memory, so you should see less and less hold
time.
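
As a rough sketch only (Accumulo 1.4-style client API; the instance name, ZooKeeper quorum, credentials, table name, and buffer sizes below are placeholders), the back-pressure lives inside the BatchWriter itself: while the tserver is holding commits, the client-side buffer fills and addMutation() blocks until the server catches up.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;

    public class WriterSetup {
        public static void main(String[] args) throws Exception {
            // Placeholder instance name, ZooKeeper quorum, and credentials.
            Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
                    .getConnector("root", "secret".getBytes());

            // 1.4 signature: createBatchWriter(table, maxMemory, maxLatencyMs,
            // maxWriteThreads).  While the tserver is in hold, this buffer fills
            // and addMutation() simply blocks -- no manual sleep is needed.
            BatchWriter bw = conn.createBatchWriter("mytable",
                    50L * 1024 * 1024, // client-side buffer, in bytes
                    60000L,            // max ms a mutation may sit in the buffer
                    4);                // background writer threads
            // ... addMutation(...) calls go here ...
            bw.close();
        }
    }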

The garbage collector cannot run if the METADATA table is not online, or
has an inconsistent state.

You are probably seeing a lower number of tablets because not all the
tablets are online.  They are probably offline due to failed recoveries.

If you are running Accumulo 1.4, make sure you have stopped and restarted
all the loggers on the system.

-Eric

On Mon, Jan 28, 2013 at 8:28 AM, David Medinets <[EMAIL PROTECTED]> wrote:

> I had a plain Java program, single-threaded, that read an HDFS
> SequenceFile with fairly small Sqoop records (probably under 200
> bytes each). As each record was read, a Mutation was created and then
> written via a BatchWriter to Accumulo. This program was as simple as
> it gets: read a record, write a mutation. The row id used YYYYMMDD (a
> date), so the ingest targeted one tablet. The ingest rate was over 150
> million entries per hour for about 19 hours. Everything seemed fine.
> Over 3.5 billion entries were written. Then the nodes ran out of
> memory and the Accumulo nodes went dead. About 90% of the servers were
> lost, and data poofed out of existence. Only 800M entries are visible
> now.
>
> We restarted the data node processes and the cluster has been running
> garbage collection for over 2 days.
>
> I did not expect this simple approach to cause an issue. From looking
> at the log files, I think that at least two compactions were being run
> while still ingesting those 176 million entries per hour. The hold
> times started rising and eventually the system simply ran out of
> memory. I have no certainty about this explanation, though.
>
> My current thinking is to re-initialize Accumulo and find some way to
> programmatically monitor the hold time, then add a delay to the
> ingest process whenever the hold time rises over 30 seconds. Does
> that sound feasible?
>
> I know there are other approaches to ingest and I might give up this
> method and use another. I was trying to get some kind of baseline for
> analysis reasons with this approach.
>
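
As a rough sketch only (not the original program), the single-threaded loop described above might look something like the following against the 1.4-era Hadoop and Accumulo APIs; the row id, column family/qualifier, buffer sizes, and the use of toString() on the record are illustrative placeholders.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileIngest {

        /** Reads every record from one SequenceFile and writes one Mutation per record. */
        public static void ingest(Connector conn, String table, Path path) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);

            // Key/value types come from the file header (e.g. LongWritable / a Sqoop record class).
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            // 1.4-style BatchWriter; see the earlier sketch for what the numbers mean.
            BatchWriter bw = conn.createBatchWriter(table, 50L * 1024 * 1024, 60000L, 4);
            try {
                while (reader.next(key, val)) {
                    // A date-based row id like YYYYMMDD funnels every write into one tablet.
                    Mutation m = new Mutation(new Text("20130128"));
                    m.put(new Text("cf"), new Text(key.toString()),
                            new Value(val.toString().getBytes()));
                    bw.addMutation(m); // blocks when the client buffer is full (e.g. during hold)
                }
            } finally {
                bw.close();    // flushes any remaining buffered mutations
                reader.close();
            }
        }
    }

Because the row id is date-based, all of this load lands on a single tablet (and thus one tablet server), which matches the behavior described in the quoted message.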