

Re: Accumulo v1.4.1 - ran out of memory and lost data (RESOLVED - Data was restored)
Accumulo fully recovered when I restarted the loggers. Very impressive.

On Mon, Jan 28, 2013 at 9:32 AM, John Vines <[EMAIL PROTECTED]> wrote:
> And make sure the loggers didn't fill up their disk.
>
> Sent from my phone, please pardon the typos and brevity.
> On Jan 28, 2013 8:54 AM, "Eric Newton" <[EMAIL PROTECTED]> wrote:
>
>> What version of Accumulo was this?
>>
>> So, you have evidence (such as a message in a log) that the tablet server
>> ran out of memory?  Can you post that information?
>>
>> The ingested data should have been captured in the write-ahead log, and
>> recovered when the server was restarted.  There should never be any data
>> loss.
>>
>> You should be able to ingest like this without a problem.  It is a basic
>> test.  "Hold time" is the mechanism by which ingest is pushed back so that
>> the tserver can get the data written to disk.  You should not have to
>> manually back off.  Also, the tserver dynamically changes the point at
>> which it flushes data from memory, so you should see less and less hold
>> time.
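
A minimal client-side sketch of what that push-back looks like from an ingest program, assuming the Accumulo 1.4 BatchWriter API; the one-second threshold and the logging are placeholders, not anything from this thread. addMutation() normally just buffers in the client, so a call that stalls for seconds usually means the client buffer is full because the servers are holding commits:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;

public class TimedWriter {
  private final BatchWriter writer;

  public TimedWriter(BatchWriter writer) {
    this.writer = writer;
  }

  // Time each addMutation() call; a long stall suggests server-side push-back
  // (hold time) rather than normal client-side buffering.
  public void add(Mutation m) throws MutationsRejectedException {
    long start = System.currentTimeMillis();
    writer.addMutation(m);
    long elapsed = System.currentTimeMillis() - start;
    if (elapsed > 1000) {
      System.err.println("addMutation stalled for " + elapsed + " ms");
    }
  }
}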
>>
>> The garbage collector cannot run if the METADATA table is not online, or
>> has an inconsistent state.
>>
>> You are probably seeing a lower number of tablets because not all the
>> tablets are online.  They are probably offline due to failed recoveries.
>>
>> If you are running Accumulo 1.4, make sure you have stopped and restarted
>> all the loggers on the system.
>>
>> -Eric
>>
>> On Mon, Jan 28, 2013 at 8:28 AM, David Medinets <[EMAIL PROTECTED]
>> >wrote:
>>
>> > I had a plain Java program, single-threaded, that read an HDFS
>> > Sequence File with fairly small Sqoop records (probably under 200
>> > bytes each). As each record was read, a Mutation was created and then
>> > written via a BatchWriter to Accumulo. This program was as simple as
>> > it gets: read a record, write a mutation. The Row Id used YYYYMMDD (a
>> > date), so the ingest targeted one tablet. The ingest rate was over
>> > 150 million entries per hour for about 19 hours. Everything seemed
>> > fine. Over 3.5 billion entries were written. Then the nodes ran out
>> > of memory and the Accumulo nodes went dead. 90% of the servers were
>> > lost, and data poofed out of existence. Only 800M entries are visible
>> > now.
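
For reference, a minimal sketch of the kind of ingest loop described above, assuming the Accumulo 1.4 client API and the Hadoop SequenceFile reader; the instance name, ZooKeeper address, credentials, table name, and column layout are placeholders, since the original program is not posted in this thread:

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SimpleSeqFileIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder connection details; substitute your own instance, zookeepers,
    // and credentials.
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", "secret".getBytes());

    // 1.4-style BatchWriter: 50 MB buffer, 60 s max latency, 4 write threads.
    BatchWriter bw = conn.createBatchWriter("mytable", 50000000L, 60000L, 4);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    // All mutations share one YYYYMMDD row id, so every write lands on one tablet.
    String rowId = new SimpleDateFormat("yyyyMMdd").format(new Date());

    while (reader.next(key, val)) {
      Mutation m = new Mutation(new Text(rowId));
      // Hypothetical column layout; the record-to-column mapping isn't shown in the thread.
      m.put(new Text("data"), new Text(key.toString()), new Value(val.toString().getBytes()));
      bw.addMutation(m);
    }

    reader.close();
    bw.close();
  }
}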
>> >
>> > We restarted the data node processes and the cluster has been running
>> > garbage collection for over 2 days.
>> >
>> > I did not expect this simple approach to cause an issue. From looking
>> > at the log files, I think that at least two compactions were being
>> > run while still ingesting those 176 million entries per hour. The
>> > hold times started rising and eventually the system simply ran out of
>> > memory. I have no certainty about this explanation though.
>> >
>> > My current thinking is to re-initialize Accumulo and find some way to
>> > programmatically monitor the hold time, then add a delay to the
>> > ingest process whenever the hold time rises over 30 seconds. Does
>> > that sound feasible?
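
A rough sketch of that idea, for illustration only: fetchMaxHoldTimeMillis() is a hypothetical helper (the thread doesn't name a client API for reading hold time, so it would have to come from something like the Accumulo monitor or JMX), and the constants are just the 30-second figure above plus made-up values:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;

public class ThrottledIngest {
  private static final long MAX_HOLD_MILLIS = 30 * 1000L;  // back off above 30 seconds
  private static final long BACKOFF_MILLIS  = 60 * 1000L;  // pause length while hold time is high
  private static final long CHECK_EVERY     = 100000L;     // mutations between hold-time checks

  private final BatchWriter writer;
  private long written = 0;

  public ThrottledIngest(BatchWriter writer) {
    this.writer = writer;
  }

  public void write(Mutation m) throws MutationsRejectedException, InterruptedException {
    writer.addMutation(m);
    if (++written % CHECK_EVERY == 0) {
      while (fetchMaxHoldTimeMillis() > MAX_HOLD_MILLIS) {
        // Give the tablet servers time to flush their in-memory maps before continuing.
        Thread.sleep(BACKOFF_MILLIS);
      }
    }
  }

  // Hypothetical helper: would need to pull the largest per-tserver hold time from
  // an external source such as the Accumulo monitor or JMX.
  private long fetchMaxHoldTimeMillis() {
    return 0L; // placeholder
  }
}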
>> >
>> > I know there are other approaches to ingest and I might give up this
>> > method and use another. I was trying to get some kind of baseline for
>> > analysis reasons with this approach.
>> >
>>