-
Accumulo v1.4.1 - ran out of memory and lost data
David Medinets 2013-01-28, 13:28
I had a plain Java program, single-threaded, that read an HDFS Sequence File with fairly small Sqoop records (probably under 200 bytes each). As each record was read, a Mutation was created, then written via Batch Writer to Accumulo. This program was as simple as it gets: read a record, write a mutation. The Row Id used YYYYMMDD (a date), so the ingest targeted one tablet. The ingest rate was over 150 million entries per hour for about 19 hours. Everything seemed fine. Over 3.5 billion entries were written. Then the nodes ran out of memory and the Accumulo nodes went dead. 90% of the servers were lost, and data poofed out of existence. Only 800M entries are visible now.

We restarted the data node processes, and the cluster has been running garbage collection for over 2 days.

I did not expect this simple approach to cause an issue. From looking at the log files, I think that at least two compactions were running while the program was still ingesting those 176 million entries per hour. The hold times started rising and eventually the system simply ran out of memory. I have no certainty about this explanation though.

My current thinking is to re-initialize Accumulo and find some way to programmatically monitor the hold time, then add a delay to the ingest process whenever the hold time rises over 30 seconds. Does that sound feasible?

I know there are other approaches to ingest, and I might give up this method and use another. I was trying to get some kind of baseline for analysis reasons with this approach.
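The back-off idea described in the post could be sketched roughly as follows. This is only an illustration under stated assumptions: the `LongSupplier` that reports hold time is a hypothetical stand-in (the post does not say how hold time would be read from Accumulo, and no Accumulo API is used here), and the class name, the "pause for the excess, capped" policy, and the 5-second cap are all invented for the example. Only the 30-second threshold comes from the post. A reply in this thread also notes that the tablet server applies this kind of push-back itself, so a client-side throttle may not be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Sketch of client-side throttling driven by a reported hold time (hypothetical design). */
final class HoldTimeThrottle {
    // Threshold taken from the post: back off once hold time rises over 30 seconds.
    static final long THRESHOLD_MS = 30_000L;

    /** Assumed policy: no pause at or below the threshold, otherwise pause for the excess, capped. */
    static long delayMillis(long holdTimeMs, long maxDelayMs) {
        if (holdTimeMs <= THRESHOLD_MS) {
            return 0L;
        }
        return Math.min(holdTimeMs - THRESHOLD_MS, maxDelayMs);
    }

    /**
     * Writes each record, pausing first whenever the reported hold time is too high.
     * holdTimeMs is a hypothetical source of the current hold time; sleeper performs the pause.
     * Returns the total number of milliseconds the loop asked to pause.
     */
    static long ingest(List<String> records, LongSupplier holdTimeMs,
                       Consumer<String> writer, LongConsumer sleeper, long maxDelayMs) {
        long totalPausedMs = 0L;
        for (String record : records) {
            long delay = delayMillis(holdTimeMs.getAsLong(), maxDelayMs);
            if (delay > 0) {
                sleeper.accept(delay);
                totalPausedMs += delay;
            }
            writer.accept(record);
        }
        return totalPausedMs;
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        long[] simulatedHoldTimes = {0L, 10_000L, 45_000L}; // made-up readings
        int[] next = {0};
        long paused = ingest(
                Arrays.asList("r1", "r2", "r3"),
                () -> simulatedHoldTimes[next[0]++],
                written::add,
                ms -> { /* a real client would call Thread.sleep(ms) here */ },
                5_000L);
        System.out.println("written=" + written.size() + " pausedMs=" + paused);
    }
}
```

Passing the sleep action in as a parameter keeps the loop testable without real waiting; in an actual ingest program the writer would be whatever call hands the mutation to the Batch Writer.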
-
Re: Accumulo v1.4.1 - ran out of memory and lost data
Eric Newton 2013-01-28, 13:53
What version of Accumulo was this?

So, you have evidence (such as a message in a log) that the tablet server ran out of memory? Can you post that information?

The ingested data should have been captured in the write-ahead log, and recovered when the server was restarted. There should never be any data loss.

You should be able to ingest like this without a problem. It is a basic test. "Hold time" is the mechanism by which ingest is pushed back so that the tserver can get the data written to disk. You should not have to manually back off. Also, the tserver dynamically changes the point at which it flushes data from memory, so you should see less and less hold time.

The garbage collector cannot run if the METADATA table is not online, or has an inconsistent state.

You are probably seeing a lower number of tablets because not all the tablets are online. They are probably offline due to failed recoveries.

If you are running Accumulo 1.4, make sure you have stopped and restarted all the loggers on the system.

-Eric
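The point about the write-ahead log can be illustrated with a toy model. This is not Accumulo's implementation, and every name in it (`ToyWalServer`, `crash`, `recover`) is invented for the example. It only shows the general idea: each write is appended to a durable log before it is applied in memory, so replaying the log after a crash restores what memory lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of write-ahead logging; a sketch of the concept, not Accumulo's code. */
final class ToyWalServer {
    // Stands in for the durable log: in this model it survives a crash.
    private final List<String[]> writeAheadLog = new ArrayList<>();
    // Stands in for data held only in memory: in this model it is lost on a crash.
    private Map<String, String> memTable = new TreeMap<>();

    /** Appends to the log first, then applies the write in memory. */
    void write(String row, String value) {
        writeAheadLog.add(new String[] {row, value});
        memTable.put(row, value);
    }

    /** Simulates a server dying: everything held in memory is gone. */
    void crash() {
        memTable = new TreeMap<>();
    }

    /** Replays the log in order to rebuild the in-memory state. */
    void recover() {
        for (String[] entry : writeAheadLog) {
            memTable.put(entry[0], entry[1]);
        }
    }

    int visibleEntries() {
        return memTable.size();
    }

    public static void main(String[] args) {
        ToyWalServer server = new ToyWalServer();
        server.write("20130126", "a");
        server.write("20130127", "b");
        server.write("20130128", "c");
        System.out.println("before crash: " + server.visibleEntries());
        server.crash();
        System.out.println("after crash: " + server.visibleEntries());
        server.recover();
        System.out.println("after recovery: " + server.visibleEntries());
    }
}
```

In this model the entries look lost between the crash and the replay, which loosely mirrors the situation in the thread where data seemed to vanish until recovery could complete.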
-
Re: Accumulo v1.4.1 - ran out of memory and lost data
John Vines 2013-01-28, 14:32
And make sure the loggers didn't fill up their disk.
-
Re: Accumulo v1.4.1 - ran out of memory and lost data (RESOLVED - Data was restored)
David Medinets 2013-01-28, 16:24
Accumulo fully recovered when I restarted the loggers. Very impressive.
-
Re: Accumulo v1.4.1 - ran out of memory and lost data (RESOLVED - Data was restored)
Christopher 2013-01-29, 23:49
FYI - changing the subject line puts the email in a different thread. Probably best to avoid that.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
-
Re: Accumulo v1.4.1 - ran out of memory and lost data
Keith Turner 2013-01-30, 16:30
Was this resolved?
-
Re: Accumulo v1.4.1 - ran out of memory and lost data
John Vines 2013-01-30, 16:35
yes
-
Re: Accumulo v1.4.1 - ran out of memory and lost data
David Medinets 2013-01-30, 16:36
Yes. Accumulo fully recovered when I restarted the loggers.