|
|
-
hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 08:37
Hi All.
I get org.apache.hadoop.fs.ChecksumException for a table under heavy write load in standalone mode. The table tmp.bsn.main was created at 2010-09-22 10:42:28,860, and then 5 threads write data to it. At some point the exception is thrown.
Andrey.
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Ryan Rawson 2010-09-22, 08:57
Very strange... looks like a bad block ended up in your scanner and subsequent next() calls were failing due to that short read.
Did you have to kill the regionserver, or did things recover and continue normally?
-ryan
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Ryan Rawson 2010-09-22, 08:59
Ah OK, I think I get it... basically at this point your scanner is bad and iterating on it again won't work. The scanner should probably close itself so you don't get tons of additional exceptions, but right now it doesn't.
There is probably a better fix for this, I'll ponder.
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:08
But why is it bad? Split/compaction? I made my own RetryResultIterator, which reopens the scanner on timeout. But what is the best way to reopen a scanner? Can you point me to where I can find all these exceptions? Or maybe there is already some sort of recoverable iterator?
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Ryan Rawson 2010-09-22, 09:11
Generally, checksum errors are due to hardware faults of one kind or another.
What is your hardware like?
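To make the point about checksums concrete: a ChecksumException means the bytes read back no longer match the checksum recorded when they were written, so even a single bit flipped anywhere along the data path (RAM, controller, disk) is enough to trigger it. The sketch below only illustrates that idea. The class `ChecksumDemo` and its methods are hypothetical names, and CRC32 from the JDK is used as a stand-in; this is not the Hadoop code path itself.

```java
import java.util.zip.CRC32;

/** Illustration only: how a stored checksum exposes a single flipped bit. */
final class ChecksumDemo {

    /** Computes the CRC32 of a byte array. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns true when the data still matches the checksum recorded at write time. */
    static boolean verify(byte[] data, long expectedCrc) {
        return crc32(data) == expectedCrc;
    }

    /** Returns a copy with one bit flipped, simulating corruption somewhere on the data path. */
    static byte[] flipBit(byte[] data, int byteIndex, int bit) {
        byte[] copy = data.clone();
        copy[byteIndex] ^= (byte) (1 << bit);
        return copy;
    }
}
```

Computing `crc32` over a block at write time and calling `verify` at read time is the whole mechanism; `flipBit` is there only to simulate a fault. A mismatch says that the data changed, not where on the path it changed, which is why the hardware questions in this thread matter.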
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:14
HP ProLiant, RAID 10 with 4 SAS 15k disks, Smart Array 6i, 2 CPU / 4 cores.
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:17
But yesterday HBase was 0.20.6 and the exceptions were different.
From my previous email: I need to do a massive data rewrite in some family on a standalone server. I got org.apache.hadoop.hbase.NotServingRegionException or java.io.IOException: Region xxx closed if I write and read at the same time.
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Ryan Rawson 2010-09-22, 09:18
Why are you using such expensive disks? RAID + HDFS = lower performance than non-RAID.
How's your RAM? How are your network switches? NICs? etc. Anything along the data path can introduce errors.
In this case we did the right thing and threw exceptions, but it looks like your client continues to call next() despite getting exceptions... can you check your client code to verify this?
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:25
2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
> why are you using such expensive disks? raid + hdfs = lower
> performance than non-raid.
It was a database server before we migrated to HBase; it was designed for PostgreSQL. Now, with compression and the nature of HBase, our database is 12 GB instead of 180 GB in PostgreSQL. So this server was not designed for HBase. In production (0.20.6) we have much lighter servers (3) with simple dual SATA drives.
> how's your ram? hows your network switches? NICs? etc etc.
> anything along the data path can introduce errors.
No, everything is on one machine. 17 GB RAM (5 GB for HBase).
> in this case we did the right thing and threw exceptions, but looks
> like your client continues to call next() despite getting
> exceptions... can you check your client code to verify this?
Hm, I'll check. But I use only a simple wrapper around ResultScanner: http://pastebin.org/1074628. It should bail out on any exception (except ScannerTimeoutException).
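The behaviour described for the wrapper (reopen on a scanner timeout, bail out on anything else) could be sketched roughly as below. This is not the code from the pastebin link and it does not use the HBase client API: `RowScanner`, `ScannerFactory`, `ScannerTimeout`, and `RetryingScan` are all hypothetical stand-ins so the example is self-contained, and rows are reduced to plain string keys. The assumption is that a scan can be restarted strictly after the last row that was successfully returned.

```java
import java.io.IOException;

/** Hypothetical stand-in for a scanner over ordered row keys; next() returns null at end of scan. */
interface RowScanner {
    String next() throws IOException;
    void close();
}

/** Hypothetical factory: opens a scanner positioned strictly after lastRow (null means from the start). */
interface ScannerFactory {
    RowScanner openAfter(String lastRow) throws IOException;
}

/** Hypothetical stand-in for a scanner lease timeout, the only error treated as recoverable here. */
class ScannerTimeout extends IOException {
    ScannerTimeout(String message) { super(message); }
}

/** Wraps a scanner: reopens after a timeout, closes and rethrows on any other I/O error. */
final class RetryingScan {
    private final ScannerFactory factory;
    private final int maxReopens;
    private RowScanner scanner;
    private String lastRow;   // last row successfully handed to the caller
    private int reopens;      // total reopen attempts so far

    RetryingScan(ScannerFactory factory, int maxReopens) throws IOException {
        this.factory = factory;
        this.maxReopens = maxReopens;
        this.scanner = factory.openAfter(null);
    }

    /** Returns the next row key, or null when the scan is finished. */
    String next() throws IOException {
        while (true) {
            try {
                String row = scanner.next();
                if (row != null) {
                    lastRow = row;
                }
                return row;
            } catch (ScannerTimeout e) {
                // Recoverable: drop the stale scanner and resume after the last good row.
                scanner.close();
                if (++reopens > maxReopens) {
                    throw e;
                }
                scanner = factory.openAfter(lastRow);
            } catch (IOException e) {
                // Anything else (for example a checksum failure) is not retried:
                // close the scanner and let the caller see the error once.
                scanner.close();
                throw e;
            }
        }
    }

    void close() {
        scanner.close();
    }
}
```

The point of the second catch branch is that a failed scanner is never iterated again, which is the situation suspected earlier in the thread. Bounding the number of reopens keeps a persistently failing region from turning into an endless retry loop.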
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:29
One more note. This database was 0.20.6 before. Then I started 0.89 over it. (But the table with the wrong checksum was created in HBase 0.89.)
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Ryan Rawson 2010-09-22, 09:38
So the client code looks good; hard to say what exactly is going on.
BTW I opened this JIRA to address the confusing exception in this case: https://issues.apache.org/jira/browse/HBASE-3029
It's hard to say why you get that exception under load... some systems have been known to give weird flaky faults under load. It used to be that compiling the Linux kernel was a simple benchmark for RAM problems. If you have time you could try memtest86 to see if the memory has issues, since that is a common source of errors.
-ryan
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:44
Thanks. Now I am running the same job on HBase 0.89 over Cloudera Hadoop instead of standalone mode. Maybe there is some bug in standalone mode that prevents correct data from being written to disk. Later I'll check the memory.
Btw, Linux is openSUSE 11.0, 2.6.25.20-0.7-default, 64 bit.
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:50
Very strange. With HBase over Hadoop there are no such checksum errors. I'll recheck on another big family.
-
Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 11:09
I ran the same job on HBase over Hadoop: everything works like a charm.
I can draw two possible conclusions:
1. Some bug in standalone mode.
2. Memory, but I think this is not the case (the disks are the same, the memory is the same, the machine is the same, the workload is the same, but the result differs).
Later I'll try to write a test case.
|