Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Andrey Stepachev 2010-09-22, 09:50
Very strange. With hbase over hadoop there are no such checksum errors.
Very strange. I'll recheck on another big family.

2010/9/22 Andrey Stepachev <[EMAIL PROTECTED]>:
> Thanks. Now i run the same job on
> hbase 0.89 over cloudera hadoop instead of standalone mode.
> May be here some bug in standalone mode, which prevents to
> write correct data on disk. And later I'll check memory.
>
> Btw, linux is opensuse 11.0. 2.6.25.20-0.7-default 64 bit.
>
> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>> So the client code looks good, hard to say what exactly is going on.
>>
>> BTW I opened this JIRA:
>> https://issues.apache.org/jira/browse/HBASE-3029
>>
>> To address the confusing exception in this case.
>>
>> It's hard to say why you get that exception under load... some systems
>> have been known to give weird flaky faults under load. It used to be
>> compiling the linux kernel was a simple benchmark for RAM problems.
>> If you have time you could try memtest86 to see if the memory has
>> issues, since that is a common place of errors.
>>
>> -ryan
>>
>> On Wed, Sep 22, 2010 at 2:29 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>>> One more note. This database was 0.20.6 before. Then
>>> i start 0.89 over it.
>>> (but table with wrong checksum was created in 0.89 hbase)
>>>
>>> 2010/9/22 Andrey Stepachev <[EMAIL PROTECTED]>:
>>>> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>>>>> why are you using such expensive disks? raid + hdfs = lower
>>>>> performance than non-raid.
>>>>
>>>> It was database server, before we migrate to hbase. It was designed
>>>> for postgresql. Now with compression and hbase nature our database
>>>> is 12Gb instead of 180GB in pg.
>>>> So this server was not designed for hbase.
>>>> In production (0.20.6) we much lighter servers (3) with simle dual
>>>> sata drives.
>>>>
>>>>>
>>>>> how's your ram? hows your network switches? NICs? etc etc.
>>>>> anything along the data path can introduce errors.
>>>>
>>>> no. all things on one machined. 17Gb ram (5GB hbase)
>>>>
>>>>>
>>>>> in this case we did the right thing and threw exceptions, but looks
>>>>> like your client continues to call next() despite getting
>>>>> exceptions... can you check your client code to verify this?
>>>>
>>>> hm. i check. but i use only simple wrapper around ResultScanner
>>>> http://pastebin.org/1074628. It should bail out on exception (except
>>>> ScannerTimeoutException)
>>>>
>>>>>
>>>>> On Wed, Sep 22, 2010 at 2:14 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>>>>>> hp proliant raid 10 with 4 sas. 15k. smartarray 6i. 2cpu/4core.
>>>>>>
>>>>>> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>>>>>>> generally checksum errors are due to hardware faults of one kind or another.
>>>>>>>
>>>>>>> what is your hardware like?
>>>>>>>
>>>>>>> On Wed, Sep 22, 2010 at 2:08 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>>>>>>>> But why it is bad? Split/compaction? I make my own RetryResultIterator
>>>>>>>> which reopen scanner on timeout. But what is best way to reopen scanner.
>>>>>>>> Can you point me where i can find all this exceptions? Or may be
>>>>>>>> here already some sort for recoveratble iterator?
>>>>>>>>
>>>>>>>> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>>>>>>>>> ah ok i think i get it... basically at this point your scanner is bad
>>>>>>>>> and iterating on it again won't work. the scanner should probably
>>>>>>>>> self close itself so you get tons of additional exceptions but instead
>>>>>>>>> we dont.
>>>>>>>>>
>>>>>>>>> there is probably a better fix for this, i'll ponder
>>>>>>>>>
>>>>>>>>> On Wed, Sep 22, 2010 at 1:57 AM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
>>>>>>>>>> very strange... looks like a bad block ended up in your scanner and
>>>>>>>>>> subsequent nexts were failing due to that short read.
>>>>>>>>>>
>>>>>>>>>> did you have to kill the regionserver or did things recover and
>>>>>>>>>> continue normally?
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2010 at 1:37 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: