HBase >> mail # user >> hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error


Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Very strange. With hbase over hadoop there are no such checksum errors.
I'll recheck on another big family.

2010/9/22 Andrey Stepachev <[EMAIL PROTECTED]>:
> Thanks. Now I run the same job on
> hbase 0.89 over Cloudera hadoop instead of standalone mode.
> Maybe there is some bug in standalone mode which prevents
> correct data from being written to disk. And later I'll check memory.
>
> Btw, linux is opensuse 11.0. 2.6.25.20-0.7-default 64 bit.
>
> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>> So the client code looks good; it's hard to say what exactly is going on.
>>
>> BTW I opened this JIRA:
>> https://issues.apache.org/jira/browse/HBASE-3029
>>
>> To address the confusing exception in this case.
>>
>> It's hard to say why you get that exception under load... some systems
>> have been known to give weird flaky faults under load.  It used to be
>> that compiling the linux kernel was a simple benchmark for RAM problems.
>> If you have time you could try memtest86 to see if the memory has
>> issues, since that is a common source of errors.
>>
>> -ryan
>>
>> On Wed, Sep 22, 2010 at 2:29 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>>> One more note. This database was 0.20.6 before. Then
>>> I started 0.89 over it.
>>> (But the table with the wrong checksum was created in 0.89 hbase.)
>>>
>>> 2010/9/22 Andrey Stepachev <[EMAIL PROTECTED]>:
>>>> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>>>>> why are you using such expensive disks?  raid + hdfs = lower
>>>>> performance than non-raid.
>>>>
>>>> It was a database server before we migrated to hbase. It was designed
>>>> for postgresql. Now, with compression and hbase's nature, our database
>>>> is 12GB instead of 180GB in pg.
>>>> So this server was not designed for hbase.
>>>> In production (0.20.6) we use much lighter servers (3) with simple
>>>> dual sata drives.
>>>>
>>>>>
>>>>> how's your ram?  hows your network switches?  NICs?  etc etc.
>>>>> anything along the data path can introduce errors.
>>>>
>>>> No. All things are on one machine. 17GB RAM (5GB for hbase).
>>>>
>>>>>
>>>>> in this case we did the right thing and threw exceptions, but looks
>>>>> like your client continues to call next() despite getting
>>>>> exceptions... can you check your client code to verify this?
>>>>
>>>> Hm, I'll check. But I use only a simple wrapper around ResultScanner:
>>>> http://pastebin.org/1074628. It should bail out on any exception
>>>> (except ScannerTimeoutException).
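The "bail out on everything except the scanner timeout" behaviour described above can be sketched roughly as follows. This is a minimal, self-contained sketch, not the pastebin code: the `Scanner` interface, `ScannerTimeout` class, and `Factory` callback are stand-ins for the real HBase `ResultScanner`, `ScannerTimeoutException`, and scan-opening call, which need a live cluster.

```java
// Stand-in for HBase's ScannerTimeoutException (hypothetical name).
class ScannerTimeout extends RuntimeException {}

// Stand-in for the HBase ResultScanner API (hypothetical, simplified).
interface Scanner {
    String next();   // returns null at end of scan
    void close();
}

// Wrapper that propagates every exception to the caller, except the
// timeout, which it survives by closing and reopening via a factory.
class RetryingScanner implements Scanner {
    interface Factory { Scanner open(); }

    private final Factory factory;
    private Scanner inner;

    RetryingScanner(Factory factory) {
        this.factory = factory;
        this.inner = factory.open();
    }

    @Override
    public String next() {
        try {
            return inner.next();
        } catch (ScannerTimeout e) {
            // Timed out: reopen and retry once. Any other exception
            // (e.g. a wrapped ChecksumException) propagates unchanged,
            // and the caller should stop iterating.
            inner.close();
            inner = factory.open();
            return inner.next();
        }
    }

    @Override
    public void close() { inner.close(); }
}
```

A real wrapper would also track the last row seen so the reopened scan resumes where the old one stopped, rather than restarting from the beginning.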
>>>>
>>>>>
>>>>> On Wed, Sep 22, 2010 at 2:14 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>>>>>> hp proliant raid 10 with 4 sas. 15k. smartarray 6i. 2cpu/4core.
>>>>>>
>>>>>> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>>>>>>> generally checksum errors are due to hardware faults of one kind or another.
>>>>>>>
>>>>>>> what is your hardware like?
>>>>>>>
>>>>>>> On Wed, Sep 22, 2010 at 2:08 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>>>>>>>> But why is it bad? Split/compaction? I made my own RetryResultIterator
>>>>>>>> which reopens the scanner on timeout. But what is the best way to
>>>>>>>> reopen a scanner? Can you point me to where I can find all these
>>>>>>>> exceptions? Or maybe there is already some sort of recoverable iterator?
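On the "best way to reopen a scanner" question: a common approach with the HBase client of that era (stated here as an assumption, not as the project's official recipe) is to remember the last row key the old scanner returned and reopen the scan with a start row strictly after it. Since start rows are inclusive, the smallest such key is the last row with a single 0x00 byte appended:

```java
import java.util.Arrays;

class ScanResume {
    // Smallest row key strictly greater than lastRow: append one 0x00 byte.
    // When reopening, this is what you would pass as the new scan's start
    // row (e.g. via Scan.setStartRow(...) in the HBase client).
    static byte[] nextStartRow(byte[] lastRow) {
        // copyOf zero-fills the extra slot, so the appended byte is 0x00.
        return Arrays.copyOf(lastRow, lastRow.length + 1);
    }
}
```

The call site would look something like `new Scan().setStartRow(ScanResume.nextStartRow(lastRow))` (sketched; exact scan-construction details depend on the client version).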
>>>>>>>>
>>>>>>>> 2010/9/22 Ryan Rawson <[EMAIL PROTECTED]>:
>>>>>>>>> Ah ok, I think I get it... basically at this point your scanner is
>>>>>>>>> bad and iterating on it again won't work.  The scanner should
>>>>>>>>> probably close itself so you don't get tons of additional
>>>>>>>>> exceptions, but instead we don't.
>>>>>>>>>
>>>>>>>>> there is probably a better fix for this, i'll ponder
>>>>>>>>>
>>>>>>>>> On Wed, Sep 22, 2010 at 1:57 AM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
>>>>>>>>>> very strange... looks like a bad block ended up in your scanner and
>>>>>>>>>> subsequent nexts were failing due to that short read.
>>>>>>>>>>
>>>>>>>>>> did you have to kill the regionserver or did things recover and
>>>>>>>>>> continue normally?
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2010 at 1:37 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: