RE: Blocks are getting corrupted under very high load
Yes, Todd, the block after the restart is smaller, and its genstamp is also lower.
   Here a complete machine reboot happened. The boards are configured so that if a task gets no CPU cycles for 480 seconds, the machine reboots itself:
   kernel.hung_task_timeout_secs = 480
   Due to this timeout, the DN machine rebooted automatically.
Regards,
Uma
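
For reference, the reboot behaviour described above corresponds to the Linux hung-task detector. A minimal sysctl sketch follows; only the 480-second timeout is confirmed in the mail, while the panic and auto-reboot values are assumptions about how such boards are typically configured:

    # /etc/sysctl.conf -- sketch; only hung_task_timeout_secs = 480 is
    # confirmed above, the panic/reboot values are assumptions.
    # Flag a task stuck in uninterruptible sleep for 480 seconds:
    kernel.hung_task_timeout_secs = 480
    # Panic on such a hung task instead of only logging it:
    kernel.hung_task_panic = 1
    # Reboot 10 seconds after a panic:
    kernel.panic = 10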

________________________________________
From: Todd Lipcon [[EMAIL PROTECTED]]
Sent: Wednesday, November 23, 2011 2:08 PM
To: [EMAIL PROTECTED]
Subject: Re: Blocks are getting corrupted under very high load

I noticed that the reported block after restart is also much smaller
than the block reported earlier.

Any chance that when your DN restarted it actually lost power? I.e., do you
mean that just the DN JVM restarted, or that the whole machine crashed
and restarted?

On Tue, Nov 22, 2011 at 9:17 PM, Uma Maheswara Rao G
<[EMAIL PROTECTED]> wrote:
> Hi Todd,
> Thanks a lot for taking a look.
>
> Yes, I also suspected the same initially. But after analysing the logs, we found the clienttrace entries below in the DN log, which means block finalization completed.
> ./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  DataNode.clienttrace (BlockReceiver.java:run(1130)) - src: /xx.xx.132.26:55882, dest: /xx.xx.132.22:10010, bytes: 255954944, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_827638122_13, srvID: DS-1518903564-158.1.132.22-10010-1321492867433, blockid: blk_1321803251510_85379
> ./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  datanode.DataNode (BlockReceiver.java:run(1185)) - PacketResponder 0 for block blk_1321803251510_85379 terminating
>
>
> blk_1321803251510_85379 carries the most recent generation timestamp. Also, the NN logs clearly say that addStoredBlock was called for this block ID.
>
> Below are the logs:
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,836 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2415)) - commitBlockSynchronization(lastblock=blk_1321803251510_83627, newgenerationstamp=85379, newlength=246505984, newtargets=[xx.xx.132.22:10010], closeFile=false, deleteBlock=false)
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,869 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2488)) - commitBlockSynchronization(blk_1321803251510_85379) successful
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  namenode.FSNamesystem (FSNamesystem.java:addStoredBlock(3708)) - Inconsistent size for block blk_1321803251510_85379 reported from xx.xx.132.22:10010 current size is 246505984 reported size is 255954944
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  hdfs.StateChange (FSNamesystem.java:addStoredBlock(3800)) - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_1321803251510_85379 on xx.xx.132.22:10010 size 255954944
>
>
> After DN restart:
> ./hadoop-root-HANameNode-xx-xx-132-27.log.7:2011-11-20 18:55:54,844 INFO  hdfs.StateChange (FSNamesystem.java:rejectAddStoredBlock(3520)) - BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1321803251510_83627 on xx.xx.132.22:10010 size 104428544 but was rejected: Reported as block being written but is a block of closed file.
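
To make the staleness comparison concrete: HDFS block names encode the block ID and generation stamp as blk_<blockId>_<generationStamp>, so the replica reported after the reboot (genstamp 83627, 104428544 bytes) can be recognised as older than the finalized one (genstamp 85379, 255954944 bytes). A small self-contained Java sketch, not HDFS source (assumes Java 16+ for the record syntax):

    // Sketch only: parse the blk_<id>_<genstamp> names from the logs above
    // and compare generation stamps to spot the stale replica.
    public class GenStampCheck {
        record BlockId(long id, long genStamp) {
            static BlockId parse(String name) {
                String[] parts = name.split("_");  // "blk", id, genstamp
                return new BlockId(Long.parseLong(parts[1]), Long.parseLong(parts[2]));
            }
        }

        public static void main(String[] args) {
            BlockId finalized = BlockId.parse("blk_1321803251510_85379"); // clienttrace log
            BlockId rebooted  = BlockId.parse("blk_1321803251510_83627"); // after DN reboot

            boolean sameBlock = finalized.id() == rebooted.id();
            boolean stale = rebooted.genStamp() < finalized.genStamp();
            System.out.printf("same block: %b, rebooted replica stale: %b%n", sameBlock, stale);
            // prints: same block: true, rebooted replica stale: true
        }
    }

A lower generation stamp marks a replica left behind by an earlier pipeline, which is consistent with the NameNode refusing it for the already-closed file.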
>
> Regards,
> Uma
>
> ________________________________________
> From: Todd Lipcon [[EMAIL PROTECTED]]
> Sent: Wednesday, November 23, 2011 6:27 AM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Blocks are getting corrupted under very high load
>
> Can you look on the DN in question and see whether it was successfully
> finalized when the write finished? It doesn't sound like a successful
> write -- should have moved it out of the bbw directory into current/
>
> -Todd
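
For readers unfamiliar with the 0.20-append disk layout Todd mentions: an in-progress replica sits in the blocksBeingWritten (bbw) directory, and finalizing it moves the block file plus its .meta file into current/. A rough Java sketch of that on-disk step, not the actual DataNode code (paths and names here are hypothetical):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Sketch of "finalizing" a replica on disk: move blk_<id> and
    // blk_<id>_<genstamp>.meta out of blocksBeingWritten into current/.
    public class FinalizeSketch {
        static void finalizeBlock(Path volume, String blockName, long genStamp)
                throws IOException {
            Path bbw = volume.resolve("blocksBeingWritten");
            Path current = volume.resolve("current");
            Files.createDirectories(current);
            String meta = blockName + "_" + genStamp + ".meta";
            // ATOMIC_MOVE keeps each rename all-or-nothing; a replica still
            // sitting in bbw after a reboot is exactly what the NameNode
            // rejected in the log above.
            Files.move(bbw.resolve(blockName), current.resolve(blockName),
                    StandardCopyOption.ATOMIC_MOVE);
            Files.move(bbw.resolve(meta), current.resolve(meta),
                    StandardCopyOption.ATOMIC_MOVE);
        }

        public static void main(String[] args) throws IOException {
            // Hypothetical volume root; real DNs use dfs.data.dir.
            finalizeBlock(Path.of("/data/dfs/data"), "blk_1321803251510", 85379L);
        }
    }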
>
> On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
> <[EMAIL PROTECTED]> wrote:
>> Hi All,
>>
>>
>>
>> I have backported HDFS-1779 to our Hadoop version, which is based on the 0.20-append branch.
>>
>> We are running a load test, as usual. (We want to ensure the reliability of the system under heavy loads.)

Todd Lipcon
Software Engineer, Cloudera