Re: how to handle the corrupt block in HDFS?
I have only a 1-node cluster, so I am not able to verify this when the
replication factor is greater than 1.

I ran fsck on a file that consists of 3 blocks, where 1 block has a
corrupt replica. fsck reported that the filesystem was HEALTHY.

When I restarted the DN, the block scanner (BlockPoolSliceScanner)
started and detected the corrupt replica. I then ran fsck again on that
file, and it reported that the filesystem was CORRUPT.
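
For reference, this is roughly the sequence I used (a sketch:
/user/adam/testfile is only an example path, and the datanode restart
command depends on how your cluster is managed):

$ hdfs fsck /user/adam/testfile -files -blocks -locations   # reports HEALTHY
$ hadoop-daemon.sh stop datanode
$ hadoop-daemon.sh start datanode
$ hdfs fsck /user/adam/testfile -files -blocks -locations   # now reports CORRUPT

Note that without a restart the block scanner re-checks each replica only
every dfs.datanode.scan.period.hours (504 hours, i.e. three weeks, by
default), which may be why a corrupt replica can sit undetected for days.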

If you have a small (and non-production) cluster, can you restart your
datanodes and run fsck again?

2013/12/11 ch huang <[EMAIL PROTECTED]>

> Thanks for the reply, but if a block has just 1 corrupt replica, hdfs fsck
> cannot tell you which block of which file has the corrupted replica. fsck
> is only useful when all of a block's replicas are bad.
>
> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <[EMAIL PROTECTED]> wrote:
>
>> When you identify a file with corrupt block(s), you can locate the
>> machines that store its blocks by typing
>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
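>>
>> If you do not yet know which file is affected, one blunt option (just a
>> sketch) is to run fsck over everything and filter its per-file output
>> for corrupt entries:
>> $ sudo -u hdfs hdfs fsck / -files -blocks -locations | grep -i corrupt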
>>
>>
>> 2013/12/11 Adam Kawa <[EMAIL PROTECTED]>
>>
>>> Maybe this can work for you:
>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>>
>>>
>>> 2013/12/11 ch huang <[EMAIL PROTECTED]>
>>>
>>>> Thanks for the reply. What I do not know is how to locate the block
>>>> which has the corrupt replica (so I can observe how long it takes for
>>>> the corrupt replica to be removed and replaced by a new healthy one,
>>>> because I have been getting Nagios alerts for three days. I am not sure
>>>> whether it is the same corrupt replica causing the alerts, and I do not
>>>> know at what interval HDFS checks for corrupt replicas and cleans them up).
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>>  Hi ch huang,
>>>>>
>>>>>
>>>>>
>>>>> It may seem strange, but the fact is:
>>>>>
>>>>> *CorruptBlocks* through JMX means the *“number of blocks with corrupt
>>>>> replicas”*; not all of a block's replicas are necessarily corrupt. You
>>>>> can check the metric's description through jconsole.
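>>>>>
>>>>> For example, the same figure can be read without jconsole from the
>>>>> NameNode's JMX servlet (the hostname is illustrative; 50070 is the
>>>>> usual NameNode HTTP port) by looking at the CorruptBlocks attribute
>>>>> in the JSON it returns:
>>>>>
>>>>> $ curl 'http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'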
>>>>>
>>>>>
>>>>>
>>>>> Whereas *Corrupt blocks* through fsck means *blocks with all replicas
>>>>> corrupt (non-recoverable) or missing*.
>>>>>
>>>>>
>>>>>
>>>>> In your case, maybe one of the replicas is corrupt, but not all
>>>>> replicas of the same block. The corrupt replica will be deleted
>>>>> automatically if one more datanode is available in your cluster and
>>>>> the block is re-replicated to it.
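>>>>>
>>>>> As a sketch, if you do have a spare datanode, you can nudge this along
>>>>> by temporarily raising the replication factor of the affected file
>>>>> (the path below is a placeholder), so that a healthy replica is
>>>>> created elsewhere:
>>>>>
>>>>> $ hdfs dfs -setrep -w 2 /path/to/affected/file
>>>>>
>>>>> and then set it back once the corrupt replica has been cleaned up.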
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regarding replication 10: as Peter Marron said, *some of the important
>>>>> files of the MapReduce job are written with a replication factor of 10
>>>>> to make them accessible faster and to launch map tasks faster.*
>>>>>
>>>>> Anyway, if the job succeeds these files will be deleted automatically.
>>>>> I think that only in some cases, if the job is killed in between, will
>>>>> these files remain in HDFS showing under-replicated blocks.
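>>>>>
>>>>> If those leftover under-replicated blocks are noisy in your
>>>>> monitoring, the replication used for job files is configurable.
>>>>> Depending on the version, the property is mapred.submit.replication
>>>>> (MRv1) or mapreduce.client.submit.file.replication (MRv2); for
>>>>> example, in mapred-site.xml:
>>>>>
>>>>> <property>
>>>>>   <name>mapreduce.client.submit.file.replication</name>
>>>>>   <value>3</value>
>>>>> </property>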
>>>>>
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Vinayakumar B
>>>>>
>>>>>
>>>>>
>>>>> *From:* Peter Marron [mailto:[EMAIL PROTECTED]]
>>>>> *Sent:* 10 December 2013 14:19
>>>>> *To:* [EMAIL PROTECTED]
>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I am sure that there are others who will answer this better, but
>>>>> anyway.
>>>>>
>>>>> The default replication level for files in HDFS is 3, and so most
>>>>> files that you see will have a replication level of 3. However, when
>>>>> you run a Map/Reduce job the system knows in advance that every node
>>>>> will need a copy of certain files: specifically the job.xml and the
>>>>> various jars containing the classes that will be needed to run the
>>>>> mappers and reducers. So the system arranges for some of these files
>>>>> to have a higher replication level, which increases the chances that
>>>>> a copy will be found locally. By default this higher replication
>>>>> level is 10.
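>>>>>
>>>>> You can see this for yourself while a job is running: the second
>>>>> column of a file listing is the replication factor, and the job's
>>>>> staging directory (the path below is only indicative and varies
>>>>> between versions) holds these files:
>>>>>
>>>>> $ hadoop fs -ls /user/<username>/.staging/<job-id>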
>>>>>
>>>>>
>>>>>
>>>>> This can seem a little odd on a cluster where you only have, say, 3