kartheek muthyala 2011-11-29, 14:14
Uma Maheswara Rao G 2011-11-29, 14:40
kartheek muthyala 2011-11-29, 16:50
From: kartheek muthyala [[EMAIL PROTECTED]]
Sent: Tuesday, November 29, 2011 10:20 PM
To: [EMAIL PROTECTED]
Subject: Re: Generation Stamp
Uma, first of all, thanks for the detailed explanation and example.
So, to confirm: is the primary use of the generation stamp to ensure consistency of the block? When the pipeline fails at DN3 and the client invokes recovery, the NN will choose DN1 to complete the pipeline. DN1 first updates its metafile with the new generation stamp and then passes this information to the other replica at DN2. Later, the NN sees that this block is under-replicated, assigns some other DNa, and asks either DN1 or DN2 to replicate the block to DNa.
On Tue, Nov 29, 2011 at 8:10 PM, Uma Maheswara Rao G <[EMAIL PROTECTED]> wrote:
The generation stamp is basically used to keep track of replica states.
Consider one scenario where the generation stamp is used:
Create a file that has one block. The client starts writing that block to DN1, DN2, DN3 (the pipeline).
After some data has been written, DN3 fails, and the client gets an exception about the pipeline failure. The client handles that exception (see processDataNodeError in the DataStreamer thread): it removes DN3 and calls recovery for that block with a new generation stamp. The NN then chooses one primary DN and assigns it the block synchronization work. The primary DN ensures that all the remaining replicas have the same length (truncating to a consistent length if required) and invokes commitBlockSynchronization. The remaining data transfer then resumes.
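As a rough illustration of the synchronization step described above (not the actual HDFS code), here is a minimal Java sketch. The class BlockSyncSketch, the Replica type, and the synchronize method are hypothetical names made up for this example; the sketch only assumes that surviving replicas are truncated to the shortest length and stamped with the new generation stamp.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HDFS code: models what the primary DN does
// during block synchronization after a pipeline failure.
class BlockSyncSketch {
    // Simplified stand-in for a replica held by one datanode.
    static final class Replica {
        final String dataNode;
        long length;
        long genStamp;

        Replica(String dataNode, long length, long genStamp) {
            this.dataNode = dataNode;
            this.length = length;
            this.genStamp = genStamp;
        }
    }

    // Truncates every surviving replica to the shortest length and
    // assigns the new generation stamp. Returns the agreed length.
    static long synchronize(List<Replica> survivors, long newGenStamp) {
        if (survivors.isEmpty()) {
            throw new IllegalArgumentException("no surviving replicas");
        }
        long minLength = Long.MAX_VALUE;
        for (Replica r : survivors) {
            minLength = Math.min(minLength, r.length);
        }
        for (Replica r : survivors) {
            r.length = minLength;     // truncate to the consistent length
            r.genStamp = newGenStamp; // mark as belonging to the new generation
        }
        return minLength;
    }

    public static void main(String[] args) {
        // DN3 has already been dropped from the pipeline; DN1 and DN2 survive.
        List<Replica> survivors = new ArrayList<>();
        survivors.add(new Replica("DN1", 4096, 1233));
        survivors.add(new Replica("DN2", 3584, 1233));
        long agreed = synchronize(survivors, 1234);
        System.out.println("agreed length " + agreed
                + ", generation stamp " + survivors.get(0).genStamp);
    }
}
```

The lengths and the starting stamp 1233 are invented values; only the stamp 1234 follows the example in this thread.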
Now the block has a new generation stamp. You can observe this in the metadata file for that block on the DN.
The block files will now look like blk_12345634444 and blk_12345634444_1234.meta,
where 1234 is the generation stamp.
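To make the naming convention concrete, the following Java sketch pulls the block ID and generation stamp out of a metadata file name. MetaFileName and its methods are hypothetical helpers written for this example, not HDFS classes, and the pattern assumes the blk_<blockId>_<genStamp>.meta form shown above (with an optional minus sign on the block ID).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HDFS: parses metadata file names of
// the assumed form blk_<blockId>_<genStamp>.meta.
class MetaFileName {
    private static final Pattern META =
            Pattern.compile("blk_(-?\\d+)_(\\d+)\\.meta");

    private static Matcher match(String name) {
        Matcher m = META.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a block metadata file: " + name);
        }
        return m;
    }

    // Returns the block ID encoded in the file name.
    static long blockId(String name) {
        return Long.parseLong(match(name).group(1));
    }

    // Returns the generation stamp encoded in the file name.
    static long generationStamp(String name) {
        return Long.parseLong(match(name).group(2));
    }

    public static void main(String[] args) {
        String name = "blk_12345634444_1234.meta";
        System.out.println(blockId(name) + " " + generationStamp(name));
    }
}
```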
Now assume that, after the write resumes, DN2 fails. Recovery starts again and the block gets a new generation stamp again. Now only DN1 is in the pipeline, and the block files are blk_12345634444 and blk_12345634444_1235.meta. The client resumes the remaining data writes and completes the last packet. With the last packet, the block should be finalized. DN1 finalizes the block successfully and sends the block-received command, and the block info is updated in the blocks map. Now assume DN2 comes back and sends that old block in its block report to the NN. The NN can see that the generation stamp of that block is lower than the generation stamp of the block reported by DN1, so it can make a decision: it can reject the block with the lower generation stamp.
You can see this code in FSNamesystem#addStoredBlock. Of course, there are many other conditions, such as length mismatch, etc.
Hope this helps.
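The stale-replica decision can be sketched roughly as follows. This is not the logic of FSNamesystem#addStoredBlock, which checks many more conditions; BlocksMapSketch and its methods are hypothetical names, and the sketch assumes the NN only needs to compare a reported generation stamp against the latest one it has recorded for the block.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not HDFS code: the NN remembers the latest
// generation stamp per block and rejects reports that carry an older one.
class BlocksMapSketch {
    private final Map<Long, Long> latestGenStamp = new HashMap<>();

    // Records a finalized block reported by a datanode, keeping the
    // highest generation stamp seen for that block ID.
    void blockReceived(long blockId, long genStamp) {
        latestGenStamp.merge(blockId, genStamp, Long::max);
    }

    // Returns true if a reported replica is current, false if it is
    // unknown or carries a lower (stale) generation stamp.
    boolean acceptReported(long blockId, long reportedGenStamp) {
        Long known = latestGenStamp.get(blockId);
        return known != null && reportedGenStamp >= known;
    }

    public static void main(String[] args) {
        BlocksMapSketch nn = new BlocksMapSketch();
        nn.blockReceived(12345634444L, 1235L); // DN1 finalized with stamp 1235
        // DN2 comes back and reports its old replica with stamp 1234.
        System.out.println("DN2 replica accepted: "
                + nn.acceptReported(12345634444L, 1234L));
        System.out.println("DN1 replica accepted: "
                + nn.acceptReported(12345634444L, 1235L));
    }
}
```

Treating an unknown block as rejected is a simplification chosen for this sketch, not a statement about what the NN actually does in that case.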
From: kartheek muthyala [[EMAIL PROTECTED]]
Sent: Tuesday, November 29, 2011 7:44 PM
Subject: Generation Stamp
Why is there a concept of a generation stamp that gets tagged onto the metadata of a block? How is it useful? I have seen that in the HDFS current directory, the metafiles are tagged with this generation stamp. Does it keep track of versioning?
kartheek muthyala 2011-11-30, 04:07
Zhanwei Wang 2011-11-30, 11:04
Uma Maheswara Rao G 2011-11-30, 12:01