Re: File Integrity in HDFS
Harsh J 2012-05-02, 18:32
Far as I can tell, the file moves are atomic. See
I've used this approach at my former workplace, and am sure there's a
lot of people using the same approach without hitting a scenario you
Note that it's just the inode tree that's manipulated. The file itself,
in its completest sense, isn't "moved". It's just a rename, can't be
On Wed, May 2, 2012 at 9:20 PM, Stuti Awasthi <[EMAIL PROTECTED]> wrote:
> So let's consider a case where I copied the file from local to an HDFS temporary directory and then, after copying, executed a move to some Input dir. This takes a fraction of a second, but let's assume that my job is running on that Input folder at the moment the file is being moved, and it tries to access the half-moved file.
> Now what happens? Does HDFS throw some IOException, or will it leave the file unprocessed until the next job runs?
> -----Original Message-----
> From: Harsh J [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 01, 2012 6:11 PM
> To: [EMAIL PROTECTED]
> Subject: Re: File Integrity in HDFS
> Yes, renames/moves are merely metadata changes, like on your local filesystem (unless you move across partitions/disks, a concept that wouldn't apply to a DFS).
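To illustrate the local-filesystem half of that comparison, here is a small Python sketch. It only demonstrates the local analogy, not HDFS itself, and it assumes a POSIX-style filesystem where both names live in the same directory. The function name `rename_keeps_inode` and the file names are hypothetical.

```python
import os
import tempfile


def rename_keeps_inode(directory):
    """Return True if renaming a file inside `directory` keeps its inode.

    Hypothetical helper for illustration: an unchanged inode means only
    the directory entry (metadata) changed and no data was copied.
    """
    old_path = os.path.join(directory, "part-0000.tmp")
    new_path = os.path.join(directory, "part-0000")

    # Write some data so there is something that *could* have been copied.
    with open(old_path, "wb") as f:
        f.write(b"x" * (1024 * 1024))

    inode_before = os.stat(old_path).st_ino
    os.rename(old_path, new_path)
    inode_after = os.stat(new_path).st_ino

    return inode_before == inode_after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(rename_keeps_inode(root))
```

The time taken by the rename does not depend on the file size, which matches the "instant move" observation in the question.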
> On Tue, May 1, 2012 at 5:53 PM, Stuti Awasthi <[EMAIL PROTECTED]> wrote:
>> Thanks Harsh,
>> I also noticed that copying from local to HDFS, or from HDFS to HDFS, takes considerable time depending on file size, but a move within HDFS is done instantly.
>> So internally, does HDFS just rename the file and update its metadata?
>> -----Original Message-----
>> From: Harsh J [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, May 01, 2012 5:22 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: File Integrity in HDFS
>> The easiest way out would be to rename files to a pick-up-able name upon successful copy, or to have the loading done to a different directory and rename/move the file into the job input directory once it has been successfully closed.
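A minimal sketch of the second option (load elsewhere, then rename), written in Python against the local filesystem as a stand-in for HDFS. The names `publish`, `staging_dir`, and `input_dir` are hypothetical; on a real cluster the same two steps would be a copy into a staging path followed by a rename into the job input directory.

```python
import os
import shutil
import tempfile


def publish(src_path, staging_dir, input_dir):
    """Copy src_path into staging_dir, then rename it into input_dir.

    Hypothetical helper: the slow copy happens where no job is looking,
    and the file only appears in input_dir through a single rename.
    """
    name = os.path.basename(src_path)
    staged = os.path.join(staging_dir, name)
    final = os.path.join(input_dir, name)

    # Slow step: bytes are written outside the job's input directory.
    shutil.copyfile(src_path, staged)

    # Fast step: a rename. Locally this is metadata-only as long as both
    # directories are on the same filesystem (an assumption of this sketch).
    os.replace(staged, final)
    return final


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "staging")
        inputs = os.path.join(root, "input")
        os.makedirs(staging)
        os.makedirs(inputs)

        src = os.path.join(root, "data.txt")
        with open(src, "w") as f:
            f.write("line 1\nline 2\n")

        publish(src, staging, inputs)
        print(os.listdir(inputs))
        print(os.listdir(staging))
```

With this layout a job that lists the input directory never sees a file whose copy is still in progress, because the file is not in that directory until the copy has finished.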
>> On Tue, May 1, 2012 at 3:22 PM, Stuti Awasthi <[EMAIL PROTECTED]> wrote:
>>> Hi All,
>>> I have a scenario in which Input files are copied to HDFS and MR jobs
>>> run on the input directory.
>>> Now it can happen that a file is still being copied to HDFS when an
>>> MR job starts. In that case I do not want my MR job to pick up
>>> files whose copy has not yet completed.
>>> Is there any way/API to know whether a file has been completely
>>> written to HDFS?
>>> Stuti Awasthi
>>> HCL Comnet Systems and Services Ltd
>>> F-8/9 Basement, Sec-3,Noida.