Distcp reliability issue
David Rosenstrauch 2013-02-28, 06:12
I've run into an issue with the reliability of distcp. Specifically, I
have a distcp job that seems to have not copied over a few files - and
yet didn't fail the job. I was hoping someone here might have some ideas.
So I ran a distcp job. (Copying from one Amazon S3 bucket to another.)
The job did have some task failures - but any task that failed
eventually got re-run successfully. The job as a whole completed
successfully, and seemed to think that all files were copied successfully.
However, I re-ran the distcp again afterwards just to make sure
everything copied over successfully, since it's important data. (And
also because I had canceled an earlier run of the same distcp, and I
wanted to make sure that didn't screw anything up.) And although the re-run
of the distcp skipped over most of the files (like it should) it
actually wound up copying 7 files - i.e., 7 files that didn't get copied
in the first job. This obviously shouldn't have happened, as it should
have copied over all of the files in the first run, and the second run
should have copied zero.
I have task logs (and job counters) saved that show all of this.
I think I remember a colleague of mine from a previous job running into
a situation like this before, where he wound up having to run distcp
jobs twice in order to reliably ensure that all files copied
successfully. But I don't know what (if anything) he eventually did to
work around the issue.
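Rather than blindly running the copy twice, the kind of check I have in mind is comparing a listing of the source against a listing of the destination after the job finishes. Here's a rough sketch of that idea, nothing more. It assumes you've already dumped each bucket's listing into a dict of relative path -> size in bytes (how you produce that listing is up to you), and the function name and sample paths are made up purely for illustration.

```python
def find_copy_discrepancies(source_listing, dest_listing):
    """Compare two {relative_path: size_in_bytes} listings.

    Returns (missing, size_mismatch):
      missing       - paths in the source that are absent from the destination
      size_mismatch - paths present in both whose sizes differ
    """
    missing = sorted(p for p in source_listing if p not in dest_listing)
    size_mismatch = sorted(
        p for p, size in source_listing.items()
        if p in dest_listing and dest_listing[p] != size
    )
    return missing, size_mismatch


# Hypothetical listings, for illustration only.
source = {
    "logs/part-00000": 1024,
    "logs/part-00001": 2048,
    "logs/part-00002": 512,
}
dest = {
    "logs/part-00000": 1024,
    "logs/part-00002": 100,  # present, but a different size than the source
}

missing, size_mismatch = find_copy_discrepancies(source, dest)
print("missing:", missing)
print("size mismatch:", size_mismatch)
```

A size comparison only catches missing or truncated files, not corrupted contents, so treat it as a sanity check and not as proof that the copy is good.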
Anyone ever run into this before and/or have any pointers to discussions
about this issue or a solution? (Or even info about any home-grown
solution you've used to work around this.) Google didn't turn up much.