Ran into a puzzling - and worrisome - issue late last night.
I was running a Hadoop streaming job, which reads input from 2 different
buckets in Amazon S3 (using s3n://). When the job completed, I realized
that the number of "map input records" was incorrect. (Several thousand
fewer than it should have been.) So I re-ran the job, and again got an
incorrect (and different!) map input record count. I eventually wound up
running the job 4 different times (on 2 different Hadoop clusters at
EC2) and got 4 different input record counts!
I eventually tried distcp'ing the files from S3 down to the local HDFS
and re-ran the job off of HDFS, and then it worked fine. But the fact
that there were evidently silent I/O failures that I can't explain is
very troubling.
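For reference, the distcp-to-HDFS workaround looked roughly like the
sketch below. (Bucket names, paths, and the mapper/reducer scripts are
placeholders, not my actual job; adjust for your setup.)

```shell
# Copy the S3 input into HDFS first (paths are hypothetical).
hadoop distcp s3n://my-bucket-1/input/ hdfs:///data/input1/
hadoop distcp s3n://my-bucket-2/input/ hdfs:///data/input2/

# Then point the streaming job at the local HDFS copies instead of s3n://
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input hdfs:///data/input1/ \
  -input hdfs:///data/input2/ \
  -output hdfs:///data/output/ \
  -mapper my_mapper.py \
  -reducer my_reducer.py
```

Run that way, the "map input records" counter came out correct every
time.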
This issue appears to be intermittent, as I just re-ran the same job
twice in a row today and got the correct answer both times.
There's definitely nothing on my end that could explain this. Each time
I ran the exact same code against the exact same data. (Data which
hasn't changed in several weeks.)
It almost appears that, under certain conditions, reading from S3 using
s3n (i.e., NativeS3FileSystem) can result in a premature EOF.
I googled around, though, and didn't see anything that could explain this.
Anyone have any ideas what might be going on here and/or how to work around it?
I wouldn't care so much if a Hadoop task (or even an entire job) failed
outright due to premature EOFs when reading from S3. But silent failures
like this that result in incorrect output are an unacceptable situation.