I have an iterative MapReduce job that I run over 35 GB of data repeatedly.
The output of the first job is the input to the second one and it goes on
like that until convergence.
I am seeing a strange behavior with the program run time. The first
iteration takes 4 minutes to run and here is how the counters look:
The second iteration takes 15 minutes and here is how the counters look in
I cannot explain these numbers because the first iteration - to begin with
- should only generate approximately 35 GB of output. When I check the
output size using
hadoop fs -dus
I can confirm that it is indeed 35 GB. But for some reason
HDFS_BYTES_WRITTEN shows 45 GB. Then the input to the second iteration
should be 35 GB (or even 45GB considering the counter value)
but HDFS_BYTES_READ shows 144 GB.
All following iterations produce similar counter values to the second one
and they roughly take 15 min each. My dfs replication factor is set to 1
and there is no compression turned on. All input and outputs are in
SequenceFile format. The initial input is a sequence file that I generated
locally using SequenceFile.Writer but I use the default values and as far
as I know compression should be turned off. Am I wrong?
Thanks in advance.