The counters you're looking at are counted at the FileSystem interface
level, not at the more specific Task level (which have "map input
bytes" and such).
This means that if your map or reduce code is opening side-files/using
a FileSystem object to read extra things, the count will go up as
For simple input and output size validation of a job, minus anything
the code does on top, its better to look at "map/reduce input/output
bytes" form of counters instead.
On Tue, May 14, 2013 at 10:41 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
> I have an iterative MapReduce job that I run over 35 GB of data repeatedly.
> The output of the first job is the input to the second one and it goes on
> like that until convergence.
> I am seeing a strange behavior with the program run time. The first
> iteration takes 4 minutes to run and here is how the counters look:
> HDFS_BYTES_READ 34,860,867,377
> HDFS_BYTES_WRITTEN 45,573,255,806
> The second iteration takes 15 minutes and here is how the counters look in
> this case:
> HDFS_BYTES_READ 144,563,459,448
> HDFS_BYTES_WRITTEN 49,779,966,388
> I cannot explain these numbers because the first iteration - to begin with -
> should only generate approximately 35 GB of output. When I check the output
> size using
> hadoop fs -dus
> I can confirm that it is indeed 35 GB. But for some reason
> HDFS_BYTES_WRITTEN shows 45 GB. Then the input to the second iteration
> should be 35 GB (or even 45GB considering the counter value)
> but HDFS_BYTES_READ shows 144 GB.
> All following iterations produce similar counter values to the second one
> and they roughly take 15 min each. My dfs replication factor is set to 1 and
> there is no compression turned on. All input and outputs are in SequenceFile
> format. The initial input is a sequence file that I generated locally using
> SequenceFile.Writer but I use the default values and as far as I know
> compression should be turned off. Am I wrong?
> Thanks in advance.