Thanks for the useful feedback. You were right: my map tasks open
additional files from HDFS. The catch was that thousands of map tasks
were being created, and each of them repeatedly read the same files from
HDFS, which ultimately dominated the job execution time. I adjusted the
minimum split size for the job, which reduced the number of map tasks
spawned by the master node.
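
For anyone hitting the same thing, here is a rough sketch of the arithmetic
behind that change. It is not Hadoop code: it only mirrors the split-size
rule that FileInputFormat applies, max(minSize, min(maxSize, blockSize)).
The 64 MB block size, the 512 MB minimum split size, the 200 MB of side
files per task, and the treatment of the input as a single file are all
assumptions made for illustration; only the 34,860,867,377-byte input size
comes from the counters quoted below.

```java
public class SplitEstimate {

    // Same rule FileInputFormat uses to pick a split size.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (map tasks) for one file.
    // Ignores the small slack Hadoop allows on the final split.
    static long numSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    // Extra HDFS bytes read when every map task rereads the same side files.
    static long sideFileBytes(long numTasks, long sideBytesPerTask) {
        return numTasks * sideBytesPerTask;
    }

    public static void main(String[] args) {
        long inputBytes = 34_860_867_377L;      // HDFS_BYTES_READ, first iteration
        long blockSize = 64L * 1024 * 1024;     // assumed HDFS block size
        long sidePerTask = 200L * 1024 * 1024;  // hypothetical side-file volume per task

        // Hypothetical minimum split sizes: effectively unset, then 512 MB.
        long[] minSizes = {1L, 512L * 1024 * 1024};
        for (long minSize : minSizes) {
            long size = splitSize(minSize, Long.MAX_VALUE, blockSize);
            long tasks = numSplits(inputBytes, size);
            System.out.println("minSize=" + minSize
                    + " splitSize=" + size
                    + " mapTasks=" + tasks
                    + " sideFileBytes=" + sideFileBytes(tasks, sidePerTask));
        }
    }
}
```

Since the side-file cost grows linearly with the number of map tasks,
fewer and larger splits shrink it directly. Note that splits are computed
per input file, so a directory holding many small files still gets at
least one map task per file no matter how large the minimum split size is.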
On Thu, May 16, 2013 at 2:56 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Hi Jim,
> The counters you're looking at are counted at the FileSystem interface
> level, not at the more specific Task level (which has "map input
> bytes" and such).
> This means that if your map or reduce code is opening side-files/using
> a FileSystem object to read extra things, the count will go up as well.
> For simple input and output size validation of a job, minus anything
> the code does on top, it's better to look at "map/reduce input/output
> bytes" form of counters instead.
> On Tue, May 14, 2013 at 10:41 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
> > I have an iterative MapReduce job that I run over 35 GB of data.
> > The output of the first job is the input to the second one and it goes on
> > like that until convergence.
> > I am seeing a strange behavior with the program run time. The first
> > iteration takes 4 minutes to run and here is how the counters look:
> > HDFS_BYTES_READ 34,860,867,377
> > HDFS_BYTES_WRITTEN 45,573,255,806
> > The second iteration takes 15 minutes and here is how the counters look in
> > this case:
> > HDFS_BYTES_READ 144,563,459,448
> > HDFS_BYTES_WRITTEN 49,779,966,388
> > I cannot explain these numbers because the first iteration - to begin with -
> > should only generate approximately 35 GB of output. When I check the
> > size using
> > hadoop fs -dus
> > I can confirm that it is indeed 35 GB. But for some reason
> > HDFS_BYTES_WRITTEN shows 45 GB. Then the input to the second iteration
> > should be 35 GB (or even 45 GB considering the counter value),
> > but HDFS_BYTES_READ shows 144 GB.
> > All following iterations produce similar counter values to the second one
> > and they roughly take 15 min each. My dfs replication factor is set to 1 and
> > there is no compression turned on. All input and outputs are in
> > format. The initial input is a sequence file that I generated locally with
> > SequenceFile.Writer but I use the default values and as far as I know
> > compression should be turned off. Am I wrong?
> > Thanks in advance.
> Harsh J