Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
MapReduce >> mail # user >> Question on HDFS_BYTES_READ and HDFS_BYTES_WRITTEN


Copy link to this message
-
Question on HDFS_BYTES_READ and HDFS_BYTES_WRITTEN
I have an iterative MapReduce job that I run over 35 GB of data repeatedly.
The output of the first job is the input to the second one and it goes on
like that until convergence.

I am seeing a strange behavior with the program run time. The first
iteration takes 4 minutes to run and here is how the counters look:

HDFS_BYTES_READ                    34,860,867,377
HDFS_BYTES_WRITTEN               45,573,255,806

The second iteration takes 15 minutes and here is how the counters look in
this case:

HDFS_BYTES_READ                   144,563,459,448
HDFS_BYTES_WRITTEN               49,779,966,388

I cannot explain these numbers because the first iteration - to begin with
- should only generate approximately 35 GB of output. When I check the
output size using

hadoop fs -dus

I can confirm that it is indeed 35 GB. But for some reason
HDFS_BYTES_WRITTEN shows 45 GB. Then the input to the second iteration
should be 35 GB (or even 45GB considering the counter value)
but HDFS_BYTES_READ shows 144 GB.

All following iterations produce similar counter values to the second one
and they roughly take 15 min each. My dfs replication factor is set to 1
and there is no compression turned on. All input and outputs are in
SequenceFile format. The initial input is a sequence file that I generated
locally using SequenceFile.Writer but I use the default values and as far
as I know compression should be turned off.  Am I wrong?

Thanks in advance.
+
Harsh J 2013-05-16, 07:56
+
Jim Twensky 2013-05-17, 21:50
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB