MapReduce, mail # user - Calculation of BYTES_READ counter in TrackedRecordReader can give incorrect result


Calculation of BYTES_READ counter in TrackedRecordReader can give incorrect result
Subroto 2012-06-21, 22:53
Hi,

I have a RecordReader implementation that reads records asynchronously and caches them in memory (in a BlockingQueue).
When TrackingRecordReader asks for the next record, my RecordReader takes it from the queue and supplies it to the MapTask.
The TrackingRecordReader increments the BYTES_READ counter by:
bytesInCurr - bytesInPrev
where bytesInCurr is the FSStatistics bytes-read value after the call to next and bytesInPrev is the same value before the call.
Because the records have usually been read before next is called, bytesInCurr - bytesInPrev is zero most of the time, or some unrelated value if the asynchronous thread happens to be reading in the background.
Previously the BYTES_READ counter was derived from the getPos() method, which my RecordReader implemented correctly.
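
To make the effect concrete, here is a minimal, self-contained sketch of the two ways of counting. It does not use any Hadoop classes: PrefetchingReader, its getPos() and the AtomicLong standing in for the file system statistics are all hypothetical names made up for this illustration, and the sketch assumes the worst case where the background thread has finished before the first call to next.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BytesReadDeltaDemo {

    /** Hypothetical reader: a background thread reads ahead into a queue. */
    static class PrefetchingReader {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Thread fetcher;
        private long pos = 0; // bytes of the records handed out so far

        PrefetchingReader(List<String> source, AtomicLong fsBytesRead) {
            fetcher = new Thread(() -> {
                for (String record : source) {
                    // The stand-in statistic grows when data is fetched,
                    // not when the record is consumed.
                    fsBytesRead.addAndGet(size(record));
                    queue.add(record);
                }
            });
            fetcher.start();
        }

        /** Blocks until the background thread has cached every record. */
        void awaitPrefetch() {
            try {
                fetcher.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }

        String next() {
            try {
                String record = queue.take();
                pos += size(record);
                return record;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }

        long getPos() {
            return pos;
        }
    }

    static int size(String record) {
        return record.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Returns {counter from statistics deltas, counter from getPos deltas, bytes actually read}. */
    static long[] track(List<String> source) {
        AtomicLong fsBytesRead = new AtomicLong(); // stand-in for the file system statistic
        PrefetchingReader reader = new PrefetchingReader(source, fsBytesRead);
        reader.awaitPrefetch(); // assumed worst case: everything is cached before the first next()

        long byStatistics = 0;
        long byPosition = 0;
        for (int i = 0; i < source.size(); i++) {
            long bytesInPrev = fsBytesRead.get();
            long posPrev = reader.getPos();
            reader.next();
            byStatistics += fsBytesRead.get() - bytesInPrev; // bytesInCurr - bytesInPrev
            byPosition += reader.getPos() - posPrev;
        }
        return new long[] {byStatistics, byPosition, fsBytesRead.get()};
    }

    public static void main(String[] args) {
        long[] r = track(Arrays.asList("alpha", "beta", "gamma"));
        System.out.println("statistics delta=" + r[0] + " getPos delta=" + r[1] + " actual=" + r[2]);
    }
}
```

With the three sample records (14 bytes in total) the statistics-based counter stays at 0 while the getPos-based counter reaches 14. If the background thread were still running during the loop, the statistics-based value would be timing dependent instead.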

I would like opinions on whether the current way of calculating BYTES_READ in TrackingRecordReader is correct, since it forces the user to read records synchronously.

Please let me know if there is a workaround for getting correct statistics from the MR job.

Cheers,
Subroto Sanyal