Calculation of BYTES_READ counter in TrackedRecordReader can give incorrect result
Subroto 2012-06-21, 22:53
I have a RecordReader implementation which reads records asynchronously and caches them in memory (in a BlockingQueue).
When TrackingRecordReader asks for the next record, the internal RecordReader implementation takes it from the queue and supplies it to the MapTask.
The TrackingRecordReader increments the BYTES_READ counter by calculating:
bytesInCurr - bytesInPrev
where bytesInCurr is the FSStatistics bytes-read value after the call to next, and bytesInPrev is the same value before the call to next.
Because the records have usually been read already before next is called, bytesInCurr - bytesInPrev is zero most of the time, or some unrelated value if the asynchronous thread happens to be reading in the background during the call.
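To illustrate, here is a minimal standalone sketch of the effect, not the actual Hadoop code. The class name BytesReadDeltaDemo, the run method, and the AtomicLong standing in for the FSStatistics bytes-read value are all made up for this example. The prefetch thread is joined before the first read only to make the result deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BytesReadDeltaDemo {

    /** Returns {bytes actually read, bytes counted by the before/after delta}. */
    static long[] run(List<String> records) {
        // Hypothetical stand-in for the file system's bytes-read statistic.
        AtomicLong fsBytesRead = new AtomicLong();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Asynchronous reader: reads every record and caches it in the queue.
        Thread prefetcher = new Thread(() -> {
            for (String r : records) {
                fsBytesRead.addAndGet(r.getBytes(StandardCharsets.UTF_8).length);
                queue.add(r);
            }
        });
        prefetcher.start();
        try {
            // Assumption for determinism: everything is cached before the first next().
            prefetcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Tracking wrapper: counts bytes as the statistic's delta around each next().
        long counted = 0;
        for (int i = 0; i < records.size(); i++) {
            long bytesInPrev = fsBytesRead.get();
            queue.poll(); // "next": the record only comes out of the in-memory cache
            long bytesInCurr = fsBytesRead.get();
            counted += bytesInCurr - bytesInPrev;
        }
        return new long[] { fsBytesRead.get(), counted };
    }

    public static void main(String[] args) {
        long[] r = run(Arrays.asList("alpha", "beta", "gamma"));
        // prints "actually read = 14, counted = 0"
        System.out.println("actually read = " + r[0] + ", counted = " + r[1]);
    }
}
```

In this sketch all 14 bytes are read before the first call to next, so every delta is zero and the counted total stays at 0. Without the join, the counted value would depend on how the background thread interleaves with the calls.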
Earlier, the BYTES_READ counter was derived from the getPos() method, which my RecordReader handled correctly.
I would like to get opinions on whether the current way of calculating BYTES_READ in TrackingRecordReader is correct, since it forces the user to read records synchronously.
Please let me know if there is any workaround for getting correct statistics from the MR job.