FileInputFormat, FileSplit, and LineRecordReader: where are they run?
In order to get a better understanding of Hadoop, I've started reading
the source and have a question.
The FileInputFormat reads in files, divides them into splits of the
split size (which may be bigger than the block size), and creates FileSplits.
The FileSplits contain the start, length *and* the locations of the split.
The LineRecordReader receives a split and emits records.
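
The understanding above can be sketched in plain Java, with no Hadoop classes involved. This is only an illustration of how I read the source, not the actual implementation: `SplitSketch`, `Split`, and `getSplits` are hypothetical names, the per-block host lists are passed in by hand, and taking a split's hosts from the block that holds its first byte is an assumption that may not hold in every version.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // Hypothetical stand-in for a FileSplit: start offset, length, candidate hosts.
    static final class Split {
        final long start;
        final long length;
        final String[] hosts;

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Split size as I read it in the source: the goal size is capped by the
    // block size, and the configured minimum can push it above the block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Cut a file of fileLength bytes into consecutive splits of splitSize bytes.
    // blockHosts[i] lists the machines holding block i (replicas included).
    // Assumption: a split reports the hosts of the block containing its first byte.
    static List<Split> getSplits(long fileLength, long splitSize,
                                 long blockSize, String[][] blockHosts) {
        List<Split> splits = new ArrayList<>();
        long start = 0;
        while (start < fileLength) {
            long length = Math.min(splitSize, fileLength - start);
            int blockIndex = (int) (start / blockSize);
            splits.add(new Split(start, length, blockHosts[blockIndex]));
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 200-byte file stored in 64-byte blocks on made-up hosts.
        String[][] blockHosts = {
            {"hostA", "hostB"}, {"hostB", "hostC"}, {"hostC", "hostD"}, {"hostD", "hostA"}
        };
        // A minimum size of 128 forces splits larger than one block.
        long splitSize = computeSplitSize(100, 128, 64);
        for (Split s : getSplits(200, splitSize, 64, blockHosts)) {
            System.out.println(s.start + "+" + s.length + " on " + String.join(",", s.hosts));
        }
    }
}
```

With a split size of 128 and a block size of 64, each split in this sketch covers two blocks but carries only one block's host list, which is exactly the situation the second question below is about.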
So far I think I'm correct (hopefully). Now, my questions:
Does the LineRecordReader run on a machine that is, in some sense, closest to
the location of the splits? i.e.
Q1: If the split is less than the block size, then the split is
located on one machine (apart from replicas): does the
LineRecordReader run on the machine which contains the split? Or at
least attempt to?
Q2: If a split is greater than the block size, it spans multiple
blocks which could reside on more than one machine. In this case, on
which machine does the LineRecordReader run? The machine 'closest' to
Please correct me if I'm wrong.
Saptarshi Guha - [EMAIL PROTECTED]