Steve Lewis 2013-09-21, 19:30
(I'm assuming 1.0~ MR here)
On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> Classes implementing InputFormat implement
> public List<InputSplit> getSplits(JobContext job), which returns a List of
> InputSplits. For FileInputFormat the splits have a Path, a start and an end.
> 1) When is this method called, on which JVM on which machine, and is it
> called only once?
Called only on the client, i.e. your "hadoop jar" JVM. Called only once.
> 2) Does the number of Map tasks correspond to the number of splits returned by
Yes, number of split objects == number of mappers.
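
To make the one-split-per-mapper relationship concrete, here is a minimal plain-Java sketch of split planning. It is a model only, not the Hadoop API: the class SplitPlanSketch, the Range type and planSplits are hypothetical names, and the real FileInputFormat also takes block size and min/max split settings into account, which this ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of split planning; none of these names come from Hadoop.
class SplitPlanSketch {
    // Hypothetical stand-in for a file split: a byte offset and a length.
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Cut a file of fileLength bytes into ranges of at most splitSize bytes.
    static List<Range> planSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Range> splits = planSplits(250L, 100L);
        // One map task would be launched per element of this list.
        System.out.println("map tasks: " + splits.size());
    }
}
```

The list is computed once, up front, and its size fixes the number of map tasks; an empty input yields an empty list and therefore no mappers in this model.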
> 3) InputFormat implements a method
> RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext
> context ). Is this executed within the JVM of the Mapper on the slave
> machine, and does the RecordReader run within that JVM?
RecordReaders are not created on the client side JVM. RecordReaders
are created on the Map task JVMs, and run inside them.
> 4) The default RecordReaders read a file from the start position to the end
> position emitting values in the order read. With such a reader, assume it is
> reading lines of text, is it reasonable to assume that the values the mapper
> received are in the same order they were found in a file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)?
If you speak of the LineRecordReader, each map() call will simply receive
one line, i.e. the text up to the next \n. It is not language-aware to understand meaning of
You can, however, implement a custom reader to do this. There should be
no problems so long as your logic ensures that no record is read twice
across multiple maps.
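
As a rough illustration of such a reader, the plain-Java sketch below joins a line ending in "-" with the line that follows it and uses a simple ownership rule to avoid duplicate reads: a record belongs to the range that contains its first character, and a reader may read past its end offset to finish a record it owns. Everything here is an assumption made for illustration: the class and method names are hypothetical, offsets are character positions in a String rather than byte offsets in a file, only "\n" line endings are handled, and this is not how LineRecordReader itself is written.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a hyphen-joining reader over the range [start, end).
class HyphenJoiningReaderSketch {
    // Returns the logical records owned by the range [start, end) of text.
    static List<String> readRecords(String text, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Skip a partial first line and any continuation lines: those belong
        // to a record that started in an earlier range.
        while (pos < text.length() && !startsRecord(text, pos)) {
            pos = nextLineStart(text, pos);
        }
        // Read every record that starts before end, even if it finishes after it.
        while (pos < end && pos < text.length()) {
            StringBuilder record = new StringBuilder();
            while (true) {
                int newline = text.indexOf('\n', pos);
                int lineEnd = (newline < 0) ? text.length() : newline;
                String line = text.substring(pos, lineEnd);
                pos = (newline < 0) ? text.length() : newline + 1;
                if (line.endsWith("-") && pos < text.length()) {
                    // Drop the hyphen and keep reading the next line.
                    record.append(line, 0, line.length() - 1);
                } else {
                    record.append(line);
                    break;
                }
            }
            records.add(record.toString());
        }
        return records;
    }

    // True if pos begins a line that does not continue a hyphenated line.
    static boolean startsRecord(String text, int pos) {
        if (pos == 0) {
            return true;
        }
        if (text.charAt(pos - 1) != '\n') {
            return false; // pos is in the middle of a line
        }
        return pos < 2 || text.charAt(pos - 2) != '-';
    }

    // Offset of the first character after the next newline, or end of text.
    static int nextLineStart(String text, int pos) {
        int newline = text.indexOf('\n', pos);
        return (newline < 0) ? text.length() : newline + 1;
    }
}
```

Because ownership is decided by looking only at the characters just before a line start, two readers over adjacent ranges agree on who owns each record without coordinating, which is what keeps a record from being emitted by two maps.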
Steve Lewis 2013-09-23, 15:34
Harsh J 2013-09-24, 03:13