MapReduce, mail # user - A couple of Questions on InputFormat


A couple of Questions on InputFormat
Steve Lewis 2013-09-21, 19:30
Classes implementing InputFormat implement
 public List<InputSplit> getSplits(JobContext job), which returns a List of
InputSplits. For FileInputFormat, the splits have a path, a start and an end.
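To make clear what I mean by a split having a path, a start and an end, here is a minimal sketch in plain Java. It does not use any Hadoop classes: SimpleSplit and computeSplits are hypothetical names I made up, and the fixed-size cutting is a simplification, not necessarily what FileInputFormat actually does.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /** Hypothetical stand-in for a file split: a path plus a byte range [start, end). */
    static final class SimpleSplit {
        final String path;
        final long start;
        final long end;

        SimpleSplit(String path, long start, long end) {
            this.path = path;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Cuts a file of fileLength bytes into consecutive ranges of at most
     * splitSize bytes. This is an assumed, simplified rule for illustration only.
     */
    static List<SimpleSplit> computeSplits(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SimpleSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            splits.add(new SimpleSplit(path, start, end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A made-up path and sizes, just to show the shape of the result.
        for (SimpleSplit s : computeSplits("/data/input.txt", 250, 100)) {
            System.out.println(s.path + " [" + s.start + ", " + s.end + ")");
        }
    }
}
```

In this picture, each returned range would be handed to one reader, which is the model behind the questions below.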

1) When is this method called, in which JVM and on which machine, and is it
called only once?

2) Does the number of map tasks correspond to the number of splits returned by
getSplits?

3) InputFormat declares a method
 RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
context). Is this executed within the JVM of the Mapper on the slave
machine, and does the RecordReader run within that JVM?

4) The default RecordReaders read a file from the start position to the end
position, emitting values in the order read. With such a reader, assuming it
is reading lines of text, is it reasonable to assume that the values the
mapper receives are in the same order they were found in the file? Would it,
for example, be possible for WordCount to see a word that was hyphenated at
the end of one line and append to it the first word of the next line it
sees (ignoring the case where the word is at the end of a split)?
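
To illustrate question 4, here is a rough sketch of the joining logic I have in mind, written as plain Java with no Hadoop dependency. HyphenJoinSketch and joinHyphenated are hypothetical names. The sketch assumes the lines arrive in file order, which is exactly the assumption I am asking about; in a real Mapper the carried fragment would presumably live in an instance field between calls to map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HyphenJoinSketch {

    /**
     * Splits lines into words, joining a word that ends a line with '-'
     * to the first word of the next non-empty line.
     * Assumes the lines are supplied in the order they appear in the file.
     */
    static List<String> joinHyphenated(List<String> lines) {
        List<String> words = new ArrayList<>();
        String carry = null; // fragment held over from the previous line, if any
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank line: keep any carried fragment for the next line
            }
            String[] tokens = trimmed.split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                String token = tokens[i];
                if (i == 0 && carry != null) {
                    token = carry + token; // complete the hyphenated word
                    carry = null;
                }
                boolean lastOnLine = (i == tokens.length - 1);
                if (lastOnLine && token.length() > 1 && token.endsWith("-")) {
                    // Hold the fragment (without the hyphen) until the next line.
                    carry = token.substring(0, token.length() - 1);
                } else {
                    words.add(token);
                }
            }
        }
        if (carry != null) {
            words.add(carry); // input ended mid-word: emit the fragment as-is
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "a word that was hyphen-",
                "ated at the end");
        System.out.println(joinHyphenated(lines));
    }
}
```

This only works if the second line is seen immediately after the first by the same mapper, which is why the ordering matters to me.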