MapReduce >> mail # user >> A couple of Questions on InputFormat


Re: A couple of Questions on InputFormat
Hi,

(I'm assuming MR 1.0.x here.)

On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> Classes implementing InputFormat implement
>  public List<InputSplit> getSplits(JobContext job), which returns a List of
> InputSplits. For FileInputFormat the splits have a Path, a start and an end.
>
> 1) When is this method called, in which JVM on which machine, and is it
> called only once?

It is called only on the client, i.e. in your "hadoop jar" JVM, at job
submission time. It is called only once per job.

> 2) Does the number of map tasks correspond to the number of splits returned by
> getSplits?

Yes, number of split objects == number of mappers.
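
To make the split-to-mapper relationship concrete, here is a minimal
plain-Java sketch of the split arithmetic as I understand the new-API
FileInputFormat to do it for a single splittable file. It has no Hadoop
dependency. SplitSketch, Split and splitsFor are hypothetical names used
only for illustration, and the 64 MB block size is an assumed value, not
something read from a cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of FileInputFormat-style split arithmetic (no Hadoop classes). */
class SplitSketch {
    // Slack factor: the final split is allowed to be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Hypothetical stand-in for a file split: a start offset and a length. */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
        @Override public String toString() { return "[start=" + start + ", length=" + length + "]"; }
    }

    /** Clamp the block size between the configured minimum and maximum split size. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Cut one file of the given length into splits; each split becomes one map task. */
    static List<Split> splitsFor(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits.add(new Split(fileLength - remaining, splitSize));
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Whatever is left (at most 1.1 * splitSize) goes into the last split.
            splits.add(new Split(fileLength - remaining, remaining));
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(64L << 20, 1L, Long.MAX_VALUE); // assumed 64 MB blocks
        List<Split> splits = splitsFor(200L << 20, splitSize);            // a 200 MB file
        System.out.println(splits.size() + " splits, hence " + splits.size() + " map tasks: " + splits);
    }
}
```

With these assumed numbers a 200 MB file yields four splits (three of
64 MB and one of 8 MB), so four map tasks. A 70 MB file stays a single
split because of the 10% slack.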

> 3) InputFormat implements a method
>  RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
> context). Is this executed within the JVM of the Mapper on the slave
> machine, and does the RecordReader run within that JVM?

RecordReaders are not created in the client-side JVM. They are created
in the map task JVMs and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position emitting values in the order read. With such a reader, assume it is
> reading lines of text, is it reasonable to assume that the values the mapper
> received are in the same order they were found in a file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)?

If you mean the LineRecordReader, each map() call simply receives one
line, i.e. the text up to the next \n. It is not language-aware and does
not understand the meaning of hyphens, etc.
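
On ordering: within one map task the lines of a split arrive in file
order, but different splits are read by different map tasks, so there is
no ordering across the whole file. The sketch below is a plain-Java
simulation over a String, not Hadoop code, of the boundary rule as I
understand the 1.x LineRecordReader to apply it: a reader whose split does
not start at offset 0 backs up one byte and discards everything up to the
next newline, and every reader finishes the line it has started even if it
runs past its split end. LineSplitSketch, readSplit and readAllSplits are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates line reading over byte-range splits, using a String in place of a file. */
class LineSplitSketch {

    /** Returns the lines that the reader responsible for [start, end) would emit. */
    static List<String> readSplit(String data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Back up one character and throw away everything through the next newline.
            // If the split begins exactly at a line start, only the previous '\n' is dropped.
            int nl = data.indexOf('\n', start - 1);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        // Start a new line only while still inside the split, but always finish it.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            lines.add(data.substring(pos, lineEnd));
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        return lines;
    }

    /** Reads every fixed-size split in turn and concatenates what each one emits. */
    static List<String> readAllSplits(String data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\ndelta\n";
        // Split size 7 cuts through the middle of lines on purpose.
        System.out.println(readAllSplits(data, 7)); // prints [alpha, bravo, charlie, delta]
    }
}
```

Whatever split size is chosen, each line is emitted by exactly one reader,
which is why a line that straddles a split boundary is neither lost nor
seen twice.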

You can implement a custom reader to do this, however. There should be
no problems as long as your logic guarantees that no input is read by
more than one map task.
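
As a rough illustration of the joining logic such a custom reader could
wrap, here is a plain-Java sketch that works on lines already read within
one split. It is not a Hadoop RecordReader; HyphenJoinSketch and
joinHyphenated are hypothetical names. It treats every trailing "-" as a
line-break hyphen, which is an assumption and will also merge genuinely
hyphenated compounds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Merges a line ending in '-' with the line that follows it. */
class HyphenJoinSketch {

    static List<String> joinHyphenated(Iterable<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean dangling = false; // true while the last line seen ended with '-'
        for (String line : lines) {
            if (line.endsWith("-")) {
                // Drop the hyphen and hold the text until the next line arrives.
                pending.append(line, 0, line.length() - 1);
                dangling = true;
            } else {
                pending.append(line);
                out.add(pending.toString());
                pending.setLength(0);
                dangling = false;
            }
        }
        if (dangling) {
            // The input ended on a hyphenated line: emit it unchanged rather than guess.
            out.add(pending + "-");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "see a word that was hyphen-",
                "ated at the end of one line",
                "and an ordinary line");
        for (String record : joinHyphenated(lines)) {
            System.out.println(record);
        }
    }
}
```

In a real reader the awkward case is a hyphenated line that is the last
line of a split: the reader that owns it would have to read one more line
past its split end, and the reader of the next split would have to skip
that line, otherwise the continuation is read twice.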

--
Harsh J