

Re: A couple of Questions on InputFormat
Thank you for your thorough answer.
The last question is essentially this: while I can write a custom input
format to handle things like hyphens, I could do almost the same thing in
the mapper by saving any hyphenated word from the end of the previous line
(ignoring hyphenated words that cross a split boundary), as long as
LineRecordReader guarantees that each line in the split is sent to the same
mapper in the order read.
This seems to be the case, right?
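
A minimal sketch of the mapper-side approach, written in plain Java so it runs without the Hadoop classes. The class and method names are hypothetical: process() stands in for map(), finish() stands in for cleanup(), and the sketch assumes the lines of one split arrive in file order at a single instance.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a Mapper (hypothetical class, no Hadoop types).
class HyphenJoiningTokenizer {
    // Fragment carried over from the previous line, e.g. "hyphen" from "hyphen-".
    private String carry = "";

    // Plays the role of map(): called once per line, in the order read.
    List<String> process(String line) {
        List<String> words = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return words; // keep any pending fragment for the next non-empty line
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (i == 0 && !carry.isEmpty()) {
                token = carry + token; // glue the saved fragment onto the first word
                carry = "";
            }
            boolean lastOnLine = (i == tokens.length - 1);
            if (lastOnLine && token.length() > 1 && token.endsWith("-")) {
                // Hold the fragment back (without the hyphen) until the next line.
                carry = token.substring(0, token.length() - 1);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    // Plays the role of cleanup(): emit a fragment left over at the end of the split.
    List<String> finish() {
        List<String> words = new ArrayList<>();
        if (!carry.isEmpty()) {
            words.add(carry);
            carry = "";
        }
        return words;
    }
}
```

A fragment still pending when the split ends is emitted as-is by finish(), which matches ignoring words that cross a split boundary.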
On Mon, Sep 23, 2013 at 4:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi,
>
> (I'm assuming 1.0~ MR here)
>
> On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> > Classes implementing InputFormat implement
> >  public List<InputSplit> getSplits(JobContext job), which returns a List of
> > InputSplits. For FileInputFormat, the splits have a path, a start, and an end.
> >
> > 1) When is this method called, on which JVM on which machine, and is it
> > called only once?
>
> Called only on the client, i.e. your "hadoop jar" JVM. Called only once.
>
> > 2) Does the number of map tasks correspond to the number of splits
> > returned by getSplits?
>
> Yes, number of split objects == number of mappers.
>
> > 3) InputFormat implements a method
> >  RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
> > context). Is this executed within the JVM of the Mapper on the slave
> > machine, and does the RecordReader run within that JVM?
>
> RecordReaders are not created on the client side JVM. RecordReaders
> are created on the Map task JVMs, and run inside it.
>
> > 4) The default RecordReaders read a file from the start position to the
> > end position, emitting values in the order read. With such a reader,
> > assuming it is reading lines of text, is it reasonable to assume that the
> > values the mapper receives are in the same order they were found in the
> > file? Would it, for example, be possible for WordCount to see a word that
> > was hyphen-
> > ated at the end of one line and append the first word of the next line it
> > sees (ignoring the case where the word is at the end of a split)?
>
> With the LineRecordReader, each map() call simply receives one line,
> i.e. the text up to \n. It is not language-aware, so it does not
> understand the meaning of hyphens, etc.
>
> You can implement a custom reader to do this however - there should be
> no problems so long as your logic covers the case of not having any
> duplicate reads across multiple maps.
>
> --
> Harsh J
>
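
To make the "no duplicate reads across multiple maps" condition concrete, here is a rough plain-Java sketch of one common convention for line-oriented readers: a line belongs to the split that contains its first byte, so a reader that does not start at offset 0 skips ahead to the next line start, and every reader may run past its own end to finish its last line. This is a simplified illustration, not the actual Hadoop source. The class and method names are hypothetical, and the fixed-size split computation ignores the block-size and min/max split settings that FileInputFormat takes into account.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; no Hadoop types are used.
class SplitLineSketch {
    // Cut [0, fileLength) into byte ranges {start, end}; one map task per range.
    // Assumes splitSize > 0.
    static List<int[]> computeSplits(int fileLength, int splitSize) {
        List<int[]> splits = new ArrayList<>();
        for (int start = 0; start < fileLength; start += splitSize) {
            splits.add(new int[] {start, Math.min(start + splitSize, fileLength)});
        }
        return splits;
    }

    // Return the lines whose first byte lies in [start, end).
    static List<String> readLines(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Skip the tail of a line that began in the previous split;
            // the reader for that split is responsible for it.
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int newline = pos;
            // May scan past 'end' to finish the last line of this split.
            while (newline < data.length && data[newline] != '\n') {
                newline++;
            }
            lines.add(new String(data, pos, newline - pos, StandardCharsets.UTF_8));
            pos = newline + 1;
        }
        return lines;
    }
}
```

Calling readLines once per range returned by computeSplits and concatenating the results should yield every line of the input exactly once, in file order.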

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com