MapReduce >> mail # user >> A couple of Questions on InputFormat


Steve Lewis 2013-09-21, 19:30
Harsh J 2013-09-23, 11:30
Steve Lewis 2013-09-23, 15:34
Re: A couple of Questions on InputFormat
Hi,

Yes, that is right.
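
A minimal sketch of the mapper-side approach described in the quoted message, in case it helps. It is plain Java with no Hadoop dependency; `HyphenJoiner`, `accept` and `flush` are hypothetical names made up for illustration. The assumption is that in a real job the object would be held in a field of your Mapper, `accept` would be called from map(), and `flush` from cleanup(). It relies only on the lines of one split arriving in order, and, as in the question, it ignores words hyphenated across a split boundary.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Hadoop): joins a word hyphenated at the
 * end of one line with the first word of the next line. Correct only if
 * lines are passed in file order, as they are within a single split.
 */
class HyphenJoiner {
    // Fragment carried over from the previous line, trailing hyphen removed.
    private String carry = "";

    /** Accepts one line and returns the words that are now complete. */
    List<String> accept(String line) {
        List<String> words = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return words; // blank line: keep any pending fragment as-is
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (i == 0 && !carry.isEmpty()) {
                token = carry + token; // complete the word from the previous line
                carry = "";
            }
            boolean lastOnLine = (i == tokens.length - 1);
            if (lastOnLine && token.length() > 1 && token.endsWith("-")) {
                // Hold the fragment back until the next line arrives.
                carry = token.substring(0, token.length() - 1);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    /** Returns the pending fragment, if any, and clears it (end of split). */
    List<String> flush() {
        List<String> words = new ArrayList<>();
        if (!carry.isEmpty()) {
            words.add(carry);
            carry = "";
        }
        return words;
    }

    public static void main(String[] args) {
        HyphenJoiner joiner = new HyphenJoiner();
        System.out.println(joiner.accept("a word that was hyphen-"));
        System.out.println(joiner.accept("ated at the end of one line"));
        System.out.println(joiner.flush());
    }
}
```

Note that this treats every trailing hyphen as a line-break hyphen, so a line that genuinely ends in a dash would be joined to the next word as well.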

On Mon, Sep 23, 2013 at 9:04 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> Thank you for your thorough answer.
> The last question is essentially this: while I can write a custom input
> format to handle things like hyphens, I could do almost the same thing in
> the mapper by saving any hyphenated word from the end of the previous line
> (ignoring hyphenated words that cross a split boundary), as long as
> LineRecordReader guarantees that each line in the split is sent to the same
> mapper in the order read.
> This seems to be the case, right?
>
>
> On Mon, Sep 23, 2013 at 4:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>>
>> (I'm assuming 1.0~ MR here)
>>
>> On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>> > Classes implementing InputFormat implement
>> >  public List<InputSplit> getSplits(JobContext job), which returns a List
>> > of InputSplits. For FileInputFormat, the splits have a Path, a start and
>> > an end.
>> >
>> > 1) When is this method called, on which JVM on which machine, and is it
>> > called only once?
>>
>> Called only at a client, i.e. your "hadoop jar" JVM. Called only once.
>>
>> > 2) Does the number of map tasks correspond to the number of splits
>> > returned by getSplits?
>>
>> Yes, number of split objects == number of mappers.
>>
>> > 3) InputFormat implements a method
>> >  RecordReader<K,V> createRecordReader(InputSplit split,
>> >  TaskAttemptContext context). Is this executed within the JVM of the
>> > Mapper on the slave machine, and does the RecordReader run within that
>> > JVM?
>>
>> RecordReaders are not created on the client-side JVM. They are created
>> on the Map task JVMs, and run inside them.
>>
>> > 4) The default RecordReaders read a file from the start position to the
>> > end position, emitting values in the order read. With such a reader,
>> > assuming it is reading lines of text, is it reasonable to assume that the
>> > values the mapper receives are in the same order they were found in the
>> > file? Would it, for example, be possible for WordCount to see a word that
>> > was hyphenated at the end of one line and append the first word of the
>> > next line it sees (ignoring the case where the word is at the end of a
>> > split)?
>>
>> If you speak of the LineRecordReader, each map() call will simply receive
>> one line, i.e. the text up to the next \n. It is not language-aware and
>> does not understand the meaning of hyphens, etc.
>>
>> You can implement a custom reader to do this, however. There should be
>> no problems so long as your logic ensures that no record is read twice
>> across multiple maps.
>>
>> --
>> Harsh J
>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
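
On the point in the quoted reply about avoiding duplicate reads across maps, here is a simplified model of how a line-oriented reader can give each line to exactly one split. It is plain Java over a byte array, not Hadoop source code; `SplitLineReaderSketch`, `readSplit`, `nextLineStart` and `readAll` are hypothetical names. The rule it assumes: a reader whose split does not begin at offset 0 discards the first (possibly partial) line, and every reader finishes any line that starts at or before the end of its split. As far as I understand, LineRecordReader follows a similar idea, though the exact bookkeeping differs between releases.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Simplified model of line reading over byte-range splits; not Hadoop code. */
class SplitLineReaderSketch {

    /** Lines one reader would emit for the split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: skip up to and including the next newline.
            // The reader of the previous split emits that line instead.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it starts at or before 'end', so
        // the reader may run past 'end' to finish its last line.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Index just past the next '\n' at or after 'from', or data.length. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** Concatenates what each reader emits when using fixed-size splits. */
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            all.addAll(readSplit(data, start, length));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start < data.length; start += 10) {
            int length = Math.min(10, data.length - start);
            System.out.println("split at " + start + ": " + readSplit(data, start, length));
        }
    }
}
```

A custom reader that joins hyphenated lines would need an equivalent ownership rule for a record that spans a split boundary, so that exactly one map task emits it.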

--
Harsh J