Hadoop >> mail # user >> FileInputFormat, FileSplit, and LineRecordReader: where are they run?


FileInputFormat, FileSplit, and LineRecordReader: where are they run?
Hello All,
To get a better understanding of Hadoop, I've started reading
the source and have a question.
FileInputFormat reads in files, divides them into splits of a given
split size (which may be bigger than the block size), and creates FileSplits.
Each FileSplit contains the start, the length, *and* the locations of the split.
The LineRecordReader receives a split and emits records.
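
To make the splitting step concrete, here is a minimal sketch in plain Java
with no Hadoop dependency. The class SplitSketch, its Split type, and the
getSplits signature are hypothetical names made up for illustration. The
split-size rule max(minSize, min(maxSize, blockSize)) follows what the
new-API FileInputFormat does as far as I can tell from the source; the real
code also tolerates a slightly oversized last split, which is left out here.
Attaching the hosts of the block that contains the split's start is an
assumption about the policy and may differ between Hadoop versions.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /** Hypothetical stand-in for a FileSplit: start offset, length, and host names. */
    static final class Split {
        final long start;
        final long length;
        final String[] hosts;

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Split size is the block size, clamped to the range [minSize, maxSize]. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /**
     * Cuts a file of fileLength bytes into splits of splitSize bytes.
     * blockHosts[i] lists the hosts holding a replica of block i.
     * Assumption: each split is tagged with the hosts of the block
     * in which the split starts.
     */
    static List<Split> getSplits(long fileLength, long splitSize,
                                 long blockSize, String[][] blockHosts) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            int blockIndex = (int) (start / blockSize);
            splits.add(new Split(start, length, blockHosts[blockIndex]));
        }
        return splits;
    }

    public static void main(String[] args) {
        String[][] blockHosts = {{"nodeA"}, {"nodeB"}, {"nodeC"}};
        long splitSize = computeSplitSize(128, 1, Long.MAX_VALUE);
        for (Split s : getSplits(300, splitSize, 128, blockHosts)) {
            System.out.println(s.start + " +" + s.length + " @ " + String.join(",", s.hosts));
        }
    }
}
```

Raising minSize above the block size (for example 256 with 128-byte blocks)
gives splits that span more than one block, which is the case Q2 below asks about.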

So far I think I'm correct (hopefully). Now, my questions:
Does the LineRecordReader run on a machine that is, in some sense, closest to
the location of the split? That is:
Q1: If the split is smaller than the block size, then the split is
located on one machine (apart from replicas). Does the
LineRecordReader run on the machine that contains the split, or at
least attempt to?
Q2: If a split is larger than the block size, it spans multiple
blocks, which could reside on more than one machine. In this case, on
which machine does the LineRecordReader run? The machine 'closest' to
them?
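
One way to picture what 'closest' could mean is sketched below, again in
plain Java. This is not Hadoop's scheduler code: LocalitySketch, classify,
pickNode, and the rackOf map are all hypothetical. The sketch assumes that
the record reader runs inside the map task, so it runs wherever that task is
placed, and that placement prefers a node listed in the split's locations,
then a node on the same rack, then any other node. Bytes of the split stored
elsewhere would then be read over the network.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class LocalitySketch {
    /** Declared from best to worst, so a lower ordinal means closer to the data. */
    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    /** Classifies how close a candidate node is to the hosts listed in a split. */
    static Locality classify(String node, String[] splitHosts, Map<String, String> rackOf) {
        if (Arrays.asList(splitHosts).contains(node)) {
            return Locality.DATA_LOCAL;
        }
        String rack = rackOf.get(node);
        for (String host : splitHosts) {
            if (rack != null && rack.equals(rackOf.get(host))) {
                return Locality.RACK_LOCAL;
            }
        }
        return Locality.OFF_RACK;
    }

    /** Returns the free node closest to the split, or null if no node is free. */
    static String pickNode(String[] freeNodes, String[] splitHosts, Map<String, String> rackOf) {
        String best = null;
        Locality bestLocality = null;
        for (String node : freeNodes) {
            Locality locality = classify(node, splitHosts, rackOf);
            if (bestLocality == null || locality.compareTo(bestLocality) < 0) {
                best = node;
                bestLocality = locality;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("n1", "rack1");
        rackOf.put("n2", "rack1");
        rackOf.put("n3", "rack2");
        String[] splitHosts = {"n1"};
        // n1 is busy in this example, so only n2 and n3 are candidates.
        System.out.println(pickNode(new String[] {"n3", "n2"}, splitHosts, rackOf));
    }
}
```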

Please correct me if I'm wrong.
Thank you
Saptarshi
--
Saptarshi Guha - [EMAIL PROTECTED]