RE: Does FileSplit respect the record boundary?
Vinayakumar B 2012-02-11, 07:19
- The LineRecordReader will get the path in HDFS itself, not on the
But it's the NameNode that gives the list of DataNodes for a particular
block, sorted by distance from the client, i.e. here the machine where the Task
- For a line that ends in the next block, HDFS itself takes care of
getting the next block's information from the NameNode and handing the data
to the LineReader. The LineReader just continues reading without worrying
about the location of the block.
- One suggestion for better performance is to set the split size for
the job to the block size of the input file. If the split size is larger
than the block size, the Task may need to get the block data from multiple
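To make the split-size suggestion above concrete, here is a minimal plain-Java sketch (no Hadoop classes involved). The class name SplitSizeDemo and the helper blocksTouched are hypothetical names made up for this illustration. computeSplitSize follows the formula as I recall it from the old-API FileInputFormat in the 1.0.0 source, so treat it as an assumption and check it against the version you run. The sketch also ignores the few extra bytes a reader consumes past the end of its split to finish the last line.

```java
// Plain-Java illustration; none of this is Hadoop API.
class SplitSizeDemo {

    // Assumed to match the old-API FileInputFormat formula:
    // the split size is capped by the block size unless minSize forces it larger.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Hypothetical helper: how many HDFS blocks does the byte range
    // [splitStart, splitStart + splitLength) overlap?
    static long blocksTouched(long splitStart, long splitLength, long blockSize) {
        long firstBlock = splitStart / blockSize;
        long lastBlock = (splitStart + splitLength - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;

        // Split size equal to the block size: each split sits inside one block.
        System.out.println(blocksTouched(0, 64 * mb, blockSize));

        // Split size twice the block size: each split spans two blocks, so at
        // least part of the data may have to come from another DataNode.
        System.out.println(blocksTouched(0, 128 * mb, blockSize));

        // A minimum split size above the block size pushes the split size up.
        System.out.println(computeSplitSize(1024 * mb, 128 * mb, blockSize) / mb);
    }
}
```

With a 64 MB block, a 64 MB split overlaps one block and a 128 MB split overlaps two, which is the situation the suggestion is trying to avoid.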
Thanks and Regards,
From: GUOJUN Zhu [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 11, 2012 3:50 AM
To: [EMAIL PROTECTED]
Subject: Re: Does FileSplit respect the record boundary?
Thank you for the reply. That page helps a lot. I still have a more
specific question. In LineRecordReader's constructor (Hadoop 1.0.0),
public LineRecordReader(Configuration job, FileSplit split), does the call
"final Path file = split.getPath()" return the logical file in HDFS or just
the real local file corresponding to the block in the local file system? If it
is the former case, how can we make sure the later call "FSDataInputStream
fileIn = fs.open(split.getPath()); in = new LineReader(fileIn, job);" gives
the block residing on the same local node instead of a replica on another
node? If it is the latter case ("split.getPath()" giving the local file),
how can we get the input stream handle to read the next split for an extra
line when reaching the end of the split? Thanks.
Modeling Sr Graduate
Harsh J <[EMAIL PROTECTED]>
02/10/2012 12:02 PM
Re: Does FileSplit respect the record boundary?
Please read the map section of
http://wiki.apache.org/hadoop/HadoopMapReduce to understand how Hadoop
ends up respecting record boundaries despite block-chops not taking
that into consideration. I hope it helps clear things up for you.
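The boundary rule described on that page can be tried out with a small plain-Java simulation. This is a sketch only, not Hadoop's actual LineRecordReader: the class SplitBoundaryDemo and its methods are hypothetical, the "file" is just a byte array, and the only line terminator handled is "\n". The two assumptions it encodes are that a reader whose split does not start at offset 0 backs up one byte and discards everything through the first newline, and that every reader finishes the line it has started even when that line runs past the end of its split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of split-aware line reading; no Hadoop classes are used.
class SplitBoundaryDemo {

    // Returns the lines owned by the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard through the first newline. If the
            // split starts exactly at a line start, only the previous line's
            // newline is discarded and the line at 'start' is kept.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Start a new line only while still inside the split, but always read
        // that line to its end, even if it crosses into the next split.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    // Cuts the data into fixed-size splits and reads each one independently.
    static List<List<String>> readAllSplits(byte[] data, int splitSize) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            result.add(readSplit(data, start, length));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // 10-byte splits cut through "bravo" and end exactly before "delta".
        for (List<String> split : readAllSplits(data, 10)) {
            System.out.println(split);
        }
    }
}
```

With 10-byte splits over that sample input, the first reader returns alpha and bravo (finishing bravo past its own end), the second returns only charlie, and the third returns delta, so every line is read exactly once even though the cuts ignore line boundaries.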
On Fri, Feb 10, 2012 at 10:26 PM, GUOJUN Zhu <[EMAIL PROTECTED]> wrote:
> I am learning Hadoop. We have some specially formatted text files for input,
> so we need to write a customized InputFormat, probably based on
> FileInputFormat. Does FileInputFormat respect the record boundary
> (every line or maybe every other line)? I am reading the source code
> (1.0.0). For example, in the LineRecordReader, is the "in" field (InputStream)
> of the LineReader(in,..) the full HDFS file (of many blocks) or just the
> real local file of one block? All the books I have read give very little detail
> about it. Can any expert point me to some reference about it, or maybe
> to the part of the source code I should concentrate on? Thanks.
> Zhu, Guojun
> Modeling Sr Graduate
> [EMAIL PROTECTED]
> Financial Engineering
> Freddie Mac
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about