I don't quite get what you mean - we don't have such a flaw. The first
split's task makes sure to read one extra record past its boundary, even
if its last byte is a newline. The subsequent splits (that is, those
with non-zero offsets) always ignore the first record, even if it is
complete within their boundaries.
You may read the implementation by following the sources I've linked
from similar questions asked in the past.
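Roughly, the skip/read logic works like the sketch below. This is a simplified illustration of the behavior described above, not Hadoop's actual LineRecordReader; the class and method names (SplitLineReader, readSplit) are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the split-boundary handling described above.
// Not the real Hadoop implementation; names are invented for this example.
public class SplitLineReader {

    // Returns the records a mapper for split [start, start + length) processes.
    public static List<String> readSplit(byte[] data, int start, int length) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        int end = start + length;

        // Splits with a non-zero offset always discard everything up to and
        // including the first newline -- even when the split begins exactly
        // at a record boundary. The previous split's reader owns that record.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        // Keep reading records while the record *starts* at or before the
        // split end. This is what makes a split read one extra record past
        // its boundary, even if its last byte is a newline.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step past the newline
        }
        return lines;
    }
}
```

For input "one\ntwo\nthree\nfour\n" split at offset 8 (which happens to be a record boundary), the first split yields [one, two, three] (reading "three" past its boundary) and the second split yields [four] (skipping the complete record "three" at its start), so every record is processed exactly once.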
On Fri, Jan 25, 2013 at 6:07 AM, Praveen Sripati
<[EMAIL PROTECTED]> wrote:
> Thanks for the response.
> From http://wiki.apache.org/hadoop/HadoopMapReduce
>>For example TextInputFormat will read the last line of the FileSplit past
>> the split boundary and when reading other than the first FileSplit,
>> TextInputFormat ignores the content up to the first newline.
> When the first record in a split other than the first split is complete
> and does not span the split boundary, then based on the above logic that
> particular record would not be processed by any mapper.
> On Fri, Jan 25, 2013 at 12:52 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Hi Praveen,
>> This is explained at http://wiki.apache.org/hadoop/HadoopMapReduce
>> [Map section].
>> On Thu, Jan 24, 2013 at 10:20 PM, Praveen Sripati
>> <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> > HDFS splits a file without regard to record boundaries. So, how does the
>> > mapper processing the second block (b2) determine that the first record is
>> > incomplete, and that it should start processing from the second record in
>> > the block (b2)?
>> > Thanks,
>> > Praveen
>> Harsh J