Darpan R 2013-05-21, 06:07
Re: How to handle multiline record for inputsplit?
Harsh J 2013-05-21, 06:28
If your records span a variable number of lines, then quite logically the
newline character cannot be their "record delimiter". Use the character or
byte sequence that actually marks the end of a record, and read based on
that.

The same logic as the one described at
http://wiki.apache.org/hadoop/HadoopMapReduce for newline-delimited records
applies to your files as well.
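
To make that boundary logic concrete, here is a minimal sketch in plain Python (no Hadoop involved) of how a record reader can treat split edges when records end with a custom delimiter. The names read_split and read_all and the "--\n" delimiter are made up for illustration, and the sketch assumes the delimiter never overlaps with itself inside the data. The rule is: a reader that does not start at byte 0 skips ahead to the first record beginning after its start offset, and every reader finishes the last record it began even if that record runs past the end of the split. A record is therefore owned by exactly one split, the one in which it begins.

```python
from typing import List


def read_split(data: bytes, start: int, end: int, delim: bytes) -> List[bytes]:
    """Return the records owned by the split covering bytes [start, end)."""
    pos = start
    if start != 0:
        # Skip the partial record at the front; the previous split reads it.
        # Back up by len(delim) - 1 so a delimiter straddling `start` is seen.
        hit = data.find(delim, max(0, start - len(delim) + 1))
        if hit == -1:
            return []  # no record begins inside this split
        pos = hit + len(delim)

    records = []
    # Read every record that begins at or before `end`, following it past
    # the split boundary if necessary.
    while pos <= end and pos < len(data):
        hit = data.find(delim, pos)
        stop = len(data) if hit == -1 else hit
        records.append(data[pos:stop])
        pos = stop + len(delim)
    return records


def read_all(data: bytes, split_size: int, delim: bytes) -> List[bytes]:
    """Cut `data` into fixed-size splits and read each one independently."""
    out = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        out.extend(read_split(data, start, end, delim))
    return out


# Hypothetical sample: three records of 3, 2 and 4 lines, each ended by "--\n".
DELIM = b"--\n"
DATA = b"id=1\na\nb\n--\nid=2\nc\n--\nid=3\nd\ne\nf\n--\n"
```

Whatever split size is chosen, each record comes back whole and exactly once, so the InputFormat does not have to align splits with record boundaries. If the records end with a fixed byte sequence, it is also worth checking whether your Hadoop release supports the textinputformat.record.delimiter property for TextInputFormat, which may remove the need for a custom reader.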
On Tue, May 21, 2013 at 11:37 AM, Darpan R <[EMAIL PROTECTED]> wrote:
> Hi folks,
> I have a huge text file (TBs in size) with multiline records, and we are not
> told how many lines each record takes. One record may be 5
> lines long, another 6, another 4. The number of
> lines may vary for each record.
> Since we cannot use the default TextInputFormat, we have written our own
> InputFormat and a custom record reader, but the confusion is:
> "When splits are happening, there is no guarantee that each split will contain
> the full record. Some part of a record can go in split 1 and another in split 2."
> But this is not what we want.
> So, can anyone suggest how to handle this scenario so that we can guarantee
> that one full record goes in a single InputSplit?
> Any workaround or hint will be really useful.
> Thanks in advance.