MapReduce, mail # dev - How to handle multiline record for inputsplit?


How to handle multiline record for inputsplit?
Darpan R 2013-05-21, 06:07
Hi folks,
I have a huge text file, terabytes in size, made up of multiline records. We
are not told how many lines each record takes. One record may span 5 lines,
another 6, another 4; there is no fixed count. The number of lines may vary
for each record.
Since we cannot use the default TextInputFormat, we have written our own
InputFormat and a custom RecordReader, but the confusion is:

"When the splits are computed, there is no guarantee that each split will
contain only complete records. Part of a record can land in split 1 and the
rest in split 2."
But this is not what we want.
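
To make the concern concrete, here is a minimal sketch in plain Java (no Hadoop classes) that lays a few multiline records end to end and reports which ones cross a fixed-size byte boundary. The class name SplitBoundaryDemo, the method straddlingRecords, the sample records, and the 20-byte split size are all made up for illustration; real splits are far larger, and the records are passed in as a pre-separated list here only so the effect can be shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: shows that byte-based split boundaries
// ignore record boundaries. Not part of any Hadoop API.
class SplitBoundaryDemo {

    // Returns the indexes of records whose bytes fall into more than one
    // split when the records are stored back to back and cut every
    // splitSize bytes.
    static List<Integer> straddlingRecords(List<String> records, int splitSize) {
        List<Integer> result = new ArrayList<>();
        long offset = 0; // byte offset where the current record starts
        for (int i = 0; i < records.size(); i++) {
            int length = records.get(i).getBytes(StandardCharsets.UTF_8).length;
            if (length > 0) {
                long firstByte = offset;
                long lastByte = offset + length - 1;
                // Different split index for first and last byte means the
                // record is cut by at least one split boundary.
                if (firstByte / splitSize != lastByte / splitSize) {
                    result.add(i);
                }
            }
            offset += length;
        }
        return result;
    }

    public static void main(String[] args) {
        // Three records of 5, 6 and 4 lines (15, 18 and 12 bytes).
        List<String> records = Arrays.asList(
                "a1\na2\na3\na4\na5\n",
                "b1\nb2\nb3\nb4\nb5\nb6\n",
                "c1\nc2\nc3\nc4\n");
        System.out.println(straddlingRecords(records, 20));
    }
}
```

With these sample values the second and third records are each cut by a boundary, which is exactly the situation we are worried about.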

So, can anyone suggest how to handle this scenario so that we can guarantee
that each full record goes into a single InputSplit?
Any workaround or hint would be really useful.

Thanks in advance.
 DR
Harsh J 2013-05-21, 06:28