MapReduce >> mail # dev >> How to handle multiline record for inputsplit?


Darpan R 2013-05-21, 06:07
Re: How to handle multiline record for inputsplit?
If your record is variably multi-line, then quite logically the newline
character cannot be its "record delimiter". Use the character or byte
sequence that actually delimits your records, and read based on that.
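For instance, if the records end with some fixed delimiter string, newer Hadoop releases let you keep the stock TextInputFormat and simply tell its line reader what that delimiter is via the `textinputformat.record.delimiter` property. A minimal sketch (the `##` delimiter and job name are assumptions for illustration; this is a config fragment and needs the Hadoop jars on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultilineJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Tell the line reader to split records on "##" instead of '\n'.
        // "##" is an assumed delimiter -- substitute whatever byte sequence
        // actually terminates your records.
        conf.set("textinputformat.record.delimiter", "##");
        Job job = Job.getInstance(conf, "multiline-records");
        job.setInputFormatClass(TextInputFormat.class);
        return job;
    }
}
```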

The same logic as the one described at
http://wiki.apache.org/hadoop/HadoopMapReduce for newline-delimited records
applies for your files as well.
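The convention that page describes can be sketched outside Hadoop. Below is a minimal, self-contained Java sketch (the `readSplit` method, `SplitReaderSketch` class, and `#` delimiter are illustrative assumptions, not Hadoop API) of how a reader makes boundary-straddling records come out exactly once: every reader except the first skips the partial record at the front of its split, and every reader finishes the last record it starts even when that runs past the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {
    static final byte DELIM = '#'; // assumed single-byte record delimiter

    // Returns the records "owned" by the byte range [start, end):
    // skip the partial record at the front (unless start == 0, i.e. the
    // first split), and read past `end` to finish the last record that
    // begins at or before `end`.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // The bytes up to the first delimiter are the tail of a record
            // that the previous split's reader will have finished.
            while (pos < data.length && data[pos] != DELIM) pos++;
            pos++; // step over the delimiter itself
        }
        // Emit records as long as the record's start is at or before `end`;
        // reading the record body may run past `end`, which is intended.
        while (pos < data.length && pos <= end) {
            int recStart = pos;
            while (pos < data.length && data[pos] != DELIM) pos++;
            records.add(new String(data, recStart, pos - recStart,
                                   StandardCharsets.UTF_8));
            pos++; // step over the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        // Three records, one of them spanning several lines, '#'-delimited.
        byte[] data = "rec1 line1\nrec1 line2#rec2#rec3 a\nb\nc#"
                .getBytes(StandardCharsets.UTF_8);
        // Cut the file at an arbitrary byte offset (15, mid-record):
        // each record still comes out of exactly one split.
        List<String> split1 = readSplit(data, 0, 15);
        List<String> split2 = readSplit(data, 15, data.length);
        assert split1.equals(List.of("rec1 line1\nrec1 line2"));
        assert split2.equals(List.of("rec2", "rec3 a\nb\nc"));
    }
}
```

Because both sides follow the same rule, it never matters where the framework cuts the file: the reader of the earlier split finishes the straddling record, and the reader of the later split skips it.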
On Tue, May 21, 2013 at 11:37 AM, Darpan R <[EMAIL PROTECTED]> wrote:

> Hi folks,
> I have a huge text file (terabytes in size) with multiline records, and we
> are not told how many lines each record takes: one record may span 5 lines,
> another 6, another 4. The line count varies from record to record.
> Since we cannot use the default TextInputFormat, we have written our own
> InputFormat and a custom RecordReader, but the confusion is:
>
> "When the file is split, there is no guarantee that each split contains a
> full record. Part of a record can go into split 1 and the rest into split 2."
> But this is not what we want.
>
> So, can anyone suggest how to handle this scenario, so that we can guarantee
> that each full record goes into a single InputSplit?
> Any workaround or hint would be really useful.
>
> Thanks in advance.
>  DR
>

--
Harsh J