Getting custom input splits from files that are not byte-aligned or line-aligned
I use an application that processes text files containing data records
which are of variable size and not line-aligned.
The application implementation includes a Java library with a "reader"
object that can extract records one-by-one in a "pull" fashion, as strings,
i.e., for each such "reader" object the client code can call a method
and get an entire record as a String. Proceeding in this fashion, the
client code can consume a file of arbitrary length from start to
end, at which point a null value is returned.
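To make the "pull" contract concrete, here is a minimal sketch of how the client code consumes records. The interface name PullRecordReader and the method name nextRecord() are placeholders I made up for illustration; they are not the real library's names, and the in-memory reader exists only to exercise the loop.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the library's "reader" object.
interface PullRecordReader {
    /** Returns the next record, or null once the input is exhausted. */
    String nextRecord();
}

final class PullReaderDemo {
    /** Drains a reader from start to end and returns how many records it produced. */
    static int countRecords(PullRecordReader reader) {
        int count = 0;
        while (reader.nextRecord() != null) {
            count++;
        }
        return count;
    }

    /** Toy in-memory reader, used only to exercise the loop above. */
    static PullRecordReader fromList(List<String> records) {
        Iterator<String> it = records.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }
}
```

The point is that the only way to find a record boundary is to ask the reader for the next record, which is why seeking to an arbitrary byte offset does not work here.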
Another peculiarity is that the extracted record strings may lose some
secondary information (e.g., spaces may be trimmed), so exact byte alignment
of the records to the underlying data is not possible.
How could the above code be used to efficiently split large compliant text
files (ranging from hundreds of megabytes to several gigabytes or even
terabytes in size)?
The source code I have seen in FileInputFormat and numerous other
implementations is line-aligned or byte-aligned, so it is not applicable here.
It would actually be very useful if there were a template implementation
that left only the string record "reader" object unspecified and did
everything else, but apparently there is none.
Two alternatives that should work are:
1. Split the files outside Hadoop (e.g., to sizes less than 64 MB) and
supply them to HDFS afterwards, returning false in the isSplitable() method
of the custom InputFormat.
2. Read and write records into HDFS files in the getSplits method of
the custom InputFormat and create one FileSplit reference for each of these
HDFS files, once they are filled to the desired size.
Is there any better approach and/or any example code relevant to the above?
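For reference, the pre-splitting step of alternative 1 could look roughly like the sketch below, in plain Java outside Hadoop. The names StringRecordReader, nextRecord() and RecordChunker are hypothetical placeholders, not the real library's API. The sketch also assumes that writing each record followed by a separator yields a file the same reader can parse again, which may not hold for the real format.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library's "reader" object.
interface StringRecordReader {
    /** Returns the next record, or null once the input is exhausted. */
    String nextRecord() throws IOException;
}

final class RecordChunker {
    /**
     * Copies records into chunk files of at most maxBytes each (UTF-8 size),
     * never cutting a record in two. Returns the chunk paths in order.
     */
    static List<Path> split(StringRecordReader reader, Path outDir,
                            long maxBytes, String separator) throws IOException {
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        BufferedWriter out = null;
        long written = 0;
        try {
            String record;
            while ((record = reader.nextRecord()) != null) {
                String entry = record + separator;
                long size = entry.getBytes(StandardCharsets.UTF_8).length;
                // Start a new chunk when this record would overflow the current one.
                // A record larger than maxBytes still gets a chunk of its own.
                if (out == null || (written > 0 && written + size > maxBytes)) {
                    if (out != null) {
                        out.close();
                    }
                    Path chunk = outDir.resolve(String.format("part-%05d", chunks.size()));
                    out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                    chunks.add(chunk);
                    written = 0;
                }
                out.write(entry);
                written += size;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }
}
```

Each resulting part file would then be copied to HDFS and handled as a single unsplit input, with isSplitable() returning false as described in alternative 1. The obvious cost is one sequential pass over the whole input before the job starts.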