Getting custom input splits from files that are not byte-aligned or line-aligned
Public Network Services 2013-02-23, 14:13
Hi...
I use an application that processes text files containing data records which are of variable size and not line-aligned. The application implementation includes a Java library with a "reader" object that can extract records one by one in a "pull" fashion, as strings, i.e. for each such "reader" object the client code can call reader.next() and get an entire record as a String. Proceeding in this fashion, the client code can consume a file of arbitrary length from start to end, at which point a null value is returned.

Another peculiarity is that the extracted record strings may lose some secondary information (e.g., spaces are trimmed), so exact byte alignment of the records to the underlying data is not possible.

How could the above code be used to efficiently split compliant text files of large size (ranging from hundreds of megabytes to several gigabytes or terabytes)? The source code I have seen in FileInputFormat and numerous other implementations is line-aligned or byte-aligned, so it is not applicable to the above case. It would actually be very useful if there were a template implementation that left only the string record "reader" object unspecified and did everything else, but apparently there is none.

Two alternatives that should work are:

1. Split the files outside Hadoop (e.g., to sizes less than 64 MB) and supply them to HDFS afterwards, returning false in the isSplitable() method of the custom InputFormat.

2. Read and write records into HDFS files in the getSplits() method of the custom InputFormat and create one FileSplit reference for each of these HDFS files, once they are filled to the desired size.

Is there any better approach and/or any example code relevant to the above?

Thanks!
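
To make the first alternative concrete, here is a minimal sketch in plain Java (no Hadoop dependencies) of the pre-splitting step. Everything named here is hypothetical: PullReader stands in for the library's "reader" object, RecordChunker and the chunk file names are made up for illustration, and the sketch assumes that writing each record string followed by a separator yields a file the library can read back, which may not hold for the real format.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch of alternative 1: pre-split a record file into chunk files outside Hadoop. */
public class RecordChunker {

    /** Hypothetical stand-in for the library's pull-style reader; next() returns null at end of input. */
    public interface PullReader {
        String next() throws IOException;
    }

    /**
     * Drains the reader and writes whole records into chunk files of at most
     * maxChunkBytes each (UTF-8). Records are never cut in the middle.
     * ASSUMPTION: record + separator is a valid serialization of a record.
     */
    public static List<Path> split(PullReader reader, Path outDir, long maxChunkBytes, String separator)
            throws IOException {
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        Writer out = null;
        long written = 0;
        try {
            String record;
            while ((record = reader.next()) != null) {
                String entry = record + separator;
                long size = entry.getBytes(StandardCharsets.UTF_8).length;
                // Start a new chunk when this record would overflow the current one.
                // A single record larger than the limit still gets a chunk of its own.
                if (out == null || (written > 0 && written + size > maxChunkBytes)) {
                    if (out != null) {
                        out.close();
                    }
                    Path chunk = outDir.resolve(String.format("chunk-%05d.txt", chunks.size()));
                    out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                    chunks.add(chunk);
                    written = 0;
                }
                out.write(entry);
                written += size;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }
}
```

Each resulting chunk file would then be copied into HDFS and handled as one unsplittable input, with isSplitable() returning false in the custom InputFormat as described above. Because the loop sizes chunks by counting the bytes of the extracted strings, it does not need the records to align with byte offsets in the original file.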