Re: Getting custom input splits from files that are not byte-aligned or line-aligned
Public Network Services 2013-02-23, 19:40
This appears to be the case.
My main issue is not reading the records (the library offers that
functionality) but placing them into splits after reading them (option 2 in
my original message below).
On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
[EMAIL PROTECTED]> wrote:
> I think you'll have to implement your own custom FileInputFormat, using
> this lib you mentioned to properly read your file records and split them
> through map tasks.
> Em 23/02/2013 14:14, "Public Network Services" <
> [EMAIL PROTECTED]> escreveu:
>> I use an application that processes text files containing data records
>> which are of variable size and not line-aligned.
>> The application implementation includes a Java library with a "reader"
>> object that can extract records one-by-one in a "pull" fashion, as strings,
>> i.e., for each such "reader" object the client code can call a record
>> extraction method and get an entire record back as a String. Proceeding in
>> this fashion, the client code can consume a file of arbitrary length, from
>> start to end, whereupon a null value is returned to signal the end of file.
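From the description, the reader apparently exposes something like the following pull loop. This is only a sketch: `RecordSource` and `nextRecord()` are hypothetical names standing in for whatever the library actually provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library's pull-style "reader" object;
// the real class and method names are not known from the thread.
interface RecordSource {
    String nextRecord(); // next record as a String, or null at end of file
}

class PullLoop {
    // Consume a source of arbitrary length from start to end,
    // stopping when the reader signals the end with a null value.
    static List<String> consumeAll(RecordSource source) {
        List<String> records = new ArrayList<>();
        String record;
        while ((record = source.nextRecord()) != null) {
            records.add(record);
        }
        return records;
    }
}
```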
>> Another peculiarity is that the extracted record strings may lose some
>> secondary information (e.g., trim spaces), so exact byte alignment of the
>> records to the underlying data is not possible.
>> How could the above code be used to efficiently split compliant text
>> files of large size (ranging from hundreds of megabytes to several
>> gigabytes, or even terabytes)?
>> The source code I have seen in FileInputFormat and numerous other
>> implementations is line or byte-aligned, so it is not applicable for the
>> above case.
>> It would actually be very useful if there was a template implementation
>> that left only the string record "reader" object unspecified and did
>> everything else, but apparently there is none.
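Such a template might look roughly like the sketch below, using the Hadoop mapreduce API. Here `RecordStringReader`, its stream constructor, `nextRecord()`, and `close()` are placeholders for whatever the actual library provides, not real classes.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class RecordStringInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // records are not byte-aligned, so never split mid-file
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private RecordStringReader reader; // hypothetical library reader
            private long recordCount;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx)
                    throws IOException {
                Path path = ((FileSplit) s).getPath();
                FileSystem fs = path.getFileSystem(ctx.getConfiguration());
                FSDataInputStream in = fs.open(path);
                reader = new RecordStringReader(in); // assumed constructor
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                String record = reader.nextRecord(); // assumed pull method
                if (record == null) {
                    return false; // end of file reached
                }
                key.set(recordCount++);
                value.set(record);
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return 0.0f; } // unknown
            @Override public void close() throws IOException { reader.close(); }
        };
    }
}
```

Since `isSplitable()` returns false, each file becomes one split, which sidesteps the byte-alignment problem entirely at the cost of parallelism within a single large file.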
>> Two alternatives that should work are:
>> 1. Split the files outside Hadoop (e.g., to sizes less than 64 MB)
>> and supply them to HDFS afterwards, returning false in the isSplitable()
>> method of the custom InputFormat.
>> 2. Read and write records into HDFS files in the getSplits method
>> of the custom InputFormat and create one FileSplit reference for each of
>> these HDFS files, once they are filled to the desired size.
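The repacking step common to both alternatives can be sketched independently of Hadoop: read records from the pull-style reader and group them into chunks that stay under a target size, where each chunk would then become its own file (and its own split). `RecordRepacker` is an illustrative name, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the repacking idea behind the two alternatives above: group
// records into chunks whose total size stays under a target byte budget.
// In practice each chunk would be written out as a separate file.
class RecordRepacker {
    // A single record larger than maxChunkBytes still gets its own chunk,
    // since records cannot be split.
    static List<List<String>> repack(Iterable<String> records, long maxChunkBytes) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long size = 0;
        for (String rec : records) {
            if (!current.isEmpty() && size + rec.length() > maxChunkBytes) {
                chunks.add(current); // current chunk is full; start a new one
                current = new ArrayList<>();
                size = 0;
            }
            current.add(rec);
            size += rec.length();
        }
        if (!current.isEmpty()) {
            chunks.add(current); // flush the final partial chunk
        }
        return chunks;
    }
}
```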
>> Is there any better approach and/or any example code relevant to the
>> above scenario?