Wellington Chevreuil 2013-02-23, 19:05
Re: Getting custom input splits from files that are not byte-aligned or line-aligned
This appears to be the case.

My main issue is not reading the records (the library offers that
functionality) but putting them into splits after reading (option 2 in my
original post).
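
One way option 2 could be approached, purely as a sketch: use the library
reader to rewrite the records into HDFS files of roughly the desired split
size before (or inside) getSplits(), and then hand each of those files to the
job as a single split. RecordStringReader below is a placeholder for the
library's reader, and the code assumes an extracted record can safely be
written out one per line:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordRepartitioner {

    private static final long TARGET_BYTES = 64L * 1024 * 1024; // ~64 MB per output file

    // Pulls records from the source reader and writes them into a series of
    // HDFS files, rolling over to a new file whenever the current one reaches
    // the target size. Returns the number of files written.
    public static int repartition(RecordStringReader reader, FileSystem fs,
                                  Path outputDir) throws IOException {
        int fileIndex = 0;
        long written = 0;
        FSDataOutputStream out = fs.create(new Path(outputDir, "part-" + fileIndex));
        String record;
        while ((record = reader.next()) != null) {
            byte[] bytes = (record + "\n").getBytes("UTF-8");
            if (written > 0 && written + bytes.length > TARGET_BYTES) {
                out.close();
                fileIndex++;
                written = 0;
                out = fs.create(new Path(outputDir, "part-" + fileIndex));
            }
            out.write(bytes);
            written += bytes.length;
        }
        out.close();
        return fileIndex + 1;
    }
}

Once the records have been rewritten one per line, the job itself could even
read them with the stock TextInputFormat.
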
On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
[EMAIL PROTECTED]> wrote:

> Hi,
>
> I think you'll have to implement your own custom FileInputFormat, using
> the lib you mentioned to properly read your file records and distribute
> them across map tasks.
>
> Regards,
> Wellington.
> On 23/02/2013 14:14, "Public Network Services" <
> [EMAIL PROTECTED]> wrote:
>
> Hi...
>>
>> I use an application that processes text files containing data records
>> which are of variable size and not line-aligned.
>>
>> The application implementation includes a Java library with a "reader"
>> object that can extract records one-by-one in a "pull" fashion, as strings,
>> i.e. for each such "reader" object the client code can call
>>
>> reader.next()
>>
>>
>> and get an entire record as a String. Proceeding in this fashion, the
>> client code can consume a file of arbitrary length from start to end, at
>> which point a null value is returned.
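>>
>> For illustration only, a minimal sketch of that pull-style consumption loop
>> (RecordStringReader and process() are placeholder names, not the actual
>> library API):
>>
>> RecordStringReader reader = new RecordStringReader(inputStream);
>> String record;
>> // keep pulling records until the reader signals end of input with null
>> while ((record = reader.next()) != null) {
>>     process(record);  // placeholder for per-record handling
>> }
>> reader.close();  // assumed; the real library's cleanup may differ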
>>
>> Another peculiarity is that the extracted record strings may lose some
>> secondary information (e.g., trimmed whitespace), so exact byte alignment
>> of the records to the underlying data is not possible.
>>
>> How could the above code be used to efficiently split compliant text
>> files of large size (ranging from hundreds of megabytes to several
>> gigabytes or even terabytes)?
>>
>> The source code I have seen in FileInputFormat and numerous other
>> implementations is line- or byte-aligned, so it is not applicable to the
>> above case.
>>
>> It would actually be very useful if there were a template implementation
>> that left only the string record "reader" object unspecified and did
>> everything else, but apparently there is none.
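>>
>> Purely as a sketch of what such a template might look like: a custom
>> RecordReader that delegates record extraction to a hypothetical
>> RecordStringReader whose next() returns one record as a String, or null at
>> the end of the input (all names here are illustrative, not a real API):
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FSDataInputStream;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.RecordReader;
>> import org.apache.hadoop.mapreduce.TaskAttemptContext;
>> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>>
>> public class PullRecordReader extends RecordReader<LongWritable, Text> {
>>
>>     private RecordStringReader reader;           // hypothetical library reader
>>     private FSDataInputStream in;
>>     private LongWritable key = new LongWritable();
>>     private Text value = new Text();
>>     private long recordCount = 0;
>>
>>     @Override
>>     public void initialize(InputSplit split, TaskAttemptContext context)
>>             throws IOException {
>>         Configuration conf = context.getConfiguration();
>>         Path path = ((FileSplit) split).getPath();
>>         FileSystem fs = path.getFileSystem(conf);
>>         in = fs.open(path);
>>         // the whole file is handed over as one split, so the reader can
>>         // simply start at the beginning of the stream
>>         reader = new RecordStringReader(in);     // assumed constructor
>>     }
>>
>>     @Override
>>     public boolean nextKeyValue() throws IOException {
>>         String record = reader.next();           // pull one record
>>         if (record == null) {
>>             return false;                        // end of file
>>         }
>>         key.set(recordCount++);
>>         value.set(record);
>>         return true;
>>     }
>>
>>     @Override
>>     public LongWritable getCurrentKey() { return key; }
>>
>>     @Override
>>     public Text getCurrentValue() { return value; }
>>
>>     @Override
>>     public float getProgress() {
>>         return 0.0f;                             // no reliable byte offsets
>>     }
>>
>>     @Override
>>     public void close() throws IOException {
>>         if (in != null) {
>>             in.close();
>>         }
>>     }
>> }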
>>
>> Two alternatives that should work are:
>>
>>    1. Split the files outside Hadoop (e.g., to sizes less than 64 MB)
>>    and supply them to HDFS afterwards, returning false in the isSplitable()
>>    method of the custom InputFormat (see the sketch after this list).
>>    2. Read and write records into HDFS files in the getSplits() method
>>    of the custom InputFormat and create one FileSplit reference for each of
>>    these HDFS files, once they are filled to the desired size.
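>>
>> Option 1 could then look roughly like the sketch below: a custom
>> FileInputFormat that never splits its (already appropriately sized) input
>> files, paired with the PullRecordReader sketched above. The class names are
>> illustrative:
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.JobContext;
>> import org.apache.hadoop.mapreduce.RecordReader;
>> import org.apache.hadoop.mapreduce.TaskAttemptContext;
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>
>> public class WholeFileRecordInputFormat
>>         extends FileInputFormat<LongWritable, Text> {
>>
>>     @Override
>>     protected boolean isSplitable(JobContext context, Path file) {
>>         // files were already cut to a manageable size outside Hadoop,
>>         // so never split them any further
>>         return false;
>>     }
>>
>>     @Override
>>     public RecordReader<LongWritable, Text> createRecordReader(
>>             InputSplit split, TaskAttemptContext context) throws IOException {
>>         return new PullRecordReader();
>>     }
>> }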
>>
>> Is there any better approach and/or any example code relevant to the
>> above?
>>
>> Thanks!
>>
>