Re: Getting custom input splits from files that are not byte-aligned or line-aligned
This appears to be the case.

My main issue is not reading the records (the library offers that
functionality) but putting them into splits after reading them (option 2 in my
original post).
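
To make this concrete, here is a rough sketch of what I have in mind for
option 2 (all names are invented, the 64 MB target is arbitrary, and the
nested RecordStreamReader interface merely stands in for the library's
reader): getSplits() streams every record through the reader, rewrites the
records into newline-delimited chunk files on HDFS, and returns one
FileSplit per chunk file.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public abstract class PreSplitTextInputFormat extends FileInputFormat<LongWritable, Text> {

  /** Stand-in for the library's pull-style reader: one record per next(), null at the end. */
  public interface RecordStreamReader extends java.io.Closeable {
    String next() throws IOException;
  }

  /** Hook for plugging in the actual library reader. */
  protected abstract RecordStreamReader openReader(FSDataInputStream in) throws IOException;

  private static final long TARGET_SPLIT_BYTES = 64L * 1024 * 1024;   // arbitrary target size

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    FileSystem fs = FileSystem.get(job.getConfiguration());
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path input : getInputPaths(job)) {
      RecordStreamReader reader = openReader(fs.open(input));
      FSDataOutputStream out = null;
      Path chunk = null;
      long written = 0;
      int chunkNo = 0;
      String record;
      while ((record = reader.next()) != null) {
        if (out == null || written >= TARGET_SPLIT_BYTES) {
          if (out != null) {           // close the filled chunk and register it as a split
            out.close();
            splits.add(new FileSplit(chunk, 0, written, new String[0]));
          }
          // Chunk files are written alongside the input for simplicity.
          chunk = new Path(input.getParent(), input.getName() + ".chunk-" + chunkNo++);
          out = fs.create(chunk);
          written = 0;
        }
        // Assumes an extracted record can safely be re-delimited by a newline.
        byte[] bytes = (record + "\n").getBytes("UTF-8");
        out.write(bytes);
        written += bytes.length;
      }
      if (out != null) {               // flush the last, partially filled chunk
        out.close();
        splits.add(new FileSplit(chunk, 0, written, new String[0]));
      }
      reader.close();
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new LineRecordReader();     // chunk files are plain newline-delimited text
  }
}

The obvious downside is that every input byte is read and rewritten once
before any map task starts, which is why I am asking whether there is a
better approach.
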
On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
[EMAIL PROTECTED]> wrote:

> Hi,
>
> I think you'll have to implement your own custom FileInputFormat, using
> this lib you mentioned to properly read your file records and distribute
> them across map tasks.
>
> Regards,
> Wellington.
> On 23/02/2013 14:14, "Public Network Services" <
> [EMAIL PROTECTED]> wrote:
>
>> Hi...
>>
>> I use an application that processes text files containing data records
>> which are of variable size and not line-aligned.
>>
>> The application includes a Java library with a "reader" object that can
>> extract records one by one in a "pull" fashion, as strings; i.e., for each
>> such "reader" object the client code can call
>>
>> reader.next()
>>
>>
>> and get an entire record as a String. Proceeding in this fashion, the
>> client code can consume a file of arbitrary length from start to end, at
>> which point a null value is returned.
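>>
>> For illustration, the full consumption loop looks like this (construction
>> of the "reader" itself is library-specific and omitted here, and process()
>> is just a placeholder for whatever handles one record):
>>
>>     String record;
>>     while ((record = reader.next()) != null) {
>>         // each call yields one complete record as a String
>>         process(record);
>>     }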
>>
>> Another peculiarity is that the extracted record strings may lose some
>> secondary information (e.g., trimmed whitespace), so exact byte alignment
>> of the records with the underlying data is not possible.
>>
>> How could the above code be used to efficiently split compliant text
>> files of large size (ranging from hundreds of megabytes to several
>> gigabytes, or even terabytes)?
>>
>> The source code I have seen in FileInputFormat and numerous other
>> implementations is line- or byte-aligned, so it is not applicable to the
>> above case.
>>
>> It would actually be very useful if there were a template implementation
>> that left only the string record "reader" object unspecified and did
>> everything else, but apparently there is none.
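>>
>> Something along these lines is what I have in mind, as a rough sketch (all
>> names here are invented; the only piece a user would have to supply is the
>> adapter from an open HDFS stream to the library's own reader):
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.fs.FSDataInputStream;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.RecordReader;
>> import org.apache.hadoop.mapreduce.TaskAttemptContext;
>> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>>
>> public abstract class StringRecordReader extends RecordReader<LongWritable, Text> {
>>
>>   /** Stand-in for the library's reader: one record per next(), null at the end. */
>>   public interface PullReader {
>>     String next() throws IOException;
>>     void close() throws IOException;
>>   }
>>
>>   /** The only unspecified piece: adapt the opened HDFS stream to the library reader. */
>>   protected abstract PullReader openReader(FSDataInputStream in) throws IOException;
>>
>>   private PullReader reader;
>>   private final LongWritable key = new LongWritable();
>>   private final Text value = new Text();
>>   private long count;
>>   private boolean finished;
>>
>>   @Override
>>   public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
>>     FileSplit fileSplit = (FileSplit) split;
>>     FileSystem fs = fileSplit.getPath().getFileSystem(context.getConfiguration());
>>     reader = openReader(fs.open(fileSplit.getPath()));
>>   }
>>
>>   @Override
>>   public boolean nextKeyValue() throws IOException {
>>     String record = reader.next();
>>     if (record == null) {
>>       finished = true;
>>       return false;
>>     }
>>     key.set(count++);    // record sequence number within the split as the key
>>     value.set(record);   // the whole record string as the value
>>     return true;
>>   }
>>
>>   @Override public LongWritable getCurrentKey() { return key; }
>>   @Override public Text getCurrentValue() { return value; }
>>   @Override public float getProgress() { return finished ? 1.0f : 0.0f; }
>>   @Override public void close() throws IOException { if (reader != null) reader.close(); }
>> }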
>>
>> Two alternatives that should work are:
>>
>>    1. Split the files outside Hadoop (e.g., into pieces smaller than 64 MB),
>>    load them into HDFS afterwards, and return false from the isSplitable()
>>    method of the custom InputFormat (a sketch follows this list).
>>    2. Read records and write them into HDFS files in the getSplits() method
>>    of the custom InputFormat and create one FileSplit reference for each of
>>    these HDFS files, once they are filled to the desired size.
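>>
>> For option 1, the InputFormat side would then be small, reusing the
>> StringRecordReader sketched above (again, the names are invented and the
>> imports match the previous sketch, plus FileInputFormat, JobContext and Path):
>>
>> public class WholeFileRecordInputFormat extends FileInputFormat<LongWritable, Text> {
>>
>>   @Override
>>   protected boolean isSplitable(JobContext context, Path file) {
>>     return false;   // each externally pre-sized file is consumed whole by one mapper
>>   }
>>
>>   @Override
>>   public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
>>       TaskAttemptContext context) {
>>     return new StringRecordReader() {
>>       @Override
>>       protected PullReader openReader(FSDataInputStream in) {
>>         // Placeholder: wrap "in" with the library's reader here.
>>         throw new UnsupportedOperationException("plug in the library reader");
>>       }
>>     };
>>   }
>> }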
>>
>> Is there any better approach and/or any example code relevant to the
>> above?
>>
>> Thanks!
>>
>