Re: HDFS splits based on content semantics
Harsh J 2012-08-01, 17:03
To add to David's response, also read
http://search-hadoop.com/m/ydCoSysmTd1 for more details.
On Wed, Aug 1, 2012 at 7:23 PM, David Rosenstrauch <[EMAIL PROTECTED]> wrote:
> On 08/01/2012 09:44 AM, Grandl Robert wrote:
>> This question has probably been answered many times, but I could not find
>> a clear answer after searching on Google.
>> Does HDFS split the input solely based on a fixed block size, or does it
>> take the semantics of the content into consideration?
>> For example, if I have a binary file, or I want a block not to cut through
>> lines of text, etc., will I be able to instruct HDFS where to end each
>> block?
> Hadoop can natively understand text-based data. (As long as it's in a
> one-record-per-line format.)
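
A note on the one-record-per-line case: as far as I know, HDFS itself cuts
files purely by byte count, and it is the record reader on the MapReduce side
that realigns to line boundaries. The sketch below imitates that convention in
plain Java, with no Hadoop dependency: a reader whose split does not start at
byte 0 skips the partial first line, and every reader finishes the line that
runs past the end of its split. The names LineSplitSketch, readSplit,
nextLineStart and readAll are made up for illustration and are not Hadoop APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineSplitSketch {

    /** Returns the complete lines owned by the split covering bytes [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: the reader of the previous split finishes
            // whatever line is in progress here, so skip past the next newline.
            pos = nextLineStart(data, pos);
        }
        // Keep reading while a line STARTS at or before `end`; the last line
        // may therefore extend beyond the split boundary.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            boolean endsWithNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = endsWithNewline ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Index just past the first newline at or after `from`, or data.length if none. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** Reads every fixed-size split in turn and concatenates the results. */
    static List<String> readAll(byte[] data, int blockSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += blockSize) {
            int end = Math.min(start + blockSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int blockSize = 8; // deliberately cuts through the middle of lines
        for (int start = 0; start < data.length; start += blockSize) {
            int end = Math.min(start + blockSize, data.length);
            System.out.println("[" + start + ", " + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

The point of the sketch is that the byte boundaries stay fixed while each line
is still handed to exactly one reader, whatever block size is chosen.
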
> It obviously does not understand custom binary formats. (E.g. Microsoft
> Word files.)
> However, Hadoop does provide a framework for you to create your own binary
> formats that it can understand. There is a class in Hadoop called
> SequenceFile which provides the capability to create binary files that are
> broken up into logical blocks that Hadoop can split on.
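
To illustrate how a binary format can be made splittable, here is a rough
sketch of the sync-marker idea, which is my understanding of what SequenceFile
relies on. It is not SequenceFile's actual on-disk format: the 16-byte
all-0xFF marker, the length-prefixed records, and the names SyncMarkerSketch,
write, readSplit and readAll are assumptions made for the example. The fixed
marker is safe here only because the payloads are UTF-8 text, which never
contains the byte 0xFF; a real format would need a marker that is unlikely to
collide with arbitrary data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SyncMarkerSketch {

    // Hypothetical marker: 16 bytes of 0xFF, which cannot occur in UTF-8 payloads
    // or in the small big-endian length prefixes used below.
    static final byte[] MARKER = new byte[16];
    static {
        Arrays.fill(MARKER, (byte) 0xFF);
    }

    /** Writes records as [4-byte length][payload], with a marker before every chunk. */
    static byte[] write(List<String> records, int recordsPerChunk) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.size(); i++) {
            if (i % recordsPerChunk == 0) {
                out.write(MARKER, 0, MARKER.length);
            }
            byte[] payload = records.get(i).getBytes(StandardCharsets.UTF_8);
            out.write(ByteBuffer.allocate(4).putInt(payload.length).array(), 0, 4);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    /** Reads every chunk whose marker begins inside [start, end). */
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = findMarker(file, start); // resynchronise from an arbitrary offset
        while (pos >= 0 && pos < end && pos < file.length) {
            pos += MARKER.length;
            // Read whole records until the next marker or the end of the file.
            while (pos < file.length && !isMarkerAt(file, pos)) {
                int len = ByteBuffer.wrap(file, pos, 4).getInt();
                records.add(new String(file, pos + 4, len, StandardCharsets.UTF_8));
                pos += 4 + len;
            }
        }
        return records;
    }

    /** Offset of the first marker at or after `from`, or -1 if there is none. */
    static int findMarker(byte[] file, int from) {
        for (int i = from; i + MARKER.length <= file.length; i++) {
            if (isMarkerAt(file, i)) {
                return i;
            }
        }
        return -1;
    }

    static boolean isMarkerAt(byte[] file, int pos) {
        if (pos + MARKER.length > file.length) {
            return false;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (file[pos + i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads every fixed-size split in turn and concatenates the results. */
    static List<String> readAll(byte[] file, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < file.length; start += splitSize) {
            all.addAll(readSplit(file, start, Math.min(start + splitSize, file.length)));
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("alpha", "bravo", "charlie", "delta", "echo");
        byte[] file = write(records, 2);
        System.out.println(readAll(file, 40));
    }
}
```

A reader dropped at any byte offset scans forward to the next marker and starts
from a known record boundary, which is what lets the fixed-size splits be
processed independently without cutting a record in half.
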