HDFS >> mail # user >> HDFS splits based on content semantics


Re: HDFS splits based on content semantics
To add to David's response, also read
http://search-hadoop.com/m/ydCoSysmTd1 for more info.

On Wed, Aug 1, 2012 at 7:23 PM, David Rosenstrauch <[EMAIL PROTECTED]> wrote:
> On 08/01/2012 09:44 AM, Grandl Robert wrote:
>>
>> Hi,
>>
>> This question has probably been answered many times, but I could not
>> find a clear answer after searching on Google.
>>
>>
>> Does HDFS split the input solely based on a fixed block size, or does it
>> take the semantics of the content into consideration?
>> For example, if I have a binary file, or if I want a block not to cut
>> lines of text in half, can I instruct HDFS where each block should end?
>>
>> Thanks,
>> Robert
>>
>
> Hadoop can natively understand text-based data.  (As long as it's in a
> one-record-per-line format.)
>
> It obviously does not understand custom binary formats.  (E.g. Microsoft
> Word files.)
>
> However, Hadoop does provide a framework for you to create your own binary
> formats that it can understand.  There is a class in Hadoop called
> SequenceFile, which lets you create binary files that are broken up into
> logical blocks (via sync markers) that Hadoop can split on.
>
> HTH,
>
> DR

--
Harsh J
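
To illustrate the point in David's reply: HDFS cuts a file into fixed-size blocks with no regard for content, and it is the reader layer (in Hadoop, the record reader used by TextInputFormat) that repairs record boundaries. The sketch below is a hypothetical, simplified stand-in for that logic, not Hadoop's actual code: every split except the first skips the partial line at its start, and every reader finishes the last line that begins inside its split, even if that means reading past the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of how line records survive fixed-size splits.
// Blocks cut the byte stream blindly; the reader of each split
// re-aligns to line boundaries, so no line is lost or torn.
public class SplitReaderSketch {

    // Read the complete lines "owned" by the split [start, end).
    // Rule (same idea as Hadoop's LineRecordReader): if start > 0, skip
    // the partial first line (the previous split's reader owns it), and
    // finish any line that starts before `end`, even past the boundary.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start > 0) {                      // skip the partial first line
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;                            // move past the newline
        }
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart,
                                 StandardCharsets.UTF_8));
            pos++;                            // past the newline (may cross `end`)
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n"
                          .getBytes(StandardCharsets.UTF_8);
        int blockSize = 8;                    // tiny "block size" for the demo
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += blockSize) {
            all.addAll(readSplit(data, start,
                                 Math.min(start + blockSize, data.length)));
        }
        System.out.println(all);              // prints [alpha, bravo, charlie, delta]
    }
}
```

Note that each split is read independently, so the splits can be processed by different map tasks; the skip-first/finish-last convention guarantees every line is produced exactly once. SequenceFile achieves the same property for binary records by embedding sync markers the reader can scan for.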