HDFS user mailing list: HDFS splits based on content semantics


Grandl Robert 2012-08-01, 13:44
David Rosenstrauch 2012-08-01, 13:53
Harsh J 2012-08-01, 17:03

Re: HDFS splits based on content semantics
Thank you guys.

Really helpful.

________________________________
 From: Harsh J <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, August 1, 2012 1:03 PM
Subject: Re: HDFS splits based on content semantics
 
To add onto David's response, also read
http://search-hadoop.com/m/ydCoSysmTd1 for some more info.

On Wed, Aug 1, 2012 at 7:23 PM, David Rosenstrauch <[EMAIL PROTECTED]> wrote:
> On 08/01/2012 09:44 AM, Grandl Robert wrote:
>>
>> Hi,
>>
>> This question has probably been answered many times, but I could not find
>> a clear answer after searching on Google.
>>
>>
>> Does HDFS split the input solely based on a fixed block size, or does it
>> take the semantics of the content into consideration?
>> For example, if I have a binary file, or I want a block not to cut through
>> lines of text, etc., will I be able to instruct HDFS where to end each
>> block?
>>
>> Thanks,
>> Robert
>>
>
> Hadoop can natively understand text-based data.  (As long as it's in a
> one-record-per-line format.)
>
> It obviously does not understand custom binary formats.  (E.g. Microsoft
> Word files.)
>
> However, Hadoop does provide a framework for you to create your own binary
> formats that it can understand.  There is a class in Hadoop called
> SequenceFile which provides the capability to create binary files that are
> broken up into logical blocks that Hadoop can split on.
>
> HTH,
>
> DR

--
Harsh J
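
To make the one-record-per-line case above concrete, here is a minimal sketch in plain Python (not the Hadoop API) of how fixed-size byte ranges can still yield whole lines. It assumes the convention used by Hadoop's text line reader, heavily simplified: storage cuts the file at fixed offsets without looking at the content, and the reader for each range repairs the boundaries. The function names `split_offsets` and `read_lines_for_split` are hypothetical and exist only for this illustration.

```python
def split_offsets(total_len, block_size):
    """Cut [0, total_len) into fixed-size byte ranges, ignoring content."""
    return [(start, min(start + block_size, total_len))
            for start in range(0, total_len, block_size)]


def read_lines_for_split(data, start, end):
    """Return the complete lines owned by the byte range [start, end).

    Simplified convention (an assumption of this sketch):
    - a range that does not begin at offset 0 discards everything up to
      and including the first newline it sees, because the reader of the
      previous range is responsible for finishing that line;
    - a reader keeps going past `end` to finish the last line it started.
    """
    pos = start
    if start != 0:
        newline = data.find(b"\n", start)
        if newline == -1:
            return []          # no line begins inside this range
        pos = newline + 1      # first line start owned by this range
    lines = []
    # "pos <= end" lets this reader take a line that begins exactly at
    # `end`, matching the unconditional skip done by the next range.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


data = b"alpha\nbravo charlie\ndelta\necho\n"
splits = split_offsets(len(data), 8)   # 8-byte "blocks" cut lines apart
per_split = [read_lines_for_split(data, s, e) for s, e in splits]
all_lines = [line for lines in per_split for line in lines]
```

Each line comes back exactly once and intact, even though the 8-byte ranges cut through "bravo charlie". A range that lies entirely inside one long line simply produces no records. In this sketch the block boundaries themselves are never moved; only the reader decides where records begin and end.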