Re: Regarding loading a big XML file to HDFS
If your file is bigger than the block size (typically 64 MB or 128 MB), it will be split into more than one block. The blocks may or may not be stored on different datanodes. If you're using the default InputFormat, the input will be split across more than one map task. Since you said you need the whole file in order to process it, you should either write a custom InputFormat that doesn't split the file, or use something like WholeFileInputFormat, which returns the whole file as a single record.
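
For reference, a minimal sketch of the first option (preventing splits), assuming the newer org.apache.hadoop.mapreduce API; the class name below is hypothetical and not something from this thread:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class: a TextInputFormat that never splits its input, so each
// XML file is handed to exactly one map task no matter how many HDFS blocks
// it occupies.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // one file == one InputSplit == one mapper
  }
}

You would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class). Note that the single map task may still read some of the file's blocks remotely over the network, since HDFS can place a file's blocks on different datanodes.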

-Joey

On Nov 21, 2011, at 20:20, hari708 <[EMAIL PROTECTED]> wrote:

>
> Hi,
> I have a big file consisting of XML data.the XML is not represented as a
> single line in the file. if we stream this file using ./hadoop dfs -put
> command to a hadoop directory .How the distribution happens.?
> Basically in My mapreduce program i am expecting a complete XML as my
> input.i have a CustomReader(for XML) in my mapreduce job configuration.My
> main confusion is if namenode distribute data to DataNodes ,there is a
> chance that a part of xml can go to one data node and other half can go in
> another datanode.If that is the case will my custom XMLReader in the
> mapreduce be able to combine it(as mapreduce reads data locally only).
> Please help me on this?
> --
> View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871901p32871901.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>