RE: Regarding loading a big XML file to HDFS

>______________________________________
>From: hari708 [[EMAIL PROTECTED]]
>Sent: Tuesday, November 22, 2011 6:50 AM
>To: [EMAIL PROTECTED]
>Subject: Regarding loading a big XML file to HDFS

>Hi,
>I have a big file consisting of XML data. The XML is not represented as a
>single line in the file. If we stream this file to a Hadoop directory using
>the ./hadoop dfs -put command, how does the distribution happen?

HDFS will divide the file into blocks based on the block size configured for the file.
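To make the block division concrete, here is a rough sketch of the arithmetic. The 200 MB file size, the 64 MB block size, and the function names are assumptions chosen for illustration, not values from your cluster. It shows where the block boundaries fall and how a single XML record can end up overlapping two blocks.

```python
MB = 1024 * 1024

def block_ranges(file_size, block_size):
    """Return (offset, length) for each fixed-size block of a file."""
    ranges = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

def blocks_touched(record_start, record_end, block_size):
    """Indices of the blocks that the byte range [record_start, record_end) overlaps."""
    first = record_start // block_size
    last = (record_end - 1) // block_size
    return list(range(first, last + 1))

if __name__ == "__main__":
    # Hypothetical 200 MB file stored with a hypothetical 64 MB block size.
    for offset, length in block_ranges(200 * MB, 64 * MB):
        print(offset // MB, length // MB)
    # A hypothetical 200-byte XML record that starts 100 bytes before the first boundary.
    print(blocks_touched(64 * MB - 100, 64 * MB + 100, 64 * MB))
```

The division is purely by byte offset, so HDFS itself pays no attention to where an XML element starts or ends.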

>Basically, in my MapReduce program I am expecting a complete XML document as my
>input. I have a CustomReader (for XML) in my MapReduce job configuration. My
>main confusion is this: if the NameNode distributes data to the DataNodes, there is a
>chance that one part of the XML can go to one DataNode and the other half can go to
>another DataNode. If that is the case, will my custom XMLReader in the
>MapReduce job be able to combine it (as MapReduce reads data locally only)?
>Please help me with this.

If you cannot do anything in parallel here, set your input split size to cover the complete file size.
Also configure the block size to cover the complete file size. In this case, only one mapper (and, with the default setting, one reducer) will be spawned for the file. But here you won't get any parallel processing advantage.
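As a rough illustration of why a large split size leads to a single mapper, the sketch below models the split-size rule max(min_size, min(max_size, block_size)) that Hadoop's newer FileInputFormat uses. Treat it as a simplified model under that assumption: it ignores details of the real implementation, and the function names and sizes are made up for the example.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    """Simplified model of the split-size rule: the minimum split size wins over the block size."""
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size):
    """Number of splits (and so map tasks) for one file, using ceiling division."""
    return (file_size + split_size - 1) // split_size

if __name__ == "__main__":
    file_size = 200 * MB   # hypothetical file size
    block_size = 64 * MB   # hypothetical block size
    no_limit = 2 ** 63 - 1

    # Default-like settings: the split size follows the block size.
    default_split = compute_split_size(block_size, 1, no_limit)
    print(count_splits(file_size, default_split))

    # Minimum split size raised to the whole file: one split, so one mapper.
    whole_file_split = compute_split_size(block_size, file_size, no_limit)
    print(count_splits(file_size, whole_file_split))
```

With the minimum split size at or above the file size, the whole file becomes one split, so your reader sees the complete XML, at the cost of all parallelism.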

>--
>View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
>Sent from the Hadoop core-user mailing list archive at Nabble.com.