Hadoop, mail # user - Regarding loading a big XML file to HDFS


RE: Regarding loading a big XML file to HDFS
Uma Maheswara Rao G 2011-11-22, 03:03

>______________________________________
>From: hari708 [[EMAIL PROTECTED]]
>Sent: Tuesday, November 22, 2011 6:50 AM
>To: [EMAIL PROTECTED]
>Subject: Regarding loading a big XML file to HDFS

>Hi,
>I have a big file consisting of XML data. The XML is not represented as a
>single line in the file. If we stream this file to a Hadoop directory using
>the ./hadoop dfs -put command, how does the distribution happen?

HDFS will divide the file into blocks based on the block size configured for the file.
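To make the arithmetic concrete, here is a minimal sketch of how a file is cut into fixed-size blocks purely by byte offset. BlockLayout and its methods are hypothetical helpers written for this illustration, not part of Hadoop, and the 200 MB file size and 64 MB block size are assumed example values (your dfs.block.size may differ).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop class): models cutting a file into fixed-size blocks.
class BlockLayout {

    // Number of blocks needed to hold fileSize bytes (ceiling division).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Byte range [start, end) covered by each block; the last block may be shorter.
    static List<long[]> blockRanges(long fileSize, long blockSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < fileSize; start += blockSize) {
            ranges.add(new long[] {start, Math.min(start + blockSize, fileSize)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 200 * mb;  // assumed example file size
        long blockSize = 64 * mb;  // assumed block size
        System.out.println("blocks: " + blockCount(fileSize, blockSize));
        for (long[] r : blockRanges(fileSize, blockSize)) {
            System.out.println("block covers bytes " + r[0] + " to " + (r[1] - 1));
        }
    }
}
```

With these assumed sizes the file occupies four blocks, the last one holding only 8 MB. The cut points depend only on byte offsets, so an XML element can straddle two blocks.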

>Basically, in my MapReduce program I am expecting the complete XML as my
>input. I have a CustomReader (for XML) in my MapReduce job configuration. My
>main confusion is this: if the NameNode distributes data to DataNodes, there
>is a chance that one part of the XML goes to one DataNode and the other half
>goes to another DataNode. If that is the case, will my custom XMLReader in the
>MapReduce job be able to combine it (as MapReduce reads data locally only)?
>Please help me with this.

If you cannot do anything in parallel here, set your input split size to cover the complete file size.
Also configure the block size to cover the complete file size. In this case, only one mapper (and, with the default setting, one reducer) will be spawned for the file. But you won't get any parallel-processing advantage.
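The effect of the split size on the number of mappers can be sketched as follows. SplitEstimate is a hypothetical class written for this illustration. Its splitSize method follows the rule FileInputFormat applies, max(minSize, min(maxSize, blockSize)), while splitCount is a simplified ceiling division that ignores the small slop factor the real implementation allows on the last split. The sizes are assumed example values.

```java
// Hypothetical helper (not a Hadoop class): estimates how many input splits,
// and therefore map tasks, a single file produces.
class SplitEstimate {

    // Split size rule used by FileInputFormat: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count: ceiling division, ignoring the last-split slop factor.
    static long splitCount(long fileSize, long splitSize) {
        return fileSize == 0 ? 0 : (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 200 * mb;  // assumed example file size
        long blockSize = 64 * mb;  // assumed block size

        // Default-like settings: minimum split size 1 byte, no upper bound.
        long defaultSplit = splitSize(1L, Long.MAX_VALUE, blockSize);
        System.out.println("default splits: " + splitCount(fileSize, defaultSplit));

        // Minimum split size raised to the whole file: a single split.
        long wholeFileSplit = splitSize(fileSize, Long.MAX_VALUE, blockSize);
        System.out.println("whole-file splits: " + splitCount(fileSize, wholeFileSplit));
    }
}
```

Under these assumptions the default settings give four splits, and raising the minimum split size (mapred.min.split.size in the releases of this period) to the file size gives one. Raising the block size as well keeps that single split inside one block, so the one mapper can still read its data from a single DataNode.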

>--
>View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
>Sent from the Hadoop core-user mailing list archive at Nabble.com.