RE: Regarding loading a big XML file to HDFS

Just wanted to address this:
> >Basically, in my MapReduce program I am expecting a complete XML document as my
> >input. I have a custom reader (for XML) in my MapReduce job configuration. My
> >main confusion is: if the NameNode distributes data to the DataNodes, there is a
> >chance that one part of the XML can go to one DataNode and the other half to
> >another DataNode. If that is the case, will my custom XML reader in the
> >MapReduce job be able to combine them (as MapReduce reads data locally only)?
> >Please help me with this.
>
> If you cannot do anything in parallel here, make your input split size cover the complete file size.
> Also configure the block size to cover the complete file size. In this case,
> only one mapper and one reducer will be spawned for the file. But then you
> won't get any parallel processing advantage.
>

You can do this in parallel.
You need to write a custom input format class. (Which is what you're already doing...)

Let's see if I can explain this correctly.
You have an XML record split across block A and block B.

Your MapReduce job will instantiate a map task per input split (by default, one per block).
So in the mapper processing block A, you read and process the XML records... when you get to the last record, which is only partially in A, mapper A will continue on into block B, finish reading that record, and then stop.
In the mapper for block B, the reader will skip over data, without processing it, until it sees the start of a record. So all of your XML records get processed, with no duplication, and the work is done in parallel.
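
Here is a rough sketch of such a record reader, written against the newer
org.apache.hadoop.mapreduce API and loosely following the approach of Mahout's
XmlInputFormat. The <record>/</record> tag names and the class name are only
placeholders; treat this as an outline, not a drop-in implementation.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class XmlRecordReader extends RecordReader<LongWritable, Text> {
      private static final byte[] START_TAG = "<record>".getBytes();
      private static final byte[] END_TAG = "</record>".getBytes();

      private FSDataInputStream in;
      private long start, end;
      private final DataOutputBuffer buffer = new DataOutputBuffer();
      private final LongWritable key = new LongWritable();
      private final Text value = new Text();

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Configuration conf = context.getConfiguration();
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        in = fs.open(file);
        in.seek(start);                       // begin at this split's offset
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        // Only *start* a record while still inside this split; if the record
        // runs past the split end (into the next block), keep reading anyway.
        if (in.getPos() < end && readUntilMatch(START_TAG, false)) {
          buffer.reset();
          buffer.write(START_TAG);
          if (readUntilMatch(END_TAG, true)) {
            key.set(in.getPos());
            value.set(buffer.getData(), 0, buffer.getLength());
            return true;
          }
        }
        return false;                         // no more records start in this split
      }

      // Scan forward byte by byte until the given tag is seen (or EOF).
      // When capture is true, copy the bytes read into 'buffer'.
      private boolean readUntilMatch(byte[] match, boolean capture) throws IOException {
        int i = 0;
        while (true) {
          int b = in.read();
          if (b == -1) return false;          // end of file
          if (capture) buffer.write(b);
          if (b == match[i]) {
            if (++i >= match.length) return true;
          } else {
            i = 0;
          }
          // While looking for a start tag, give up once past the split end:
          // that record belongs to the next split's reader.
          if (!capture && i == 0 && in.getPos() >= end) return false;
        }
      }

      @Override public LongWritable getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() throws IOException {
        return end == start ? 1.0f : Math.min(1.0f, (in.getPos() - start) / (float) (end - start));
      }
      @Override public void close() throws IOException { in.close(); }
    }

The key point is the two checks around the split end: an end tag is chased past
the split boundary, but a new start tag is never looked for beyond it.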

Does that make sense?

-Mike
> Date: Tue, 22 Nov 2011 03:08:20 +0000
> From: [EMAIL PROTECTED]
> Subject: RE: Regarding loading a big XML file to HDFS
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>
> Also, I am wondering how you are writing the MapReduce application here. Map and reduce work with key/value pairs.
> ________________________________________
> From: Uma Maheswara Rao G
> Sent: Tuesday, November 22, 2011 8:33 AM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: Regarding loading a big XML file to HDFS
>
> >______________________________________
> >From: hari708 [[EMAIL PROTECTED]]
> >Sent: Tuesday, November 22, 2011 6:50 AM
> >To: [EMAIL PROTECTED]
> >Subject: Regarding loading a big XML file to HDFS
>
> >Hi,
> >I have a big file consisting of XML data. The XML is not represented as a
> >single line in the file. If we stream this file to a Hadoop directory using
> >the ./hadoop dfs -put command, how does the distribution happen?
>
> HDFS will divide the file into blocks based on the block size configured for it.
>
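
As an illustration (the path, file name, and size below are only placeholders),
the block size can be set per file at upload time:

    hadoop dfs -D dfs.block.size=268435456 -put big.xml /user/hari/input/big.xml

That would store big.xml with 256 MB blocks regardless of the cluster default.
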
> >Basically, in my MapReduce program I am expecting a complete XML document as my
> >input. I have a custom reader (for XML) in my MapReduce job configuration. My
> >main confusion is: if the NameNode distributes data to the DataNodes, there is a
> >chance that one part of the XML can go to one DataNode and the other half to
> >another DataNode. If that is the case, will my custom XML reader in the
> >MapReduce job be able to combine them (as MapReduce reads data locally only)?
> >Please help me with this.
>
> If you cannot do anything in parallel here, make your input split size cover the complete file size.
> Also configure the block size to cover the complete file size. In this case, only one mapper and one reducer will be spawned for the file. But then you won't get any parallel processing advantage.
>
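
If you do go the single-mapper route, an alternative to tuning split and block
sizes is to make the custom input format unsplittable, so the whole file is
always handed to one map task. A rough sketch (class names are placeholders,
newer mapreduce API assumed):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class WholeFileXmlInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one mapper reads the file end to end
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new XmlRecordReader();   // e.g. the record reader sketched above
      }
    }

The trade-off is the one already mentioned: the file is processed by a single
mapper, so there is no parallelism across blocks.
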
> >--
> >View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
> >Sent from the Hadoop core-user mailing list archive at Nabble.com.
>