Your requirement is that your M/R job must see the full XML file while operating.
(If that is right, then please try the approach below.)
You can put this XML file in the DistributedCache, which is shared
across the M/R tasks, so each task gets the whole XML instead of a chunk of the data.
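A minimal sketch of that approach, assuming the classic Hadoop 1.x API (org.apache.hadoop.filecache.DistributedCache) and a hypothetical HDFS path /user/hadoop/big.xml that you would replace with your own file:

```java
// Sketch only: assumes Hadoop 1.x and that big.xml has already been
// uploaded to HDFS at /user/hadoop/big.xml (hypothetical path).
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlCacheExample {

    public static class XmlAwareMapper
            extends Mapper<Object, Object, Object, Object> {
        @Override
        protected void setup(Context context) throws IOException {
            // Each task node receives a local copy of the cached file,
            // so the mapper can open the complete XML document rather
            // than only the split it was assigned.
            Path[] cached = DistributedCache
                    .getLocalCacheFiles(context.getConfiguration());
            if (cached != null && cached.length > 0) {
                // parse cached[0] with your XML reader here
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship the whole XML file to every task node via the cache.
        DistributedCache.addCacheFile(
                new URI("/user/hadoop/big.xml"), conf);
        Job job = new Job(conf, "xml-cache-example");
        job.setJarByClass(XmlCacheExample.class);
        job.setMapperClass(XmlAwareMapper.class);
        // ... set input/output formats and paths as usual ...
    }
}
```

This is a job-configuration fragment, not a complete runnable job; the point is only that addCacheFile in the driver plus getLocalCacheFiles in the mapper's setup gives every task the whole file.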
On Tue, May 15, 2012 at 11:30 PM, @dataElGrande <[EMAIL PROTECTED]> wrote:
> You should check out Pentaho's how-tos dealing with Hadoop and MapReduce.
> Hope this helps! http://wiki.pentaho.com/display/BAD/How+To%27s
> hari708 wrote:
> > Hi,
> > I have a big file consisting of XML data. The XML is not represented as a
> > single line in the file. If we stream this file to a Hadoop directory using
> > the ./hadoop dfs -put command, how does the distribution happen?
> > Basically, in my MapReduce program I am expecting a complete XML document as
> > my input, and I have a CustomReader (for XML) in my MapReduce job
> > configuration. My main confusion is: if the NameNode distributes data to the
> > DataNodes, there is a chance that one part of the XML goes to one DataNode
> > and the other half goes to another DataNode. If that is the case, will my
> > custom XMLReader in the MapReduce job be able to combine them (as MapReduce
> > reads data locally only)? Please help me with this.
> Sent from the Hadoop core-user mailing list archive at Nabble.com.