HDFS, mail # user - Re: XML parsing in Hadoop

RE: XML parsing in Hadoop
Vinayakumar B 2013-11-28, 09:10
Hi Chhaya,
Looking at your MapReduce job: you are still using TextInputFormat, which reads the input file line by line and calls the map() method for each line.

You are actually doing the following things:
1. The job takes an XML file as its input.

2. For each line of the XML file, MapReduce calls the map() method.

3. In your map() method, you parse the entire file and write each node's name and value to the output.

That means, if your XML file has 1000 lines, the same XML file will be parsed 1000 times.
This is why your job is taking so long.
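The cost described above can be reproduced outside Hadoop. The sketch below (pure JDK; the class name and 4-line document are illustrative, not from the thread) simulates one map() call per input line, where each call re-parses the whole document, just as the job further down does:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PerLineParseDemo {

    // Counts how many full-document parses happen when the document is
    // fed to a map()-like step one line at a time, as TextInputFormat does.
    static int countParses(String xml) throws Exception {
        int parses = 0;
        String[] lines = xml.split("\n");
        for (String line : lines) { // TextInputFormat: one map() call per line
            // Like the map() in the job below, we ignore `line` and
            // re-parse the entire document on every call.
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            parses++;
        }
        return parses;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n<a>1</a>\n<b>2</b>\n</root>";
        System.out.println("4-line document parsed " + countParses(xml) + " times");
    }
}
```

For an N-line document this does N full parses, so the work grows with the square of the document size rather than linearly.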

You may need to write a custom InputFormat so that each XML document arrives at map() as a single record.
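For reference, a minimal sketch of such an input format, following the common whole-file-reading pattern (class names here are illustrative, and it assumes each XML document is its own file, small enough to fit in memory):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One record per file: map() runs once per document instead of once per line.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // keep each XML file in a single split
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private TaskAttemptContext context;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            FSDataInputStream in = fs.open(file);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

The driver would then call job.setInputFormatClass(WholeFileInputFormat.class), and map() would receive the whole document as one value and parse it exactly once. If the XML is one large file of repeated elements, an input format that splits on configurable start/end tags (such as the XmlInputFormat shipped with Mahout) may be worth looking at instead.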

Thanks and Regards,

From: Chhaya Vishwakarma [mailto:[EMAIL PROTECTED]]
Sent: 28 November 2013 14:18
Subject: RE: XML parsing in Hadoop


Yes, I have run it without MR and it takes only a few seconds, so I think it is an MR issue.
I have a single-node cluster and it launches 4 map tasks. I am trying with only one file.
Chhaya Vishwakarma

From: Mirko Kämpf [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 28, 2013 12:53 PM
Subject: Re: XML parsing in Hadoop


Did you run the same code in standalone mode, without the MapReduce framework?
How long does the code in your map() function take standalone?
Compare those two times (t_0 for MR mode, t_1 for standalone mode) to find out
whether it is an MR issue or something that comes from the XML-parser logic or the data ...
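A standalone measurement of t_1 can be taken with plain JDK code along these lines (class name and sample document are illustrative):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class StandaloneTiming {

    // Times a single standalone parse of the document: this is t_1.
    static long parseNanos(String xml) throws Exception {
        long start = System.nanoTime();
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("standalone parse took "
                + parseNanos("<root><a>1</a></root>") + " ns");
    }
}
```

If t_1 is seconds while the MR job takes hours, the overhead is in how the job drives the parser, not in the parser itself.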

Usually it should not be that slow. But what cluster do you have, how many mappers/reducers, and how many such 2MB files do you have?

Best wishes
2013/11/28 Chhaya Vishwakarma <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
The code below parses an XML file. The output of the code is correct, but the job takes a very long time to complete:
it took 20 hours to parse a 2MB file.
Kindly suggest what changes could be made to improve the performance.

package xml;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.*;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.log4j.Logger;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class ReadXmlMR {
    static Logger log = Logger.getLogger(ReadXmlMR.class.getName());
    public static String fileName = new String();
    public static Document dom;

    public void configure(JobConf job) {
        fileName = job.get("map.input.file");
    }

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            try {
                FileSplit fileSplit = (FileSplit) context.getInputSplit();
                Configuration conf = context.getConfiguration();

                DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

                FSDataInputStream fstream1;
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                fstream1 = fs.open(fileSplit.getPath());
                DocumentBuilder db = dbf.newDocumentBuilder();
                dom = db.parse(fstream1);
                Element docEle = null;
                docEle = dom.getDocumentElement();

                XPath xpath = XPathFactory.newInstance().newXPath();

                Object result = xpath.compile("//*").evaluate(dom, XPathConstants.NODESET);

                NodeList nodes = (NodeList) result;
                for (int n = 2; n < nodes.getLength(); n++) {

                    Text colvalue = new Text("");
                    Text nodename = new Text("");

                    nodename = new Text(nodes.item(n).getNodeName());