MapReduce >> mail # user >> XML to TEXT


I suggest using XPath, which Java supports natively (the javax.xml.xpath
package) for querying parsed XML documents.
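
As a quick illustration of the XPath API on its own, here is a minimal sketch outside Hadoop. The class name XPathDemo, the select helper, and the sample document (element names and values) are made up for this example; only the javax.xml.xpath calls are standard.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {

    // Hypothetical sample document; names and values are invented for illustration.
    static final String XML =
            "<main><data><id>1</id><name>alice</name></data>"
          + "<data><id>2</id><name>bob</name></data></main>";

    /** Evaluates an XPath expression against an XML string and returns the matching nodes. */
    static NodeList select(String expression, String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        // Print the text of every <name> element under /main/data.
        NodeList names = select("/main/data/name", XML);
        for (int i = 0; i < names.getLength(); i++) {
            System.out.println(names.item(i).getTextContent());
        }
    }
}
```

Requesting XPathConstants.NODESET returns every match as a NodeList; XPathConstants.STRING would return only the text of the first match.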

For the main problem, as with the distcp command (
http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a
reduce function, because you can parse the XML input file and create the
file you need in the map function. For example, the following code reads an
XML file in HDFS, parses it, and creates a new file ("/result.txt") in the
expected format:
Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    // With the default TextInputFormat, map() receives one line per call, so
    // this assumes the whole XML document sits on a single input line.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        // create() overwrites an existing file by default.
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";

        String xmlContent = value.toString();
        InputStream is = new ByteArrayInputStream(xmlContent.getBytes("UTF-8"));
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            // Cast to the standard NodeList interface instead of the
            // JDK-internal DTMNodeList class.
            NodeList list = (NodeList) getNode("/main/data", doc,
                    XPathConstants.NODESET);

            // Collect the header and one comma-separated line per <data> node.
            StringBuilder result = new StringBuilder(header);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                // Drop the trailing comma.
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                result.append(line);
            }
            resultIS = new ByteArrayInputStream(
                    result.toString().getBytes("UTF-8"));

        // MyLogguer is the poster's own logging helper (not shown here).
        } catch (ParserConfigurationException e) {
            MyLogguer.log("error: " + e.getMessage());
        } catch (SAXException e) {
            MyLogguer.log("error: " + e.getMessage());
        } catch (XPathExpressionException e) {
            MyLogguer.log("error: " + e.getMessage());
        }

        // Copy the result to HDFS; the final argument closes both streams.
        IOUtils.copyBytes(resultIS, out, 4096, true);
    }

    public static Object getNode(String xpathStr, Node node, QName returnType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, returnType);
    }
}
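
The conversion step can also be tried without a cluster. Below is a minimal sketch, assuming records under /main/data with id and name children as in the mapper; the class name XmlToCsvSketch, the toCsv helper, and the sample values are hypothetical. Unlike the mapper, it skips non-element children, so pretty-printed XML with whitespace between tags does not leak into the output.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XmlToCsvSketch {

    /** Turns every node matched by recordXPath into one comma-separated line. */
    static String toCsv(String xml, String recordXPath, String header) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(recordXPath, doc, XPathConstants.NODESET);

        StringBuilder out = new StringBuilder(header).append('\n');
        for (int i = 0; i < records.getLength(); i++) {
            NodeList fields = records.item(i).getChildNodes();
            boolean first = true;
            for (int j = 0; j < fields.getLength(); j++) {
                Node field = fields.item(j);
                // Skip whitespace-only text nodes between elements.
                if (field.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                if (!first) {
                    out.append(',');
                }
                out.append(field.getTextContent());
                first = false;
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input; the values are invented for illustration.
        String xml = "<main>\n"
                + "  <data><id>1</id><name>alice</name></data>\n"
                + "  <data><id>2</id><name>bob</name></data>\n"
                + "</main>";
        System.out.print(toCsv(xml, "/main/data", "id,name"));
    }
}
```

Tracking the first field with a flag avoids the trailing-comma cleanup and keeps empty fields in their column.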

Main class:
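import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        job.setMapperClass(XmlToTextMapper.class);
        // Map-only job: no reduce function is needed.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}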

To execute the job you can use:

         bin/hadoop Main /data.xml /output
Then you can use this to see the result.txt file:

          hadoop fs -cat /result.txt
I'm using this xml as input:


and the content in result.txt is like this:

Hope this helps.
2014/1/3 Ranjini Rathinam <[EMAIL PROTECTED]>