Re: XML parsing in Hadoop
Hello Chhaya,

I'm not sure why the job launches 4 map tasks: your input file is 2 MB, which is smaller than one HDFS block (64 MB by default), so I would expect only 1 mapper to be initialized, unless you have changed the default HDFS block size.
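
To make that expectation concrete, here is a rough sketch of the split arithmetic. It assumes the rule used by the new-API FileInputFormat as I understand it, splitSize = max(minSize, min(maxSize, blockSize)), with a 10% slop on the last chunk. The class and method names are made up for illustration, and the 512 KB cap in the second case is purely hypothetical (nothing in this thread says such a setting is in use).

```java
// Sketch only: plain Java, no Hadoop classes. Names are hypothetical.
final class SplitEstimate {
    // Assumed slop factor: a trailing chunk up to 10% larger than splitSize
    // is kept in one split instead of being cut in two.
    static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one non-empty, splittable file of the given length.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // Default-like settings: 64 MB block, no effective min/max limits.
        long defaultSplit = computeSplitSize(64 * mb, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(2 * mb, defaultSplit)); // 1

        // Hypothetical: maximum split size capped at 512 KB.
        long cappedSplit = computeSplitSize(64 * mb, 1L, 512 * 1024L);
        System.out.println(countSplits(2 * mb, cappedSplit));  // 4
    }
}
```

So with default-like settings a single 2 MB file should give one split, and therefore one map task. Seeing 4 would be consistent with a smaller effective split size or with more than one file in the input directory, but that is only a guess.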

As I see in the code, you use TextInputFormat.class to read your input file. This means that your map function will be executed once per line of your input. However, inside your map function you still read the whole input split:
FileSplit fileSplit = (FileSplit)context.getInputSplit();
This means that if your input has many lines (I guess it does), you read the same input split multiple times, which I suspect is not what you intended.
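
A small simulation of that effect, as a sketch only. It uses no Hadoop classes: the loop over lines stands in for the framework calling map() once per line, and parseWhole stands in for reading and parsing the whole split inside map(). All names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical simulation: counts how often the full XML text gets parsed.
final class PerLineParseDemo {
    static int parseCount = 0;

    // Parses the complete XML text into a DOM tree and counts the call.
    static Document parseWhole(String xml) {
        parseCount++;
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Pattern described above: every per-line call re-parses everything.
    static int parsesWhenReparsingPerLine(String xml) {
        parseCount = 0;
        for (String line : xml.split("\n")) { // one "map()" call per line
            parseWhole(xml);
        }
        return parseCount;
    }

    // Alternative: parse once and let each call reuse the cached tree.
    static int parsesWhenParsingOnce(String xml) {
        parseCount = 0;
        Document doc = parseWhole(xml);
        for (String line : xml.split("\n")) {
            doc.getDocumentElement(); // a real mapper would query the tree here
        }
        return parseCount;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<rows>\n");
        for (int i = 0; i < 1000; i++) {
            sb.append("<row id=\"").append(i).append("\"/>\n");
        }
        sb.append("</rows>");
        String xml = sb.toString();

        System.out.println(parsesWhenReparsingPerLine(xml)); // 1002, one per line
        System.out.println(parsesWhenParsingOnce(xml));      // 1
    }
}
```

The total work in the first variant grows with (number of lines) x (size of the split), which could explain a very long runtime even for a small file. In a real job the parse-once variant would correspond to something like an input format that hands the mapper a whole XML record or file at a time; which option fits depends on your data.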
Moreover, you might want to revise the line
if (colvalue.toString().equalsIgnoreCase(null))
This condition can never be true, because equalsIgnoreCase(null) always returns false. Do you mean
if (colvalue == null) ?
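
A minimal sketch of the difference between the two checks. The type of colvalue is not visible in this message, so Object is an assumption here, and the class and method names are hypothetical. The third method is only relevant if the intent was to also catch the literal text "null".

```java
// Sketch only: colvalue is assumed to be some object with a toString().
final class NullCheckDemo {
    // The original condition: false for every non-null value,
    // and a NullPointerException if colvalue itself is null.
    static boolean originalCheck(Object colvalue) {
        return colvalue.toString().equalsIgnoreCase(null);
    }

    // Plain null test.
    static boolean isNull(Object colvalue) {
        return colvalue == null;
    }

    // Null test that also treats the text "null" (any case) as missing.
    static boolean isNullOrLiteralNull(Object colvalue) {
        return colvalue == null || "null".equalsIgnoreCase(colvalue.toString());
    }

    public static void main(String[] args) {
        System.out.println(originalCheck("abc"));        // false
        System.out.println(originalCheck("null"));       // false
        System.out.println(isNull(null));                // true
        System.out.println(isNullOrLiteralNull("NULL")); // true
    }
}
```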

I think it would be helpful to review the MapReduce programming model once more, in order to better understand when and how each map and reduce function is executed. You can use this link: http://developer.yahoo.com/hadoop/tutorial/module4.html , or the official Apache Hadoop website.
This will help you fit your algorithm into the MapReduce paradigm more easily. If you need further clarification, I would be happy to help!


On Thursday, November 28, 2013 11:03 AM, Chhaya Vishwakarma <[EMAIL PROTECTED]> wrote:
2 MB file

>From:unmesha sreeveni [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, November 28, 2013 2:23 PM
>To: User Hadoop
>Subject: Re: XML parsing in Hadoop

>What is the size of your input file?

>On Thu, Nov 28, 2013 at 2:17 PM, Chhaya Vishwakarma <[EMAIL PROTECTED]> wrote:

>Yes, I have run it without MR; it takes a few seconds to run. So I think it is an MR issue only.
>I have a single-node cluster and it is launching 4 map tasks. I am trying with only one file.

>Chhaya Vishwakarma

>From:Mirko Kämpf [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, November 28, 2013 12:53 PM
>Subject: Re: XML parsing in Hadoop


>Did you run the same code in standalone mode, without the MapReduce framework?
>How long does the code in your map() function take standalone?
>Compare those two times (t_0 in MR mode, t_1 in standalone mode) to find out
>whether it is an MR issue or something that comes from the XML parser logic or the data ...

>Usually it should not be that slow. But what cluster do you have, how many mappers / reducers, and how many of these 2 MB files do you have?

>Best wishes

>2013/11/28 Chhaya Vishwakarma <[EMAIL PROTECTED]>

>The code below parses an XML file. The output of the code is correct, but the job takes a long time to complete.
>It took 20 hours to parse a 2 MB file.
>Kindly suggest what changes could be made to improve the performance.

>package xml;

>import java.io.FileInputStream;
>import java.io.FileNotFoundException;
>import java.io.IOException;
>import java.util.*;

>import javax.xml.parsers.DocumentBuilder;
>import javax.xml.parsers.DocumentBuilderFactory;
>import javax.xml.parsers.ParserConfigurationException;
>import javax.xml.xpath.XPath;
>import javax.xml.xpath.XPathConstants;
>import javax.xml.xpath.XPathExpressionException;
>import javax.xml.xpath.XPathFactory;
>import org.apache.hadoop.fs.FSDataInputStream;
>import org.apache.hadoop.fs.FSInputStream;
>import org.apache.hadoop.fs.FileSystem;
>import org.apache.hadoop.fs.Path;

>import org.apache.hadoop.conf.*;
>import org.apache.hadoop.io.*;

>import org.apache.hadoop.mapred.JobConf;
>import org.apache.hadoop.mapreduce.*;
>import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

>import org.apache.log4j.Logger;
>import org.w3c.dom.Document;