Re: XmlInputFormat Hadoop -Mapreduce
Hello Ranjini,

PFA the source code for XML Input Format.
Also find the output and the input which I have used.

ATTACHED FILES DESCRIPTION:

(1) emp.xml ---> input data for testing
(2) emp_op.tar.gz ---> output: the results of the map-only job (I have
set the number of reducers to 0)
(3) src.tar ---> the source files (please create a project in Eclipse
and paste the files in). The code is written with the appropriate
package and source folder.

RUNNING THE JOB:

hadoop jar xml.jar com.xg.hadoop.training.mr.MyDriver -D
START_TAG=\<Employee\> -D END_TAG=\</Employee\> emp op

Explanation of the above command:

(1) xml.jar is the name of the jar, which we create through Eclipse,
Maven, or Ant.

(2) com.xg.hadoop.training.mr.MyDriver is the fully qualified driver
class name, i.e. MyDriver resides under the package
com.xg.hadoop.training.mr (a rough sketch of such a driver is shown
after this list).

(3) -D START_TAG=<Employee> will not work as written, because the shell
treats the unescaped < and > as redirection operators, so the tag never
reaches the job. Therefore you need to escape them, and that is why it
is written as -D START_TAG=\<Employee\>; you can see that the two angle
brackets are escaped. The same explanation applies to -D END_TAG.
(4) emp is the input data which is present on HDFS

(5) op is the output directory which will be created as part of mapreduce job.
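
For reference, here is a rough sketch of what such a driver can look
like. This is only illustrative, not the attached code: the actual
MyDriver source is inside src.tar, and the names MyParserMapper and
XmlInputFormatNew are borrowed from the code you pasted below. By
extending Configured and implementing Tool, ToolRunner's
GenericOptionsParser consumes the -D START_TAG/END_TAG options and
leaves emp and op as the two remaining arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already parsed the -D options, so START_TAG and
        // END_TAG are available as properties in the Configuration.
        Configuration conf = getConf();

        // Copy the generic options into the properties the XML input
        // format reads (defaults shown for the Employee tags).
        conf.set("xmlinput.start", conf.get("START_TAG", "<Employee>"));
        conf.set("xmlinput.end", conf.get("END_TAG", "</Employee>"));

        Job job = new Job(conf, "xml-parser");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyParserMapper.class);          // mapper from the attached sources
        job.setInputFormatClass(XmlInputFormatNew.class);  // XML input format from the attached sources
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // args[0] = emp (input on HDFS), args[1] = op (output directory)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}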
NOTE:
The number of reducers is explicitly set to ZERO, so this MapReduce job
will always run zero reducer tasks. You need to change the driver code
accordingly.
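For example, in the driver you pasted below, the call
job.setNumReduceTasks(1) would become (a one-line sketch, not the
attached code):

// Map-only job: no reduce phase, the mapper output is written directly to HDFS
job.setNumReduceTasks(0);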

Hope this helps and you are able to solve your problem. If you face any
difficulty, please feel free to contact me.
Regards,
K Som Shekhar Sharma
+91-8197243810
On Tue, Dec 17, 2013 at 5:42 PM, Ranjini Rathinam
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have attached the code. Please verify.
>
> Please suggest . I am using hadoop 0.20 version.
>
>
> import java.io.IOException;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> //import org.apache.hadoop.mapreduce.lib.input.XmlInputFormat;
>
> public class ParserDriverMain {
>
> public static void main(String[] args) {
> try {
> runJob(args[0], args[1]);
>
> } catch (IOException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> }
>
> }
>
> //The code is mostly self explanatory. You need to define the starting and
> ending tag of to split a record from the xml file and it can be defined in
> the following lines
>
> //conf.set("xmlinput.start", "<startingTag>");
> //conf.set("xmlinput.end", "</endingTag>");
>
>
> public static void runJob(String input,String output ) throws IOException {
>
> Configuration conf = new Configuration();
>
> conf.set("xmlinput.start", "<Employee>");
> conf.set("xmlinput.end", "</Employee>");
> conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
>
> Job job = new Job(conf, "jobName");
>
> input="/user/hduser/Ran/";
> output="/user/task/Sales/";
> FileInputFormat.setInputPaths(job, input);
> job.setJarByClass(ParserDriverMain.class);
> job.setMapperClass(MyParserMapper.class);
> job.setNumReduceTasks(1);
> job.setInputFormatClass(XmlInputFormatNew.class);
> job.setOutputKeyClass(NullWritable.class);
> job.setOutputValueClass(Text.class);
> Path outPath = new Path(output);
> FileOutputFormat.setOutputPath(job, outPath);
> FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
> if (dfs.exists(outPath)) {
> dfs.delete(outPath, true);
> }
>
>
> try {
>
> job.waitForCompletion(true);
>
> } catch (InterruptedException ex) {