HDFS >> mail # user >> Processing xml documents using StreamXmlRecordReader


Re: Processing xml documents using StreamXmlRecordReader
It seems StreamInputFormat has not yet been ported to the new API. That's why you are
not able to set it as the InputFormatClass. You could file a JIRA for this issue.
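Since StreamInputFormat is only available in the old `mapred` API, a driver for this job would use JobConf/JobClient instead of the new Job class. Below is a minimal sketch assuming the begin/end tags and HDFS paths from the original question; the class name `XmlDriver` is illustrative, and running it requires the hadoop-streaming jar on the classpath and a Hadoop cluster:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlDriver {
    public static void main(String[] args) throws IOException {
        JobConf jobConf = new JobConf(XmlDriver.class);

        // Tell StreamXmlRecordReader which tags delimit one record.
        jobConf.set("stream.recordreader.class",
                "org.apache.hadoop.streaming.StreamXmlRecordReader");
        jobConf.set("stream.recordreader.begin", "<info>");
        jobConf.set("stream.recordreader.end", "</info>");

        // StreamInputFormat exists only in the old mapred API, so the
        // driver sets it on JobConf rather than on the new Job class.
        jobConf.setInputFormat(StreamInputFormat.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(jobConf, new Path("/mapin/demo.xml"));
        FileOutputFormat.setOutputPath(jobConf, new Path("/mapout/demo"));

        JobClient.runJob(jobConf);
    }
}
```

Note that the configuration properties are set on the same JobConf object that is submitted, which avoids the separate pitfall in the quoted driver below, where properties are set on a Configuration that is never passed to the Job.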

On Tue, Jun 19, 2012 at 4:49 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> My driver function looks like this -
>
> public static void main(String[] args) throws IOException,
> InterruptedException, ClassNotFoundException {
>                // TODO Auto-generated method stub
>
>                Configuration conf = new Configuration();
>                Job job = new Job();
>                conf.set("stream.recordreader.class",
> "org.apache.hadoop.streaming.StreamXmlRecordReader");
>                conf.set("stream.recordreader.begin", "<info>");
>                conf.set("stream.recordreader.end", "</info>");
>                job.setInputFormatClass(StreamInputFormat.class);
>                job.setOutputKeyClass(Text.class);
>                job.setOutputValueClass(IntWritable.class);
>                FileInputFormat.addInputPath(job, new
> Path("/mapin/demo.xml"));
>                FileOutputFormat.setOutputPath(job, new
> Path("/mapout/demo"));
>                job.waitForCompletion(true);
>        }
>
> Could you please point out my mistake?
>
> Regards,
>     Mohammad Tariq
>
>
> On Tue, Jun 19, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>
> wrote:
> > Hello Madhu,
> >
> >             Thanks for the response. Actually I was trying to use the
> > new API (Job). Have you tried that? I was not able to set the
> > InputFormat using the Job API.
> >
> > Regards,
> >     Mohammad Tariq
> >
> >
> > On Tue, Jun 19, 2012 at 4:28 PM, madhu phatak <[EMAIL PROTECTED]>
> wrote:
> >> Hi,
> >>  Set the following properties in driver class
> >>
> >>   jobConf.set("stream.recordreader.class",
> >> "org.apache.hadoop.streaming.StreamXmlRecordReader");
> >> jobConf.set("stream.recordreader.begin",
> >> "start-tag");
> >> jobConf.set("stream.recordreader.end",
> >> "end-tag");
> >>                         jobConf.setInputFormat(StreamInputFormat.class);
> >>
> >>  In the Mapper, the xml record will come in as a key of type Text, so your
> >> mapper will look like
> >>
> >>   public class MyMapper<K,V>  implements Mapper<Text,Text,K,V>
> >>
> >>
> >> On Tue, Jun 19, 2012 at 2:49 AM, Mohammad Tariq <[EMAIL PROTECTED]>
> wrote:
> >>>
> >>> Hello list,
> >>>
> >>>        Could anyone who has written MapReduce jobs to process xml
> >>> documents stored in their cluster using "StreamXmlRecordReader" share
> >>> his/her experience, or provide me some pointers addressing that?
> >>> Many thanks.
> >>>
> >>> Regards,
> >>>     Mohammad Tariq
> >>
> >>
> >>
> >>
> >> --
> >> https://github.com/zinnia-phatak-dev/Nectar
> >>
>

--
https://github.com/zinnia-phatak-dev/Nectar