Re: Processing xml documents using StreamXmlRecordReader
Thanks Madhu. I'll do that.

Regards,
    Mohammad Tariq
On Tue, Jun 19, 2012 at 5:43 PM, madhu phatak <[EMAIL PROTECTED]> wrote:
> Seems like StreamInputFormat has not yet been ported to the new API. That's
> why you are not able to set it as the InputFormatClass. You can file a JIRA
> for this issue.
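>
> Until it is ported, one workaround is to drive this particular job with the
> old API instead. A minimal sketch along those lines (untested; it reuses the
> paths and tags from your driver, MyJob is a placeholder class name, and
> FileInputFormat/FileOutputFormat are the org.apache.hadoop.mapred versions):
>
>     JobConf jobConf = new JobConf(MyJob.class);
>     jobConf.set("stream.recordreader.class",
>         "org.apache.hadoop.streaming.StreamXmlRecordReader");
>     jobConf.set("stream.recordreader.begin", "<info>");
>     jobConf.set("stream.recordreader.end", "</info>");
>     // old-API setInputFormat accepts StreamInputFormat
>     jobConf.setInputFormat(StreamInputFormat.class);
>     jobConf.setOutputKeyClass(Text.class);
>     jobConf.setOutputValueClass(IntWritable.class);
>     FileInputFormat.addInputPath(jobConf, new Path("/mapin/demo.xml"));
>     FileOutputFormat.setOutputPath(jobConf, new Path("/mapout/demo"));
>     JobClient.runJob(jobConf);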
>
>
> On Tue, Jun 19, 2012 at 4:49 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>
>> My driver function looks like this -
>>
>> public static void main(String[] args) throws IOException,
>>         InterruptedException, ClassNotFoundException {
>>     Configuration conf = new Configuration();
>>     Job job = new Job();
>>     conf.set("stream.recordreader.class",
>>             "org.apache.hadoop.streaming.StreamXmlRecordReader");
>>     conf.set("stream.recordreader.begin", "<info>");
>>     conf.set("stream.recordreader.end", "</info>");
>>     job.setInputFormatClass(StreamInputFormat.class);
>>     job.setOutputKeyClass(Text.class);
>>     job.setOutputValueClass(IntWritable.class);
>>     FileInputFormat.addInputPath(job, new Path("/mapin/demo.xml"));
>>     FileOutputFormat.setOutputPath(job, new Path("/mapout/demo"));
>>     job.waitForCompletion(true);
>> }
>>
>> Could you please point out my mistake?
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Tue, Jun 19, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>
>> wrote:
>> > Hello Madhu,
>> >
>> >             Thanks for the response. Actually I was trying to use the
>> > new API (Job). Have you tried that? I was not able to set the
>> > InputFormat using the Job API.
>> >
>> > Regards,
>> >     Mohammad Tariq
>> >
>> >
>> > On Tue, Jun 19, 2012 at 4:28 PM, madhu phatak <[EMAIL PROTECTED]>
>> > wrote:
>> >> Hi,
>> >>  Set the following properties in the driver class:
>> >>
>> >>   jobConf.set("stream.recordreader.class",
>> >>       "org.apache.hadoop.streaming.StreamXmlRecordReader");
>> >>   jobConf.set("stream.recordreader.begin", "start-tag");
>> >>   jobConf.set("stream.recordreader.end", "end-tag");
>> >>
>> >>   jobConf.setInputFormat(StreamInputFormat.class);
>> >>
>> >>  In the Mapper, the xml record will come in as the key, of type Text,
>> >> so your mapper will look like
>> >>
>> >>   public class MyMapper<K,V>  implements Mapper<Text,Text,K,V>
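>> >>
>> >> For example, a mapper that just counts the xml records could look like
>> >> this (an untested sketch; the class name and counting logic are only
>> >> illustrative, and the imports come from org.apache.hadoop.mapred and
>> >> org.apache.hadoop.io):
>> >>
>> >>   public class MyMapper extends MapReduceBase
>> >>       implements Mapper<Text, Text, Text, IntWritable> {
>> >>     private static final IntWritable ONE = new IntWritable(1);
>> >>
>> >>     public void map(Text key, Text value,
>> >>         OutputCollector<Text, IntWritable> output, Reporter reporter)
>> >>         throws IOException {
>> >>       // key holds the whole xml record between the begin and end tags
>> >>       output.collect(new Text("records"), ONE);
>> >>     }
>> >>   }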
>> >>
>> >>
>> >> On Tue, Jun 19, 2012 at 2:49 AM, Mohammad Tariq <[EMAIL PROTECTED]>
>> >> wrote:
>> >>>
>> >>> Hello list,
>> >>>
>> >>>        Could anyone who has written MapReduce jobs to process xml
>> >>> documents stored in their cluster using "StreamXmlRecordReader" share
>> >>> his/her experience, or provide some pointers addressing that? Many
>> >>> thanks.
>> >>>
>> >>> Regards,
>> >>>     Mohammad Tariq
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> https://github.com/zinnia-phatak-dev/Nectar
>> >>
>
>
>
>
> --
> https://github.com/zinnia-phatak-dev/Nectar
>