
HDFS, mail # user - Processing xml documents using StreamXmlRecordReader


Re: Processing xml documents using StreamXmlRecordReader
Mohammad Tariq 2012-06-19, 12:28
But I have downloaded "hadoop-streaming-0.20.205.0.jar" and it
contains StreamXmlRecordReader.class file. This means it should
support StreamInputFormat.

Regards,
    Mohammad Tariq
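
For context, StreamXmlRecordReader delimits records by scanning the input for the configured begin and end strings: everything from one begin marker through the matching end marker, inclusive, becomes a single record. The plain-Java helper below mimics that behaviour with no Hadoop dependency (the class and method names are illustrative, not part of the streaming library):

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordSplit {
    // Return every span from `begin` through `end` (inclusive), the way
    // StreamXmlRecordReader delimits records. Illustrative sketch only.
    static List<String> records(String text, String begin, String end) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = text.indexOf(begin, from);
            if (s < 0) break;
            int e = text.indexOf(end, s + begin.length());
            if (e < 0) break;
            out.add(text.substring(s, e + end.length()));
            from = e + end.length();
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<doc><info>a</info>junk<info>b</info></doc>";
        System.out.println(records(xml, "<info>", "</info>"));
        // prints [<info>a</info>, <info>b</info>]
    }
}
```

Each <info>...</info> span is emitted as one record; text between records (like "junk" above) is skipped.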
On Tue, Jun 19, 2012 at 5:54 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> Thanks Madhu. I'll do that.
>
> Regards,
>     Mohammad Tariq
>
>
> On Tue, Jun 19, 2012 at 5:43 PM, madhu phatak <[EMAIL PROTECTED]> wrote:
>> Seems like StreamInputFormat is not yet ported to the new API. That's why
>> you are not able to set it as the InputFormatClass. You can file a jira for
>> this issue.
>>
>>
>> On Tue, Jun 19, 2012 at 4:49 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>
>>> My driver function looks like this -
>>>
>>> public static void main(String[] args) throws IOException,
>>> InterruptedException, ClassNotFoundException {
>>>                // TODO Auto-generated method stub
>>>
>>>                Configuration conf = new Configuration();
>>>                Job job = new Job();
>>>                conf.set("stream.recordreader.class",
>>> "org.apache.hadoop.streaming.StreamXmlRecordReader");
>>>                conf.set("stream.recordreader.begin", "<info>");
>>>                conf.set("stream.recordreader.end", "</info>");
>>>                job.setInputFormatClass(StreamInputFormat.class);
>>>                job.setOutputKeyClass(Text.class);
>>>                job.setOutputValueClass(IntWritable.class);
>>>                FileInputFormat.addInputPath(job, new
>>> Path("/mapin/demo.xml"));
>>>                FileOutputFormat.setOutputPath(job, new
>>> Path("/mapout/demo"));
>>>                job.waitForCompletion(true);
>>>        }
>>>
>>> Could you please point out my mistake?
>>>
>>> Regards,
>>>     Mohammad Tariq
>>>
>>>
>>> On Tue, Jun 19, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>
>>> wrote:
>>> > Hello Madhu,
>>> >
>>> >             Thanks for the response. Actually I was trying to use the
>>> > new API (Job). Have you tried that? I was not able to set the
>>> > InputFormat using the Job API.
>>> >
>>> > Regards,
>>> >     Mohammad Tariq
>>> >
>>> >
>>> > On Tue, Jun 19, 2012 at 4:28 PM, madhu phatak <[EMAIL PROTECTED]>
>>> > wrote:
>>> >> Hi,
>>> >>  Set the following properties in driver class
>>> >>
>>> >>   jobConf.set("stream.recordreader.class",
>>> >>       "org.apache.hadoop.streaming.StreamXmlRecordReader");
>>> >>   jobConf.set("stream.recordreader.begin", "start-tag");
>>> >>   jobConf.set("stream.recordreader.end", "end-tag");
>>> >>
>>> >>   jobConf.setInputFormat(StreamInputFormat.class);
>>> >>
>>> >>  In the Mapper, the xml record will come as a key of type Text, so
>>> >> your mapper will look like
>>> >>
>>> >>   public class MyMapper<K,V>  implements Mapper<Text,Text,K,V>
>>> >>
>>> >>
>>> >> On Tue, Jun 19, 2012 at 2:49 AM, Mohammad Tariq <[EMAIL PROTECTED]>
>>> >> wrote:
>>> >>>
>>> >>> Hello list,
>>> >>>
>>> >>>        Could anyone who has written MapReduce jobs to process xml
>>> >>> documents stored in their cluster using "StreamXmlRecordReader" share
>>> >>> his/her experience, or provide me some pointers on this? Many thanks.
>>> >>>
>>> >>> Regards,
>>> >>>     Mohammad Tariq
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> https://github.com/zinnia-phatak-dev/Nectar
>>> >>
>>
>>
>>
>>
>> --
>> https://github.com/zinnia-phatak-dev/Nectar
>>
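
Putting the advice in this thread together, a driver written against the old (org.apache.hadoop.mapred) API might look like the sketch below. This is an untested outline, not a verified implementation: it assumes hadoop-streaming 0.20.x on the classpath, and the class name and HDFS paths are taken from the messages above for illustration.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(XmlJobDriver.class);
        // StreamXmlRecordReader treats everything between the begin and
        // end strings (inclusive) as one record.
        jobConf.set("stream.recordreader.class",
                "org.apache.hadoop.streaming.StreamXmlRecordReader");
        jobConf.set("stream.recordreader.begin", "<info>");
        jobConf.set("stream.recordreader.end", "</info>");
        // Old-API setter: StreamInputFormat implements
        // org.apache.hadoop.mapred.InputFormat in 0.20.x.
        jobConf.setInputFormat(StreamInputFormat.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(jobConf, new Path("/mapin/demo.xml"));
        FileOutputFormat.setOutputPath(jobConf, new Path("/mapout/demo"));
        JobClient.runJob(jobConf);
    }
}
```

Note also that in the new-API snippet earlier in the thread, the Configuration is never passed to the Job constructor (`new Job()` rather than `new Job(conf)`), so the stream.recordreader.* settings would not reach the job even if StreamInputFormat supported the new API. Here the properties are set on the same JobConf the job runs with.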