Re: Hadoop and XML
From an earlier post:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
[EMAIL PROTECTED]> wrote:

> Moving the variable to a local variable did not seem to work:
>
>
> </PrivateRateSet>vateRateSet>
>
>
>
> public void map(Object key, Object value, OutputCollector output, Reporter reporter) throws IOException {
>         Text valueText = (Text) value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         Text returnKeyText = new Text();
>         Text returnValueText = new Text();
>         returnKeyText.set(keyString);
>         returnValueText.set(valueString);
>         output.collect(returnKeyText, returnValueText);
> }
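
A note on the output above: the stray "vateRateSet>" fragment is the classic symptom of decoding a reused Text's entire backing buffer. Text.getBytes() returns the internal byte array, which can be longer than the current record, so trailing bytes from a previous, longer record leak into the string; only the first getLength() bytes are valid. If that is the cause here, a minimal fix is to decode only the valid range, or let Text do the decoding itself:

        // Decode only the getLength() valid bytes of the (possibly reused) buffer
        String valueString =
                new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
        // ...or equivalently:
        String valueString2 = valueText.toString();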
>
> -----Original Message-----
> From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> Sent: Fri 7/16/2010 2:51 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Hadoop and XML
>
> Whoops... right after I sent it, someone else made a suggestion and I
> realized what question 2 was about.  I can try that, but wouldn't that
> cause object bloat?  During the Hadoop training I went through, it was
> mentioned to reuse the returned Key and Value objects to keep the number
> of objects created to a minimum.  Is that not really a valid point?
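
The reuse advice is sound for the old mapred API: OutputCollector.collect() serializes the key and value before returning, so a mapper can hold two Text instances as fields and overwrite them on every call without extra allocation. A minimal sketch of that pattern (the class name and the getXmlKey() stub are illustrative, not the poster's actual code):

        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class XmlKeyMapper extends MapReduceBase
                implements Mapper<Object, Text, Text, Text> {

            // Reused across map() calls; collect() copies their contents,
            // so overwriting them on the next record is safe.
            private final Text outKey = new Text();
            private final Text outValue = new Text();

            public void map(Object key, Text value,
                            OutputCollector<Text, Text> output,
                            Reporter reporter) throws IOException {
                // Decode only the valid bytes of the (possibly reused) input value
                String xml = new String(value.getBytes(), 0, value.getLength(), "UTF-8");
                outKey.set(getXmlKey(xml));
                outValue.set(xml);
                output.collect(outKey, outValue);
            }

            private String getXmlKey(String xml) {
                return xml; // stand-in for the poster's key-extraction logic
            }
        }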
>
>
>
> -----Original Message-----
> From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> Sent: Friday, July 16, 2010 2:44 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Hadoop and XML
>
>
> I am not using multi-threaded Map tasks.  Also, if I understand your
> second question correctly:
> "Also, can you try creating the output key and values in the map
> method (method local)?"
> In the first code snippet I am doing exactly that.
>
> Below is the class that runs the Job.
>
> public class HadoopJobClient {
>
>        private static final Log LOGGER =
>                LogFactory.getLog(Prds.class.getName());
>
>        public static void main(String[] args) {
>                JobConf conf = new JobConf(Prds.class);
>
>                conf.set("xmlinput.start", "<PrivateRateSet>");
>                conf.set("xmlinput.end", "</PrivateRateSet>");
>
>                conf.setJobName("PRDS Parse");
>
>                conf.setOutputKeyClass(Text.class);
>                conf.setOutputValueClass(Text.class);
>
>                conf.setMapperClass(PrdsMapper.class);
>                conf.setReducerClass(PrdsReducer.class);
>
>                conf.setInputFormat(XmlInputFormat.class);
>                conf.setOutputFormat(TextOutputFormat.class);
>
>                FileInputFormat.setInputPaths(conf, new Path(args[0]));
>                FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
>                // Run the job
>                try {
>                        JobClient.runJob(conf);
>                } catch (IOException e) {
>                        LOGGER.error(e.getMessage(), e);
>                }
>
>        }
>
>
> }
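
For context, the xmlinput.start / xmlinput.end keys above are the ones read by the XmlInputFormat described in the article linked at the top of the thread (Mahout's): its record reader scans each split for the start tag and emits everything up to and including the end tag as one value, so every map() call should see one complete <PrivateRateSet> block. Roughly (a sketch of that reader's setup, not the exact Mahout source):

        // Inside the record reader: the delimiters come straight from the
        // job configuration set by the client above.
        byte[] startTag = jobConf.get("xmlinput.start").getBytes("UTF-8");
        byte[] endTag = jobConf.get("xmlinput.end").getBytes("UTF-8");
        // Scan the split for startTag, buffer bytes until endTag is seen,
        // then emit the buffered span as a single Text value.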
>
>
>
>
> -----Original Message-----
> From: Soumya Banerjee [mailto:[EMAIL PROTECTED]]
> Sent: Fri 7/16/2010 2:29 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hadoop and XML
>
> Hi,
>
> Can you please share the code of the job submission client?
>
> Also, can you try creating the output key and values in the map method
> (method local)?
> Make sure you are not using a multi-threaded map task configuration.
>
> map()
> {
>   Text keyText = new Text();
>   Text valueText = new Text();
>
>   // rest of the code
> }
>
> Soumya.
>
> On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <
> [EMAIL PROTECTED]> wrote:
>
> > I have an XML file that has sparse data in it.  I am running a
> > MapReduce Job that reads in an XML file, pulls out a Key from within
> > the XML snippet and then hands back the Key and the XML snippet (as
> > the Value) to the OutputCollector.  The reason is to sort the file