Hadoop, mail # general - Hadoop and XML


Re: Hadoop and XML
Ted Yu 2010-07-19, 16:08
Regarding your initial question about Text.set():
Text.setCapacity() allocates a new byte array. Since keepData is false, the
old data is not copied over.
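
To illustrate how a reused buffer can produce trailing garbage like the "</PrivateRateSet>vateRateSet>" output quoted below, here is a minimal sketch. ReusableBuffer and StaleBytesDemo are hypothetical stand-ins written for this example, not Hadoop classes. The sketch assumes, as I understand Text to behave, that the backing array is only replaced when it is too small, so a shorter record written over a longer one leaves the old tail in place, and that getBytes() hands back the whole backing array rather than just the valid bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a reusable byte buffer; this is NOT Hadoop's Text.
class ReusableBuffer {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Assumption: grow only when too small; with keepData == false
    // the old contents are not copied into the new array.
    private void setCapacity(int len, boolean keepData) {
        if (bytes.length < len) {
            byte[] grown = new byte[len];
            if (keepData) {
                System.arraycopy(bytes, 0, grown, 0, length);
            }
            bytes = grown;
        }
    }

    void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        setCapacity(utf8.length, false);
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    // Whole backing array; bytes past getLength() may be stale.
    byte[] getBytes() { return bytes; }

    // Number of valid bytes at the start of the backing array.
    int getLength() { return length; }
}

class StaleBytesDemo {
    // Decodes the entire backing array, stale tail included.
    static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    // Decodes only the first getLength() bytes.
    static String decodeValidRange(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set("<PrivateRateSet>AAAAAAAAAAAA</PrivateRateSet>"); // longer record first
        buf.set("<PrivateRateSet></PrivateRateSet>");             // shorter record reuses the array
        System.out.println(decodeWholeArray(buf)); // <PrivateRateSet></PrivateRateSet>vateRateSet>
        System.out.println(decodeValidRange(buf)); // <PrivateRateSet></PrivateRateSet>
    }
}
```

If the same thing is happening in the mapper, decoding only the first getLength() bytes of valueText.getBytes(), or calling valueText.toString(), should avoid the stale tail.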

On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
[EMAIL PROTECTED]> wrote:

> I am already using XmlInputFormat.  The input into the Map phase is not
> the problem.  The problem lies between the Map and Reduce phases.
>
> BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
> XmlInputFormat is a lot faster.  From my testing, StreamXmlRecordReader
> took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat took
> under 2 minutes (using 2-core, 8 GB machines).
>
>
> -----Original Message-----
> From: Ted Yu [mailto:[EMAIL PROTECTED]]
> Sent: Friday, July 16, 2010 9:44 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hadoop and XML
>
> From an earlier post:
> http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>
> On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> [EMAIL PROTECTED]> wrote:
>
> > Moving the variable to a local variable did not seem to work:
> >
> >
> > </PrivateRateSet>vateRateSet>
> >
> >
> >
> > public void map(Object key, Object value, OutputCollector output,
> >                 Reporter reporter) throws IOException {
> >         Text valueText = (Text) value;
> >         String valueString = new String(valueText.getBytes(), "UTF-8");
> >         String keyString = getXmlKey(valueString);
> >         Text returnKeyText = new Text();
> >         Text returnValueText = new Text();
> >         returnKeyText.set(keyString);
> >         returnValueText.set(valueString);
> >         output.collect(returnKeyText, returnValueText);
> > }
> >
> > -----Original Message-----
> > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > Sent: Fri 7/16/2010 2:51 PM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Hadoop and XML
> >
> > Whoops... right after I sent it, and after someone else made a
> > suggestion, I realized what question 2 was about.  I can try that, but
> > wouldn't that cause object bloat?  During the Hadoop training I went
> > through, it was mentioned to reuse the output key and value objects to
> > keep the number of objects created to a minimum.  Is this not really a
> > valid point?
> >
> >
> >
> > -----Original Message-----
> > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, July 16, 2010 2:44 PM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Hadoop and XML
> >
> >
> > I am not using multi-threaded Map tasks.  Also, if I understand your
> > second question correctly:
> > "Also can you try creating the output key and values in the map
> > method (method local)?"
> > In the first code snippet I am doing exactly that.
> >
> > Below is the class that runs the Job.
> >
> > public class HadoopJobClient {
> >
> >        private static final Log LOGGER =
> >                LogFactory.getLog(Prds.class.getName());
> >
> >        public static void main(String[] args) {
> >                JobConf conf = new JobConf(Prds.class);
> >
> >                conf.set("xmlinput.start", "<PrivateRateSet>");
> >                conf.set("xmlinput.end", "</PrivateRateSet>");
> >
> >                conf.setJobName("PRDS Parse");
> >
> >                conf.setOutputKeyClass(Text.class);
> >                conf.setOutputValueClass(Text.class);
> >
> >                conf.setMapperClass(PrdsMapper.class);
> >                conf.setReducerClass(PrdsReducer.class);
> >
> >                conf.setInputFormat(XmlInputFormat.class);
> >                conf.setOutputFormat(TextOutputFormat.class);
> >
> >                FileInputFormat.setInputPaths(conf, new Path(args[0]));
> >                FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> >
> >                // Run the job
> >                try {
> >                        JobClient.runJob(conf);
> >                } catch (IOException e) {
> >                        LOGGER.error(e.getMessage(), e);