Re: Hadoop and XML
Regarding your initial question on Text.set(): Text.setCapacity() allocates a
new byte array. Since keepData is false, the old data wouldn't be copied over.
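
To make that concrete: Text.getBytes() returns the whole backing array, which
can be longer than getLength() after the object is reused. A minimal sketch
(assuming the 0.20-era org.apache.hadoop.io.Text, where setCapacity() allocates
exactly the requested size; newer versions may over-allocate):

import org.apache.hadoop.io.Text;

public class TextReuseDemo {
    public static void main(String[] args) throws Exception {
        Text t = new Text();

        byte[] longer = "<PrivateRateSet>".getBytes("UTF-8");    // 16 bytes
        t.set(longer, 0, longer.length);   // backing array sized to 16

        byte[] shorter = "<Private>".getBytes("UTF-8");          // 9 bytes
        t.set(shorter, 0, shorter.length); // setCapacity(9, false): the array
                                           // is big enough, so it is reused and
                                           // bytes 9..15 still hold "ateSet>"

        // WRONG: decodes the entire backing array, stale tail included
        System.out.println(new String(t.getBytes(), "UTF-8"));
        // prints "<Private>ateSet>" -- the same kind of mangling seen below

        // RIGHT: decode only the first getLength() bytes (what toString() does)
        System.out.println(new String(t.getBytes(), 0, t.getLength(), "UTF-8"));
        // prints "<Private>"
    }
}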

On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
[EMAIL PROTECTED]> wrote:

> I am already using XmlInputFormat.  The input into the Map phase is not
> the problem.  The problem lies between the Map and Reduce phases.
>
> BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
> XmlInputFormat is a lot faster.  In my testing, StreamXmlRecordReader
> took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat took
> under 2 minutes (using 2-core, 8 GB machines).
>
>
> -----Original Message-----
> From: Ted Yu [mailto:[EMAIL PROTECTED]]
> Sent: Friday, July 16, 2010 9:44 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hadoop and XML
>
> From an earlier post:
> http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>
> On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> [EMAIL PROTECTED]> wrote:
>
> > Moving the variables to local variables did not seem to work; the
> > output still comes out mangled, e.g.:
> >
> > </PrivateRateSet>vateRateSet>
> >
> >
> >
> > public void map(Object key, Object value,
> >                 OutputCollector output, Reporter reporter)
> >         throws IOException {
> >     Text valueText = (Text) value;
> >     String valueString = new String(valueText.getBytes(), "UTF-8");
> >     String keyString = getXmlKey(valueString);
> >     Text returnKeyText = new Text();
> >     Text returnValueText = new Text();
> >     returnKeyText.set(keyString);
> >     returnValueText.set(valueString);
> >     output.collect(returnKeyText, returnValueText);
> > }
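
Since the framework reuses the input value Text between records, the unbounded
decode in the snippet above is the likely source of the stale trailing bytes.
A sketch of the fix, bounding the conversion by getLength() (TextDecode and
toUtf8String are hypothetical names):

import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.Text;

final class TextDecode {
    // getBytes() returns the whole backing array of a (possibly reused) Text,
    // so decode only the first getLength() bytes.
    static String toUtf8String(Text t) throws UnsupportedEncodingException {
        return new String(t.getBytes(), 0, t.getLength(), "UTF-8");
        // equivalently: return t.toString();
    }
}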
> >
> > -----Original Message-----
> > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > Sent: Fri 7/16/2010 2:51 PM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Hadoop and XML
> >
> > Whoops... right after I sent it, and someone else made a suggestion, I
> > realized what question 2 was about.  I can try that, but wouldn't that
> > cause Object bloat?  During the Hadoop training I went through, it was
> > mentioned to reuse the returned Key and Value objects to keep the
> > number of Objects created down to a minimum.  Is this not really a
> > valid point?
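
The reuse advice does hold for the old mapred API: OutputCollector.collect()
serializes the key and value before map() returns, so refilling two long-lived
Text fields is safe and avoids per-record garbage. A sketch of that pattern
(names follow the poster's code; the getXmlKey body is stubbed here):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PrdsMapper extends MapReduceBase
        implements Mapper<Object, Text, Text, Text> {

    // Allocated once per task, refilled on every map() call.
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    public void map(Object key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Decode only the valid region of the framework-reused input Text.
        String valueString =
                new String(value.getBytes(), 0, value.getLength(), "UTF-8");
        outKey.set(getXmlKey(valueString));
        outValue.set(valueString);
        // collect() serializes the pair immediately, so reuse is safe.
        output.collect(outKey, outValue);
    }

    // Stub: the poster's real helper extracts the record key from the XML.
    private String getXmlKey(String xml) {
        return xml;
    }
}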
> >
> >
> >
> > -----Original Message-----
> > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, July 16, 2010 2:44 PM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Hadoop and XML
> >
> >
> > I am not using multi-threaded Map tasks.  Also, if I understand your
> > second question correctly:
> > "Also can you try creating the output key and values in the map
> > method (method local)?"
> > In the first code snippet I am doing exactly that.
> >
> > Below is the class that runs the Job.
> >
> > import java.io.IOException;
> >
> > import org.apache.commons.logging.Log;
> > import org.apache.commons.logging.LogFactory;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapred.FileInputFormat;
> > import org.apache.hadoop.mapred.FileOutputFormat;
> > import org.apache.hadoop.mapred.JobClient;
> > import org.apache.hadoop.mapred.JobConf;
> > import org.apache.hadoop.mapred.TextOutputFormat;
> > // plus an import for XmlInputFormat (from the article linked above)
> >
> > public class HadoopJobClient {
> >
> >        private static final Log LOGGER =
> >                LogFactory.getLog(HadoopJobClient.class.getName());
> >
> >        public static void main(String[] args) {
> >                JobConf conf = new JobConf(HadoopJobClient.class);
> >
> >                conf.set("xmlinput.start", "<PrivateRateSet>");
> >                conf.set("xmlinput.end", "</PrivateRateSet>");
> >
> >                conf.setJobName("PRDS Parse");
> >
> >                conf.setOutputKeyClass(Text.class);
> >                conf.setOutputValueClass(Text.class);
> >
> >                conf.setMapperClass(PrdsMapper.class);
> >                conf.setReducerClass(PrdsReducer.class);
> >
> >                conf.setInputFormat(XmlInputFormat.class);
> >                conf.setOutputFormat(TextOutputFormat.class);
> >
> >                FileInputFormat.setInputPaths(conf, new Path(args[0]));
> >                FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> >
> >                // Run the job
> >                try {
> >                        JobClient.runJob(conf);
> >                } catch (IOException e) {
> >                        LOGGER.error(e.getMessage(), e);
> >                }
> >        }
> > }