Hadoop >> mail # general >> Hadoop and XML


Thread:
- Peter Minearo 2010-07-16, 21:00
- Soumya Banerjee 2010-07-16, 21:29
- Peter Minearo 2010-07-16, 21:44
- Peter Minearo 2010-07-16, 21:51
- Peter Minearo 2010-07-16, 22:07
- Ted Yu 2010-07-17, 04:43
- Peter Minearo 2010-07-19, 15:01
- Ted Yu 2010-07-19, 16:08
Re: Hadoop and XML
I think the problem is here:

String valueString = new String(valueText.getBytes(), "UTF-8");

Javadoc for Text says:

getBytes() - Returns the raw bytes; however, only data up to getLength() is valid.
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29)

So try getting the length, truncating the byte array at the value returned
by getLength() and THEN converting it to a String.

Jeff
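A stand-alone sketch of that fix, with a plain byte array standing in for Hadoop's Text so it runs without Hadoop on the classpath (class and variable names here are illustrative, not from the original thread):

```java
import java.nio.charset.StandardCharsets;

public class TextDecodeFix {
    // Text.getBytes() returns the whole backing buffer; only the first
    // getLength() bytes are valid. Decode just that prefix, i.e. the
    // equivalent of: new String(text.getBytes(), 0, text.getLength(), "UTF-8")
    static String decodeValid(byte[] rawBuffer, int validLength) {
        return new String(rawBuffer, 0, validLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a reused buffer: 5 valid bytes followed by stale leftovers.
        byte[] buf = "hellostale".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeValid(buf, 5)); // prints "hello"
    }
}
```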

On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> For your initial question on Text.set().
> Text.setCapacity() allocates a new byte array. Since keepData is false, the
> old data wouldn't be copied over.
>
> On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
> [EMAIL PROTECTED]> wrote:
>
> > I am already using XmlInputFormat.  The input into the Map phase is not
> > the problem.  The problem lies between the Map and Reduce phases.
> >
> > BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
> > XmlInputFormat is a lot faster.  In my testing, StreamXmlRecordReader
> > took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat took
> > under 2 minutes. (Using 2-core, 8GB machines)
> >
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, July 16, 2010 9:44 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Hadoop and XML
> >
> > From an earlier post:
> > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
> >
> > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Moving the variable to a local variable did not seem to work:
> > >
> > >
> > > </PrivateRateSet>vateRateSet>
> > >
> > >
> > >
> > > public void map(Object key, Object value, OutputCollector output,
> > >                 Reporter reporter) throws IOException {
> > >     Text valueText = (Text) value;
> > >     String valueString = new String(valueText.getBytes(), "UTF-8");
> > >     String keyString = getXmlKey(valueString);
> > >     Text returnKeyText = new Text();
> > >     Text returnValueText = new Text();
> > >     returnKeyText.set(keyString);
> > >     returnValueText.set(valueString);
> > >     output.collect(returnKeyText, returnValueText);
> > > }
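The stray tail shown above (`</PrivateRateSet>vateRateSet>`) is what a full-buffer decode produces when a shorter record has been written into a reused buffer. A stand-alone simulation with plain arrays (illustrative strings, not Hadoop's actual Text internals):

```java
import java.nio.charset.StandardCharsets;

public class ReusedBufferDemo {
    // Decodes the whole backing array, ignoring the valid length --
    // the same mistake as new String(text.getBytes(), "UTF-8").
    static String decodeWholeBuffer(byte[] buf) {
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A first record fills the buffer...
        byte[] buf = "</PrivateRateSet>".getBytes(StandardCharsets.UTF_8);
        // ...then a shorter record is written into the same (reused) buffer.
        byte[] shorter = "</Pri>".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(shorter, 0, buf, 0, shorter.length);
        // Decoding the whole array picks up the stale tail of the old record:
        System.out.println(decodeWholeBuffer(buf)); // "</Pri>ateRateSet>"
        // Honoring the valid length avoids the corruption:
        System.out.println(new String(buf, 0, shorter.length, StandardCharsets.UTF_8)); // "</Pri>"
    }
}
```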
> > >
> > > -----Original Message-----
> > > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > > Sent: Fri 7/16/2010 2:51 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: RE: Hadoop and XML
> > >
> > > Whoops... right after I sent it, someone else made a suggestion and I
> > > realized what question 2 was about.  I can try that, but wouldn't that
> > > cause Object bloat?  During the Hadoop training I went through, it was
> > > mentioned to reuse the returned Key and Value objects to keep the
> > > number of Objects created to a minimum.  Is this not really a valid
> > > point?
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > > Sent: Friday, July 16, 2010 2:44 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: RE: Hadoop and XML
> > >
> > >
> > > I am not using multi-threaded Map tasks.  Also, if I understand your
> > > second question correctly:
> > > "Also, can you try creating the output key and values in the map
> > > method (method local)?"
> > > In the first code snippet I am doing exactly that.
> > >
> > > Below is the class that runs the Job.
> > >
> > > public class HadoopJobClient {
> > >
> > >        private static final Log LOGGER =
> > >                LogFactory.getLog(Prds.class.getName());
> > >
> > >        public static void main(String[] args) {
> > >                JobConf conf = new JobConf(Prds.class);
> > >
> > >                conf.set("xmlinput.start", "<PrivateRateSet>");
> > >                conf.set("xmlinput.end", "</PrivateRateSet>");
- Ted Yu 2010-07-20, 15:56
- Jeff Bean 2010-07-20, 16:23
- Ted Yu 2010-07-20, 16:38
- Ted Yu 2010-07-20, 16:50
- Peter Minearo 2010-07-20, 16:35
- Scott Carey 2010-07-20, 18:24
- Scott Carey 2010-07-20, 18:29