Hadoop >> mail # general >> Hadoop and XML


Peter Minearo 2010-07-16, 21:00
Soumya Banerjee 2010-07-16, 21:29
Peter Minearo 2010-07-16, 21:44
Peter Minearo 2010-07-16, 21:51
Peter Minearo 2010-07-16, 22:07
Ted Yu 2010-07-17, 04:43
Peter Minearo 2010-07-19, 15:01
Ted Yu 2010-07-19, 16:08
Jeff Bean 2010-07-20, 13:01
Ted Yu 2010-07-20, 15:56

Re: Hadoop and XML
data.length is the length of the byte array.

Text.getLength() most likely returns a different value than getBytes.length.

Hadoop reuses box class objects like Text, so what it's probably doing is
writing over the byte array, lengthening it as necessary, and just updating
a separate length attribute.

Jeff
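
A minimal sketch of the behavior described above, using only the JDK. ReusableText here is a hypothetical stand-in written for illustration, not Hadoop's Text class; it only mimics the "overwrite the array, grow if needed, track a separate length" pattern. The helper names decodeWholeArray and decodeValidBytes are also made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class ReusedBufferDemo {

    /** Hypothetical stand-in for a reusable value object (NOT Hadoop's Text). */
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Overwrites the backing array in place, growing it only when needed. */
        void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = new byte[src.length]; // grow; old contents are not kept
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length; // only bytes[0..length) are valid
        }

        byte[] getBytes() { return bytes; }

        int getLength() { return length; }
    }

    /** Mirrors the problematic call: decodes the whole array, stale tail included. */
    static String decodeWholeArray(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    /** Decodes only the valid prefix, as the Javadoc note suggests. */
    static String decodeValidBytes(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText value = new ReusableText();
        value.set("hello world"); // first record fills the buffer
        value.set("bye");         // shorter record reuses the same buffer

        System.out.println(decodeWholeArray(value)); // stale bytes leak through
        System.out.println(decodeValidBytes(value)); // only the current record
    }
}
```

With the real Text class, the equivalent change in the mapper would presumably be to pass 0 and getLength() to the String constructor, or to call Text.toString(), rather than decoding the whole array returned by getBytes().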

On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Interesting.
> String class is able to handle this scenario:
>
>     public String(byte[] data, String encoding)
>             throws UnsupportedEncodingException {
>         this(data, 0, data.length, encoding);
>     }
>
>
>
> On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:
>
> > I think the problem is here:
> >
> > String valueString = new String(valueText.getBytes(), "UTF-8");
> >
> > Javadoc for Text says:
> >
> > getBytes()
> > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
> >          Returns the raw bytes; however, only data up to getLength()
> > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29>
> >          is valid.
> >
> > So try getting the length, truncating the byte array at the value returned
> > by getLength() and THEN converting it to a String.
> >
> > Jeff
> >
> > On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> > > For your initial question on Text.set().
> > > Text.setCapacity() allocates a new byte array. Since keepData is false,
> > > old data wouldn't be copied over.
> > >
> > > On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > I am already using XmlInputFormat.  The input into the Map phase is not
> > > > the problem.  The problem lies between the Map and Reduce phases.
> > > >
> > > > BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
> > > > XmlInputFormat is a lot faster.  In my testing, StreamXmlRecordReader
> > > > took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat
> > > > took under 2 minutes. (Using 2-core, 8 GB machines)
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ted Yu [mailto:[EMAIL PROTECTED]]
> > > > Sent: Friday, July 16, 2010 9:44 PM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Re: Hadoop and XML
> > > >
> > > > From an earlier post:
> > > >
> > > > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
> > > >
> > > > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > > > Moving the variable to a local variable did not seem to work:
> > > > >
> > > > >
> > > > > </PrivateRateSet>vateRateSet>
> > > > >
> > > > >
> > > > >
> > > > > public void map(Object key, Object value, OutputCollector output,
> > > > >                 Reporter reporter) throws IOException {
> > > > >     Text valueText = (Text) value;
> > > > >     String valueString = new String(valueText.getBytes(), "UTF-8");
> > > > >     String keyString = getXmlKey(valueString);
> > > > >     Text returnKeyText = new Text();
> > > > >     Text returnValueText = new Text();
> > > > >     returnKeyText.set(keyString);
> > > > >     returnValueText.set(valueString);
> > > > >     output.collect(returnKeyText, returnValueText);
> > > > > }
> > > > >
> > > > > -----Original Message-----
> > > > > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > > > > Sent: Fri 7/16/2010 2:51 PM
> > > > > To: [EMAIL PROTECTED]
> > > > > Subject: RE: Hadoop and XML
> > > > >
> > > > > > Whoops... right after I sent it, and someone else made a suggestion, I
> > > > > > realized what question 2 was about.  I can try that, but wouldn't that
> > > > > > cause Object bloat?  During the Hadoop training I went through, it was
> > > > > > mentioned to reuse the returning Key and Value objects to keep the

Ted Yu 2010-07-20, 16:38
Ted Yu 2010-07-20, 16:50
Peter Minearo 2010-07-20, 16:35
Scott Carey 2010-07-20, 18:24
Scott Carey 2010-07-20, 18:29