Hadoop, mail # general - Hadoop and XML


Re: Hadoop and XML
Ted Yu 2010-07-20, 16:38
So the correct call should be:
String valueString = new String(valueText.getBytes(), 0,
valueText.getLength(), "UTF-8");
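
To see why the offset/length form matters, here is a minimal, self-contained sketch. ReusableBuffer is a hypothetical stand-in for a reused Text object, not Hadoop's actual implementation; it only assumes the behaviour described in this thread (a backing array that grows when needed and a separate length field). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ReusedBufferDemo {

    // Hypothetical stand-in for a reused box object such as Text.
    // Assumption: the backing array only grows, and a separate
    // length field marks how many bytes are currently valid.
    static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = new byte[src.length]; // grow, never shrink
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length; // stale bytes may remain past this index
        }

        byte[] getBytes() { return bytes; }

        int getLength() { return length; }
    }

    // Decodes the entire backing array, including any stale tail.
    static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    // Decodes only the valid prefix, as in the corrected call above.
    static String decodeValidPrefix(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set("<Rate>12345</Rate>"); // 18 bytes
        buf.set("<Rate>1</Rate>");     // 14 bytes, written over the same array

        System.out.println(decodeWholeArray(buf));  // <Rate>1</Rate>ate>
        System.out.println(decodeValidPrefix(buf)); // <Rate>1</Rate>
    }
}
```

The stale "ate>" tail in the first line of output has the same shape as the "</PrivateRateSet>vateRateSet>" artifact reported earlier in the thread.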

Cheers

On Tue, Jul 20, 2010 at 9:23 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:

> data.length is the length of the byte array.
>
> Text.getLength() most likely returns a different value than
> getBytes().length.
>
> Hadoop reuses box class objects like Text, so what it's probably doing is
> writing over the byte array, lengthening it as necessary, and just updating
> a separate length attribute.
>
> Jeff
>
> On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Interesting.
> > String class is able to handle this scenario:
> >
> >     public String(byte[] data, String encoding) throws UnsupportedEncodingException {
> >         this(data, 0, data.length, encoding);
> >     }
> >
> >
> >
> > On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:
> >
> > > I think the problem is here:
> > >
> > > String valueString = new String(valueText.getBytes(), "UTF-8");
> > >
> > > Javadoc for Text says:
> > >
> > > getBytes()
> > > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
> > >          Returns the raw bytes; however, only data up to getLength()
> > > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29>
> > >          is valid.
> > >
> > > So try getting the length, truncating the byte array at the value returned
> > > by getLength() and THEN converting it to a String.
> > >
> > > Jeff
> > >
> > > On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Regarding your initial question on Text.set():
> > > > Text.setCapacity() allocates a new byte array. Since keepData is false,
> > > > the old data wouldn't be copied over.
> > > >
> > > > On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > > > I am already using XmlInputFormat.  The input into the Map phase is not
> > > > > the problem.  The problem lies between the Map and Reduce phases.
> > > > >
> > > > > BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
> > > > > XmlInputFormat is a lot faster.  From my testing, StreamXmlRecordReader
> > > > > took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat was
> > > > > under 2 minutes. (Using 2 Core, 8GB machines)
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Ted Yu [mailto:[EMAIL PROTECTED]]
> > > > > Sent: Friday, July 16, 2010 9:44 PM
> > > > > To: [EMAIL PROTECTED]
> > > > > Subject: Re: Hadoop and XML
> > > > >
> > > > > From an earlier post:
> > > > >
> > > > > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
> > > > >
> > > > > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> > > > > [EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Moving the variable to a local variable did not seem to work:
> > > > > >
> > > > > >
> > > > > > </PrivateRateSet>vateRateSet>
> > > > > >
> > > > > >
> > > > > >
> > > > > > public void map(Object key, Object value, OutputCollector output,
> > > > > >                 Reporter reporter) throws IOException {
> > > > > >     Text valueText = (Text) value;
> > > > > >     String valueString = new String(valueText.getBytes(), "UTF-8");
> > > > > >     String keyString = getXmlKey(valueString);
> > > > > >     Text returnKeyText = new Text();
> > > > > >     Text returnValueText = new Text();
> > > > > >     returnKeyText.set(keyString);
> > > > > >     returnValueText.set(valueString);
> > > > > >     output.collect(returnKeyText, returnValueText);
> > > > > > }
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Peter Minearo [mailto:[EMAIL PROTECTED]]
> > > > > > Sent: Fri 7/16/2010 2