Re: Hadoop and XML
I also added Peter's comment to the JIRA I logged:
https://issues.apache.org/jira/browse/HADOOP-6868
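
To illustrate the behaviour discussed in the quoted thread below, here is a minimal sketch. It does not use Hadoop itself: ReusableText is a hypothetical stand-in that only mimics what the thread describes for Text (a backing array that grows as needed but is never shrunk, plus a separate length field). The class and method names are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

class ReusedBufferDemo {

    /** Hypothetical stand-in for a reused Text-like object; not the Hadoop class. */
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        void set(String s) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > bytes.length) {
                // Grow when needed; old contents are not kept (keepData == false).
                bytes = new byte[utf8.length];
            }
            // Overwrite the front of the buffer; bytes past utf8.length stay as they were.
            System.arraycopy(utf8, 0, bytes, 0, utf8.length);
            length = utf8.length;
        }

        byte[] getBytes() { return bytes; }   // raw buffer, may contain stale bytes
        int getLength()   { return length; }  // number of valid bytes
    }

    /** The problematic call: decodes the whole backing array. */
    static String decodeWholeBuffer(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    /** The corrected call: decodes only the first getLength() bytes. */
    static String decodeValidPrefix(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText t = new ReusableText();
        t.set("hello world"); // 11 bytes
        t.set("bye");         // 3 bytes written over the same array
        System.out.println(decodeWholeBuffer(t)); // includes leftover bytes
        System.out.println(decodeValidPrefix(t)); // only the valid bytes
    }
}
```

With a real Text value the same idea applies: pass 0 and getLength() to the String constructor, as in the corrected call quoted below.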

On Tue, Jul 20, 2010 at 9:38 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> So the correct call should be:
> String valueString = new String(valueText.getBytes(), 0,
> valueText.getLength(), "UTF-8");
>
> Cheers
>
>
> On Tue, Jul 20, 2010 at 9:23 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:
>
>> data.length is the length of the byte array.
>>
>> Text.getLength() most likely returns a different value than
>> getBytes.length.
>>
>> Hadoop reuses box class objects like Text, so what it's probably doing is
>> writing over the byte array, lengthening it as necessary, and just
>> updating a separate length attribute.
>>
>> Jeff
>>
>> On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>> > Interesting.
>> > String class is able to handle this scenario:
>> >
>> >     public String(byte[] data, String encoding) throws
>> > UnsupportedEncodingException {
>> >         this(data, 0, data.length, encoding);
>> >     }
>> >
>> >
>> >
>> > On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:
>> >
>> > > I think the problem is here:
>> > >
>> > > String valueString = new String(valueText.getBytes(), "UTF-8");
>> > >
>> > > Javadoc for Text says:
>> > >
>> > > getBytes()
>> > > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
>> > >          Returns the raw bytes; however, only data up to getLength()
>> > > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29>
>> > > is valid.
>> > >
>> > > So try getting the length, truncating the byte array at the value
>> > > returned by getLength() and THEN converting it to a String.
>> > >
>> > > Jeff
>> > >
>> > > On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > Regarding your initial question about Text.set():
>> > > > Text.setCapacity() allocates a new byte array. Since keepData is
>> > > > false, the old data is not copied over.
>> > > >
>> > > > On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
>> > > > [EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > I am already using XmlInputFormat.  The input into the Map phase
>> > > > > is not the problem.  The problem lies between the Map and Reduce
>> > > > > phases.
>> > > > >
>> > > > > BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
>> > > > > XmlInputFormat is a lot faster.  In my testing,
>> > > > > StreamXmlRecordReader took 8 minutes to read a 1 GB XML document,
>> > > > > whereas XmlInputFormat took under 2 minutes. (Using 2-core, 8 GB
>> > > > > machines)
>> > > > >
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Ted Yu [mailto:[EMAIL PROTECTED]]
>> > > > > Sent: Friday, July 16, 2010 9:44 PM
>> > > > > To: [EMAIL PROTECTED]
>> > > > > Subject: Re: Hadoop and XML
>> > > > >
>> > > > > From an earlier post:
>> > > > >
>> > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>> > > > >
>> > > > > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
>> > > > > [EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > Moving the variable to a local variable did not seem to work:
>> > > > > >
>> > > > > >
>> > > > > > </PrivateRateSet>vateRateSet>
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > public void map(Object key, Object value, OutputCollector output,
>> > > > > > Reporter reporter) throws IOException {
>> > > > > >                Text valueText = (Text)value;
>> > > > > >                String valueString = new String(valueText.getBytes(), "UTF-8");
>> > > > > >                String keyString = getXmlKey(valueString);
>> > > > > >                Text returnKeyText = new Text();
>> > > > > >                Text returnValueText = new Text();
>> > > > > >                returnKeyText.set(keyString);