Re: Hadoop and XML
I also added Peter's comment to the JIRA I logged:
https://issues.apache.org/jira/browse/HADOOP-6868
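
For anyone finding this thread later, here is a minimal sketch of the behaviour discussed below. It does not use Hadoop itself: `ReusableBuffer` is a hypothetical stand-in that only imitates what the thread describes for Text (a backing array that is reused and a separate length field), and the sample strings are made up. It shows why decoding the whole backing array can leave stale bytes at the end of the string, and why limiting the decode to getLength() avoids it.

```java
import java.nio.charset.StandardCharsets;

public class ReusedBufferDemo {

    /** Hypothetical stand-in for a reusable object such as Hadoop's Text. */
    static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Overwrites the buffer; grows it only when the new value does not fit. */
        void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (bytes.length < src.length) {
                bytes = new byte[src.length]; // old data is not copied over
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length; // only bytes[0..length) are valid
        }

        byte[] getBytes() { return bytes; }

        int getLength() { return length; }
    }

    /** The pattern from the original map(): decodes the entire backing array. */
    static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    /** The corrected pattern: decodes only the valid prefix. */
    static String decodeValidPrefix(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set("<PrivateRateSet>long</PrivateRateSet>"); // first, longer record
        buf.set("<a>short</a>");                          // second, shorter record

        // Contains the short record followed by leftovers of the long one.
        System.out.println(decodeWholeArray(buf));
        // Contains only the short record.
        System.out.println(decodeValidPrefix(buf));
    }
}
```

With the real Text class, the equivalent of `decodeValidPrefix` is the call Ted gives below; calling `toString()` on the Text should also work, since it decodes only the valid bytes.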

On Tue, Jul 20, 2010 at 9:38 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> So the correct call should be:
> String valueString = new String(valueText.getBytes(), 0,
> valueText.getLength(), "UTF-8");
>
> Cheers
>
>
> On Tue, Jul 20, 2010 at 9:23 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:
>
>> data.length is the length of the byte array.
>>
>> Text.getLength() most likely returns a different value than
>> getBytes.length.
>>
>> Hadoop reuses box class objects like Text, so what it's probably doing is
>> writing over the byte array, lengthening it as necessary, and just updating
>> a separate length attribute.
>>
>> Jeff
>>
>> On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>> > Interesting.
>> > String class is able to handle this scenario:
>> >
>> >     public String(byte[] data, String encoding)
>> >             throws UnsupportedEncodingException {
>> >         this(data, 0, data.length, encoding);
>> >     }
>> >
>> >
>> >
>> > On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[EMAIL PROTECTED]> wrote:
>> >
>> > > I think the problem is here:
>> > >
>> > > String valueString = new String(valueText.getBytes(), "UTF-8");
>> > >
>> > > Javadoc for Text says:
>> > >
>> > > getBytes()
>> > > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
>> > >          Returns the raw bytes; however, only data up to getLength()
>> > > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29>
>> > >          is valid.
>> > >
>> > > So try getting the length, truncating the byte array at the value
>> > > returned by getLength() and THEN converting it to a String.
>> > >
>> > > Jeff
>> > >
>> > > On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > For your initial question on Text.set():
>> > > > Text.setCapacity() allocates a new byte array. Since keepData is
>> > > > false, old data wouldn't be copied over.
>> > > >
>> > > > On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
>> > > > [EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > I am already using XmlInputFormat.  The input into the Map phase
>> > > > > is not the problem.  The problem lies in between the Map and
>> > > > > Reduce phases.
>> > > > >
>> > > > > BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
>> > > > > XmlInputFormat is a lot faster.  From my testing,
>> > > > > StreamXmlRecordReader took 8 minutes to read a 1 GB XML document,
>> > > > > whereas XmlInputFormat was under 2 minutes. (Using 2 Core, 8GB
>> > > > > machines)
>> > > > >
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Ted Yu [mailto:[EMAIL PROTECTED]]
>> > > > > Sent: Friday, July 16, 2010 9:44 PM
>> > > > > To: [EMAIL PROTECTED]
>> > > > > Subject: Re: Hadoop and XML
>> > > > >
>> > > > > From an earlier post:
>> > > > >
>> > > > > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>> > > > >
>> > > > > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
>> > > > > [EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > Moving the variable to a local variable did not seem to work:
>> > > > > >
>> > > > > >
>> > > > > > </PrivateRateSet>vateRateSet>
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > public void map(Object key, Object value, OutputCollector output,
>> > > > > >                 Reporter reporter) throws IOException {
>> > > > > >     Text valueText = (Text)value;
>> > > > > >     String valueString = new String(valueText.getBytes(), "UTF-8");
>> > > > > >     String keyString = getXmlKey(valueString);
>> > > > > >     Text returnKeyText = new Text();
>> > > > > >     Text returnValueText = new Text();
>> > > > > >     returnKeyText.set(keyString);