|
Peter Minearo
2010-07-16, 21:00
Soumya Banerjee
2010-07-16, 21:29
Peter Minearo
2010-07-16, 21:44
Peter Minearo
2010-07-16, 21:51
Peter Minearo
2010-07-16, 22:07
Ted Yu
2010-07-17, 04:43
Peter Minearo
2010-07-19, 15:01
Ted Yu
2010-07-19, 16:08
Jeff Bean
2010-07-20, 13:01
Ted Yu
2010-07-20, 15:56
Jeff Bean
2010-07-20, 16:23
Ted Yu
2010-07-20, 16:38
Ted Yu
2010-07-20, 16:50
Peter Minearo
2010-07-20, 16:35
Scott Carey
2010-07-20, 18:24
Scott Carey
2010-07-20, 18:29
|
-
Hadoop and XML
Peter Minearo 2010-07-16, 21:00
I have an XML file that has sparse data in it. I am running a MapReduce job that reads in an XML file, pulls out a key from within the XML snippet, and then hands back the key and the XML snippet (as the value) to the OutputCollector. The reason is to sort the file back into order. Below is the snippet of code.

    public class XmlMapper extends MapReduceBase implements Mapper {

        private Text keyText = new Text();
        private Text valueText = new Text();

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                Reporter reporter) throws IOException {
            Text valueText = (Text)value;
            String valueString = new String(valueText.getBytes(), "UTF-8");
            String keyString = getXmlKey(valueString);
            getKeyText().set(keyString);
            getValueText().set(valueString);
            output.collect(getKeyText(), getValueText());
        }

        public Text getKeyText() {
            return keyText;
        }

        public void setKeyText(Text keyText) {
            this.keyText = keyText;
        }

        public Text getValueText() {
            return valueText;
        }

        public void setValueText(Text valueText) {
            this.valueText = valueText;
        }

        private String getXmlKey(String value) {
            // Get the Key from the XML in the value.
        }

    }

The XML snippet from the value is fine when it is passed into the map() method. I am not changing any data either, just pulling out information for the key. The problem I am seeing is that between the Map phase and the Reduce phase, the XML is getting munged. For example:

    </PrivateRate>
    </PrivateRateSet>te>

It is my understanding that Hadoop uses the same instance of the key and value object when calling the map method. What changes is the data within those instances. So, I ran an experiment where I do not have different key or value Text objects. I reuse the ones passed into the method, like below:

    public class XmlMapper extends MapReduceBase implements Mapper {

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                Reporter reporter) throws IOException {
            Text keyText = (Text)key;
            Text valueText = (Text)value;
            String valueString = new String(valueText.getBytes(), "UTF-8");
            String keyString = getXmlKey(valueString);
            keyText.set(keyString);
            valueText.set(valueString);
            output.collect(keyText, valueText);
        }

        private String getXmlKey(String value) {
            // Get the Key from the XML in the value.
        }

    }

What was interesting about this is the fact that the XML was getting munged within the Map phase. When I changed over to the code at the top, the Map phase was fine. However, the Reduce phase picks up the munged XML. Trying to debug the problem, I came across this method in the Text object:

    public void set(byte[] utf8, int start, int len) {
        setCapacity(len, false);
        System.arraycopy(utf8, start, bytes, 0, len);
        this.length = len;
    }

If the "bytes" array had a length of 1000 and the "utf8" array had a length of 500, doing a System.arraycopy() would only copy the first 500 from "utf8" to "bytes" but leave the last 500 in "bytes" alone. Could this be the cause of the XML munging?

All of this leads me to a few questions:

1) Has anyone successfully used XML snippets as the data format within a MapReduce job; not just reading from the file but used during the shuffle?
2) Is anyone seeing this problem with XML or any other format?
3) Does anyone know what is going on?
4) Is this a bug?

Thanks,
Peter
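The arraycopy concern in this message can be tried out without a cluster. The sketch below does not use Hadoop's Text class; ReusableBuffer and StaleBufferDemo are hypothetical stand-ins, written under the assumption that the backing array is reallocated only when it is too small for the new value. Under that assumption, decoding the whole backing array after a shorter value follows a longer one picks up leftover bytes, much like the "te>" tail in the munged XML.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a reusable byte buffer. This is NOT Hadoop's Text. */
final class ReusableBuffer {
    private byte[] bytes = new byte[0];
    private int length = 0;

    /** Copies the UTF-8 bytes of s in; grows the array only if it is too small (assumption). */
    void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length < utf8.length) {
            bytes = new byte[utf8.length];
        }
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    /** Returns the whole backing array, which may be longer than the valid data. */
    byte[] getBytes() {
        return bytes;
    }

    /** Returns the number of valid bytes at the start of the backing array. */
    int getLength() {
        return length;
    }
}

final class StaleBufferDemo {
    /** Sets a long value, then a shorter one, then decodes the entire backing array. */
    static String decodeWholeBuffer(String first, String second) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set(first);
        buf.set(second);
        return new String(buf.getBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints "</Rate>teRate>": 7 fresh bytes followed by 7 stale ones
        System.out.println(decodeWholeBuffer("</PrivateRate>", "</Rate>"));
    }
}
```

If the real class behaves like this stand-in, the copy itself is not at fault; the stale tail only shows up when a reader ignores the valid length.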
-
Re: Hadoop and XML
Soumya Banerjee 2010-07-16, 21:29
Hi,

Can you please share the code of the job submission client?

Also, can you try creating the output key and value in the map method (method-local)? Make sure you are not using a multi-threaded map task configuration.

    map()
    {
        Text keyText = new Text();
        Text valueText = new Text();

        // rest of the code
    }

Soumya.
-
RE: Hadoop and XML
Peter Minearo 2010-07-16, 21:44
I am not using multi-threaded Map tasks. Also, if I understand your second question correctly:

"Also can you try creating the output key and values in the map method (method local)?"

In the first code snippet I am doing exactly that.

Below is the class that runs the job.

    public class HadoopJobClient {

        private static final Log LOGGER =
                LogFactory.getLog(Prds.class.getName());

        public static void main(String[] args) {
            JobConf conf = new JobConf(Prds.class);

            conf.set("xmlinput.start", "<PrivateRateSet>");
            conf.set("xmlinput.end", "</PrivateRateSet>");

            conf.setJobName("PRDS Parse");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            conf.setMapperClass(PrdsMapper.class);
            conf.setReducerClass(PrdsReducer.class);

            conf.setInputFormat(XmlInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Run the job
            try {
                JobClient.runJob(conf);
            } catch (IOException e) {
                LOGGER.error(e.getMessage(), e);
            }
        }

    }
-
RE: Hadoop and XML
Peter Minearo 2010-07-16, 21:51
Whoops... right after I sent it, and someone else made a suggestion, I realized what question 2 was about. I can try that, but wouldn't that cause object bloat? During the Hadoop training I went through, it was mentioned to reuse the returning key and value objects to keep the number of objects created down to a minimum. Is this not really a valid point?
-
RE: Hadoop and XML
Peter Minearo 2010-07-16, 22:07
Moving the variable to a local variable did not seem to work:

    </PrivateRateSet>vateRateSet>

    public void map(Object key, Object value, OutputCollector output,
            Reporter reporter) throws IOException {
        Text valueText = (Text)value;
        String valueString = new String(valueText.getBytes(), "UTF-8");
        String keyString = getXmlKey(valueString);
        Text returnKeyText = new Text();
        Text returnValueText = new Text();
        returnKeyText.set(keyString);
        returnValueText.set(valueString);
        output.collect(returnKeyText, returnValueText);
    }
-
Re: Hadoop and XML
Ted Yu 2010-07-17, 04:43
From an earlier post:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
-
RE: Hadoop and XML
Peter Minearo 2010-07-19, 15:01
I am already using XmlInputFormat. The input into the Map phase is not the problem. The problem lies between the Map and Reduce phases.

BTW - The article is correct. DO NOT USE StreamXmlRecordReader. XmlInputFormat is a lot faster. From my testing, StreamXmlRecordReader took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat took under 2 minutes. (Using 2-core, 8 GB machines.)
-
Re: Hadoop and XML
Ted Yu 2010-07-19, 16:08
For your initial question on Text.set():
Text.setCapacity() allocates a new byte array. Since keepData is false, old data wouldn't be copied over.
-
Re: Hadoop and XML
Jeff Bean 2010-07-20, 13:01
I think the problem is here:

    String valueString = new String(valueText.getBytes(), "UTF-8");

Javadoc for Text says:

    getBytes() <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
    Returns the raw bytes; however, only data up to getLength() <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29> is valid.

So try getting the length, truncating the byte array at the value returned by getLength(), and THEN converting it to a String.

Jeff
-
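A minimal sketch of the truncate-then-convert approach suggested in the message above, using only the JDK. The class `ValidBytes` and the method `decodeValidPrefix` are hypothetical names for illustration and are not part of Hadoop; `validLength` stands in for whatever `Text.getLength()` would report.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop.
final class ValidBytes {
    private ValidBytes() {}

    // buffer: backing array that may be longer than the current record.
    // validLength: number of meaningful bytes at the start of the array.
    static String decodeValidPrefix(byte[] buffer, int validLength) {
        byte[] exact = Arrays.copyOf(buffer, validLength); // truncate first
        return new String(exact, StandardCharsets.UTF_8);  // then convert
    }

    public static void main(String[] args) {
        // The first 8 bytes are the record; the rest simulates leftover data.
        byte[] buffer = "<a>1</a>STALE".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeValidPrefix(buffer, 8));
    }
}
```

The copy costs one extra allocation per record; it is shown here only because it mirrors the wording of the suggestion.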
Re: Hadoop and XML
Ted Yu 2010-07-20, 15:56

Interesting.
String class is able to handle this scenario:

    public String(byte[] data, String encoding) throws UnsupportedEncodingException {
        this(data, 0, data.length, encoding);
    }

Re: Hadoop and XML
Jeff Bean 2010-07-20, 16:23

data.length is the length of the byte array.

Text.getLength() most likely returns a different value than getBytes().length.

Hadoop reuses box class objects like Text, so what it's probably doing is writing over the byte array, lengthening it as necessary, and just updating a separate length attribute.

Jeff

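The behavior described above can be reproduced with a small stand-in class. This is a sketch under assumptions: `ReusableBuffer` is a hypothetical class written for illustration, not Hadoop's `Text`, and it only models "overwrite in place, grow when needed, track a separate length".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a reusable value object; NOT Hadoop's Text.
final class ReusableBuffer {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Overwrites the buffer in place and grows it only when the value does not fit.
    void set(String value) {
        byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > bytes.length) {
            bytes = new byte[encoded.length];
        }
        System.arraycopy(encoded, 0, bytes, 0, encoded.length);
        length = encoded.length;
    }

    byte[] getBytes() { return bytes; }  // whole backing array, stale tail included
    int getLength() { return length; }   // number of valid bytes

    public static void main(String[] args) {
        ReusableBuffer value = new ReusableBuffer();
        value.set("<PrivateRateSet><PrivateRate/></PrivateRateSet>");
        value.set("<PrivateRateSet>xy</PrivateRateSet>"); // shorter record, same object

        // Decoding the whole array drags along leftovers from the longer record.
        System.out.println(new String(value.getBytes(), StandardCharsets.UTF_8));
        // Decoding only the valid range returns the second record unchanged.
        System.out.println(new String(value.getBytes(), 0, value.getLength(),
                StandardCharsets.UTF_8));
    }
}
```

With these two inputs the first print ends in `</PrivateRateSet>vateRateSet>`, the same kind of trailing fragment reported earlier in the thread.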
Re: Hadoop and XML
Ted Yu 2010-07-20, 16:38

So the correct call should be:

    String valueString = new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");

Cheers

Re: Hadoop and XML
Ted Yu 2010-07-20, 16:50

I also added Peter's comment to the JIRA I logged:
https://issues.apache.org/jira/browse/HADOOP-6868

RE: Hadoop and XML
Peter Minearo 2010-07-20, 16:35

That is exactly what is happening. This is the code from the Text class:

    public void set(String string) {
        try {
            ByteBuffer bb = encode(string, true);
            bytes = bb.array();
            length = bb.limit();
        } catch (CharacterCodingException e) {
            throw new RuntimeException("Should not have happened " + e.toString());
        }
    }

This sounds like a bug.

Let's say you create a Text object and drop in a String that sets the byte array length to 200. Then drop in a second String that sets the byte array length to 500. Since the new length is greater than the previous length, the byte array length is reset to the longer length. Now, if you drop in a third String that would set the byte array length to 350, the Text object does not replace the byte array with a new one of length 350; it keeps the longer array of length 500 and sets an extra variable to track the "real" length.

So: Text.getBytes().length != Text.getLength()

This does 2 things:

1. Passes around more data than what is needed
2. Makes the Text object confusing to work with

Text.getBytes().length == Text.getLength() should be the correct behavior.

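The quoted set() takes the array from `bb.array()` and the length from `bb.limit()`, and those two values need not agree. The following sketch uses only the JDK to show that a ByteBuffer returned by a charset encoder can have a backing array at least as large as its limit. Whether `Text.encode()` uses this exact JDK call is an assumption, and `EncodeCapacity` is a hypothetical class name; the exact capacity the encoder picks may vary between JDK versions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses only standard JDK APIs.
final class EncodeCapacity {
    private EncodeCapacity() {}

    private static ByteBuffer encode(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns {backing array length, number of valid bytes}.
    static int[] capacityAndLimit(String s) {
        ByteBuffer bb = encode(s);
        return new int[] { bb.array().length, bb.limit() };
    }

    // Decodes only the valid range of the backing array.
    static String decodeValid(String s) {
        ByteBuffer bb = encode(s);
        return new String(bb.array(), bb.arrayOffset(), bb.limit(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] r = capacityAndLimit("hello");
        System.out.println("array length = " + r[0] + ", limit = " + r[1]);
        System.out.println(decodeValid("hello"));
    }
}
```

If the array length printed is larger than the limit, decoding the whole array would include bytes that are not part of the string, which is why callers need the separate length.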
Re: Hadoop and XML
Scott Carey 2010-07-20, 18:24

> This sounds like a bug.
> [...]
> Text.getBytes().length == Text.getLength() - should be the correct behavior.

I don't think so. Passing around byte arrays larger than the valid data is common practice in Java for performance reasons. Hence the common method signature containing (byte[] bytes, int offset, int len) and similar. Creating a new byte array for each resize defeats the purpose of re-using the byte array and the Text object: lower memory allocation and improved CPU cache locality. The byte array here is a buffer; it does not represent the entire string.

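The buffer-plus-length convention mentioned above also appears in the JDK's own I/O classes. As an illustrative sketch (the class `BufferIdiom` and method `readAll` are hypothetical names), the loop below reuses one byte array for every read and decodes only the `n` bytes that each call actually filled. ASCII is used to keep the sketch simple, since a multi-byte encoding could split a character across two reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses only standard JDK APIs.
final class BufferIdiom {
    private BufferIdiom() {}

    // Reads the stream through one reused buffer, decoding only the valid bytes.
    static String readAll(InputStream in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize]; // allocated once, reused for every read
        StringBuilder out = new StringBuilder();
        try {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                // Bytes past index n are leftovers from an earlier read; ignore them.
                out.append(new String(buffer, 0, n, StandardCharsets.US_ASCII));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readAll(new ByteArrayInputStream(data), 4));
    }
}
```

With a 4-byte buffer and 10 bytes of input, the last read fills only 2 bytes, so decoding the full array on that pass would repeat stale data from the previous read.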
Re: Hadoop and XML
Scott Carey 2010-07-20, 18:29

To be more specific here, shouldn't Text.toString() do the trick? If Text.toString() doesn't work and does something other than what you expect here, it should be documented, and that class should have another helper method that gets you a String from Text. Calling getBytes() and manually constructing a string means you should know what those bytes represent: a buffer where the bytes for the string are from index 0 to Text.getLength().
|