Håvard Wahl Kongsgård
2012-02-13, 18:39
Rohit
2012-02-13, 19:52
Håvard Wahl Kongsgård
2012-02-13, 21:40
Harsh J
2012-02-14, 14:01
Håvard Wahl Kongsgård
2012-02-15, 12:13
-
Hadoop scripting when to use dfs -put
Håvard Wahl Kongsgård 2012-02-13, 18:39
Hi, I originally posted this on the dumbo forum, but it's more of a general Hadoop scripting issue.

When testing a simple script that creates some local files and then copies them to HDFS with

    os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json")

the tasks fail with out of heap memory. The files are tiny, and I have tried increasing the heap size. When I skip the hadoop dfs -put, the tasks do not fail.

Is it wrong to use hadoop dfs -put inside a script that is itself run by Hadoop? Should I instead transfer the files at the end with a combiner, or simply mount HDFS locally and write directly to it? Any general suggestions?

--
Håvard Wahl Kongsgård
NTNU
http://havard.security-review.net/
-
Re: Hadoop scripting when to use dfs -put
Rohit 2012-02-13, 19:52
Hi,

What threw the heap error? Was it the Java VM, or the shell environment?

It would be good to look at the free RAM on your system before and after you run the script as well, to see if your system is running low on memory.

Are you using a recursive loop in your script?

Thanks,
Rohit

Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)
-
Re: Hadoop scripting when to use dfs -put
Håvard Wahl Kongsgård 2012-02-13, 21:40
My environment heap size varies from 18GB to 2GB.
In mapred-site.xml: mapred.child.java.opts = -Xmx512M

System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of Hadoop

This is the log from the tasklog:

Original exception was:
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
    at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
    at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)

I don't have a recursive loop like while or anything else.

My dumbo code:

multi_tree() is just a simple function where the error handling is

    try:
        ...
    except:
        pass

    def mapper(key, value):
        v = value.split(" ")[0]
        yield multi_tree(v),1

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper)

-Håvard
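The stack trace above fails in TypedBytesInput.readRawBytes on the thread that reads the script's output (PipeMapRed$MROutputThread). One possible reading, which is an assumption and not something confirmed in this thread, is that a command started with os.system inherits the script's stdout, so anything the hadoop client prints lands in the stream that the streaming job is parsing. A minimal Python sketch of running the same command with its output captured instead follows; run_quietly and put_command are hypothetical helper names, not part of dumbo or Hadoop.

```python
import subprocess
import sys


def put_command(local_path, hdfs_path):
    # Same command as in the thread, built as an argument list (no shell).
    return ["hadoop", "dfs", "-put", local_path, hdfs_path]


def run_quietly(cmd):
    """Run cmd without letting it write to this process's stdout.

    Returns (exit_code, captured_stdout, captured_stderr).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if err:
        # Forward the child's stderr to our own stderr, never to stdout.
        sys.stderr.write(err.decode("utf-8", "replace"))
    return proc.returncode, out, err
```

Inside a mapper this could be called as run_quietly(put_command("/home/havard/bio_sci/file.json", "/tmp/bio_sci/file.json")); the returned exit code then shows whether the copy worked, and nothing the client prints reaches the mapper's stdout.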
-
Re: Hadoop scripting when to use dfs -put
Harsh J 2012-02-14, 14:01
For the sake of http://xkcd.com/979/, and since this was cross-posted, Håvard managed to solve this specific issue via Joey's response at https://groups.google.com/a/cloudera.org/group/cdh-user/msg/c55760868efa32e2

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
-
Re: Hadoop scripting when to use dfs -put
Håvard Wahl Kongsgård 2012-02-15, 12:13
Sorry for cross-posting again. There is still something strange with the dfs client and Python. With the very simple code below, I get no errors, but no output in /tmp/bio_sci/.

I could use FUSE, but this issue should be of general interest to Hadoop/Python users. Can anyone replicate this?

    import os

    def multi_tree(value):
        os.system("hadoop dfs -touchz /tmp/bio_sci/"+str(value)+" > /dev/null 2> /dev/null")

    def mapper(key, value):
        v = value.split(" ")[0]
        yield multi_tree(v),1

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper)

-Håvard
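Because the command in the message above sends both stdout and stderr to /dev/null and the return value of os.system is ignored, a failing hadoop dfs -touchz would leave no trace. The following Python sketch shows one way to make such failures visible; it is only an illustration under assumptions, not a confirmed fix for this thread. The names touch_command and run_and_report, and the extra run parameter on mapper, are hypothetical.

```python
import subprocess
import sys


def touch_command(value, base_dir="/tmp/bio_sci/"):
    # Mirrors the command from the message, as an argument list (no shell).
    return ["hadoop", "dfs", "-touchz", base_dir + str(value)]


def run_and_report(cmd):
    """Run cmd with its output kept off stdout; return True on exit code 0."""
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        # For example, the executable is not on the PATH of the task.
        sys.stderr.write("could not start %r: %s\n" % (cmd, exc))
        return False
    _, err = proc.communicate()
    if proc.returncode != 0:
        sys.stderr.write("%r exited with %d: %s\n"
                         % (cmd, proc.returncode, err.decode("utf-8", "replace")))
        return False
    return True


def mapper(key, value, run=run_and_report):
    v = value.split(" ")[0]
    ok = run(touch_command(v))
    # Emit the value and an explicit status instead of a None key.
    yield v, ("ok" if ok else "failed")
```

With this shape, a job whose commands fail produces "failed" records in its output, and the error text written to stderr can be checked afterwards, assuming the cluster keeps stderr from streaming tasks.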