Hadoop, mail # user - Hadoop scripting when to use dfs -put
Re: Hadoop scripting when to use dfs -put
Håvard Wahl Kongsgård 2012-02-13, 21:40
My environment's heap size varies from 18GB down to 2GB;
in mapred-site.xml, mapred.child.java.opts is set to -Xmx512M.
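(For reference, that corresponds to this property in mapred-site.xml; increasing the task heap means raising the -Xmx value here, and the -Xmx1024M figure below is just an example:)

  <property>
    <name>mapred.child.java.opts</name>
    <!-- currently -Xmx512M; raising the heap would mean e.g. -Xmx1024M -->
    <value>-Xmx512M</value>
  </property>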

System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of Hadoop.
This is the log from the tasklog:
Original exception was:
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)
I don't have any recursion or a loop like a while loop in the script.

My dumbo code is below.

multi_tree() is just a simple function whose error handling is a bare

try:
    ...
except:
    pass

(a rough sketch of such a function follows the mapper code below)

def mapper(key, value):
    v = value.split(" ")[0]
    yield multi_tree(v), 1

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
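For illustration only -- the real body of multi_tree() isn't shown above, and process_value() below is just a hypothetical placeholder -- its shape is roughly:

def multi_tree(v):
    try:
        # hypothetical placeholder for whatever the real function computes
        return process_value(v)
    except:
        # any error is silently swallowed; the mapper then yields (None, 1)
        pass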
-Håvard
On Mon, Feb 13, 2012 at 8:52 PM, Rohit <[EMAIL PROTECTED]> wrote:
> Hi,
>
> What threw the heap error? Was it the Java VM, or the shell environment?
>
> It would be good to look at the free RAM on your system before and after you run the script as well, to see if the system is running low on memory.
>
> Are you using a recursive loop in your script?
>
> Thanks,
> Rohit
>
> Rohit Bakhshi
> www.hortonworks.com (http://www.hortonworks.com/)
>
> On Monday, February 13, 2012 at 10:39 AM, Håvard Wahl Kongsgård wrote:
>
>> Hi, I originally posted this on the dumbo forum, but it's more of a
>> general Hadoop scripting issue.
>>
>> When testing a simple script that creates some local files
>> and then copies them to HDFS with
>> os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json"),
>>
>> the tasks fail with an out-of-heap-memory error. The files are tiny, and
>> I have tried increasing the heap size. When I skip the hadoop dfs -put,
>> the tasks do not fail.
>>
>> Is it wrong to call hadoop dfs -put from inside a script that is run with
>> hadoop? Should I instead transfer the files at the end with a combiner,
>> or simply mount HDFS locally and write to it directly? Any general
>> suggestions?
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> NTNU
>>
>> http://havard.security-review.net/
>

--
Håvard Wahl Kongsgård
NTNU

http://havard.security-review.net/