Hadoop, mail # user - Hadoop scripting when to use dfs -put


Re: Hadoop scripting when to use dfs -put
Håvard Wahl Kongsgård 2012-02-15, 12:13
Sorry for cross-posting again. There is still something strange with
the dfs client and Python: with the very simple code below, I get no
errors, but also no output in /tmp/bio_sci/.

I could use FUSE instead, but this issue should be of general interest
to Hadoop/Python users. Can anyone replicate it?

import os

def multi_tree(value):
    # Create an empty file in HDFS for this value, discarding all output.
    os.system("hadoop dfs -touchz /tmp/bio_sci/" + str(value)
              + " > /dev/null 2> /dev/null")

def mapper(key, value):
    v = value.split(" ")[0]
    yield multi_tree(v), 1  # multi_tree() returns None, so the key is None

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
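One way to narrow this down is to stop throwing away the command's output. Below is a minimal diagnostic sketch, not from the thread: the multi_tree_checked name, the subprocess wrapper, and the stderr reporting are illustrative additions around the same touchz call.

import subprocess
import sys

def multi_tree_checked(value):
    # Same touchz call as above, but keep the exit code and stderr
    # instead of piping everything to /dev/null.
    cmd = ["hadoop", "dfs", "-touchz", "/tmp/bio_sci/" + str(value)]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # A streaming task's stderr ends up in the task logs, so the
        # failure becomes visible there instead of vanishing.
        sys.stderr.write("touchz failed (%d): %s\n"
                         % (proc.returncode, err))
    return proc.returncode

Capturing stdout also keeps the child command's output out of the mapper's own stdout, which Hadoop Streaming reads as task output. Note that each call still forks a full hadoop client JVM per record, which is expensive inside a mapper; the point here is only to make HDFS-side failures visible.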

-Håvard
On Tue, Feb 14, 2012 at 3:01 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> For the sake of http://xkcd.com/979/, and since this was cross-posted,
> Håvard managed to solve this specific issue via Joey's response at
> https://groups.google.com/a/cloudera.org/group/cdh-user/msg/c55760868efa32e2
>
> 2012/2/14 Håvard Wahl Kongsgård <[EMAIL PROTECTED]>:
>> The heap size in my environment varies from 18GB to 2GB;
>> in mapred-site.xml, mapred.child.java.opts = -Xmx512M.
>>
>> System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of Hadoop.
>>
>>
>> This is from the task log:
>> Original exception was:
>> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
>>        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
>>        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>>        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:264)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
>>        at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
>>        at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
>>        at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)
>>
>>
>> I don't have a recursive loop, a while loop, or anything like that.
>>
>> My dumbo code is below; multi_tree() is just a simple function
>> where the error handling is (see the sketch after the mapper code):
>>
>> try:
>>     ...
>> except:
>>     pass
>>
>> def mapper(key, value):
>>   v = value.split(" ")[0]
>>   yield multi_tree(v),1
>>
>>
>> if __name__ == "__main__":
>>   import dumbo
>>   dumbo.run(mapper)
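For reference, a minimal sketch of what that helper might look like, assuming its body matches the touchz call shown in the newer message at the top of this thread; this mail itself only describes "a simple function" with a bare try/except.

import os

def multi_tree(value):
    # Body assumed from the newer message at the top of this thread;
    # only the bare try/except wrapper is described in this mail.
    try:
        os.system("hadoop dfs -touchz /tmp/bio_sci/" + str(value)
                  + " > /dev/null 2> /dev/null")
    except:
        pass  # swallow everything, as described

Note that a bare except around os.system will rarely fire: os.system reports command failure through its return value, not an exception, so the pass branch would not hide a failing hadoop command.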
>>
>>
>> -Håvard
>>
>>
>> On Mon, Feb 13, 2012 at 8:52 PM, Rohit <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> What threw the heap error? Was it the Java VM, or the shell environment?
>>>
>>> It would also be good to look at free RAM on your system before and after running the script, to see if the system is running low on memory (a rough sketch of one way to check follows below).
>>>
>>> Are you using a recursive loop in your script?
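A rough sketch of one way to take those before/after readings; the free_memory_kb helper is illustrative and reads MemFree from /proc/meminfo, so it assumes a Linux box like the Ubuntu system described above.

def free_memory_kb():
    # Read MemFree from /proc/meminfo (Linux-only).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])  # value is reported in kB
    return None

before = free_memory_kb()
# ... run the script / job here ...
after = free_memory_kb()
print("MemFree: %s kB before, %s kB after" % (before, after))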
>>>
>>> Thanks,
>>> Rohit
>>>
>>> Rohit Bakhshi
>>> www.hortonworks.com
>>>
>>> On Monday, February 13, 2012 at 10:39 AM, Håvard Wahl Kongsgård wrote:
>>>
>>>> Hi, I originally posted this on the dumbo forum, but it's more of a
>>>> general Hadoop scripting issue.
>>>>
>>>> When testing a simple script that created some local files
>>>> and then copied them to HDFS with
>>>> os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json")
>>>>
>>>> the tasks fail with an out-of-heap-memory error. The files are tiny, and I have

Håvard Wahl Kongsgård
NTNU

http://havard.security-review.net/