MapReduce >> mail # user >> How to import custom Python module in MapReduce job?


Andrei 2013-08-12, 06:50
Binglin Chang 2013-08-12, 08:33
Re: How to import custom Python module in MapReduce job?
Maybe you didn't specify a symlink name in your cmd line, so the symlink name
will be just lib.jar, and I am not sure how you import the lib module in your
main.py file.
Please try this:
put main.py and lib.py in the same archive file, e.g. app.zip
-archives hdfs://hdfs-namenode/user/me/app.zip#app -mapper "app/main.py
map" -reducer "app/main.py reduce"
in main.py:
import app.lib
or:
from app import lib
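The packaging step suggested above can be sketched in Python itself. A minimal sketch, assuming only the main.py / lib.py names from the thread; the temp-dir scaffolding is illustrative:

```python
# Sketch: bundle main.py and lib.py at the root of app.zip, so that
# "-archives hdfs://hdfs-namenode/user/me/app.zip#app" unpacks them on
# each node as app/main.py and app/lib.py, making "import app.lib" work.
import os
import tempfile
import zipfile

src = tempfile.mkdtemp()  # stands in for your local project directory
for name in ("main.py", "lib.py"):
    with open(os.path.join(src, name), "w") as f:
        f.write("# placeholder\n")

out = os.path.join(src, "app.zip")
with zipfile.ZipFile(out, "w") as zf:
    for name in ("main.py", "lib.py"):
        # arcname keeps the entries at the archive root
        zf.write(os.path.join(src, name), arcname=name)

print(sorted(zipfile.ZipFile(out).namelist()))
```

The zip would then be uploaded with `hadoop fs -put app.zip` and referenced via `-archives ...#app` as in the command above.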
On Mon, Aug 12, 2013 at 6:01 PM, Andrei <[EMAIL PROTECTED]> wrote:

> Hi Binglin,
>
> thanks for your explanation, now it makes sense. However, I'm not sure how
> to implement the suggested method.
>
> First of all, I found out that the `-cacheArchive` option is deprecated, so I
> had to use `-archives` instead. I put my `lib.py` into directory `lib` and
> then zipped it to `lib.zip`. After that I uploaded archive to HDFS and
>  linked it in call to Streaming API as follows:
>
>   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar  -files
> main.py -archives hdfs://hdfs-namenode/user/me/lib.jar -mapper
> "./main.py map" -reducer "./main.py reduce" -combiner "./main.py combine"
> -input input -output output
>
> But script failed, and from logs I see that lib.jar hasn't been unpacked.
> What am I missing?
>
>
>
>
> On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> The problem seems to be caused by symlinks: hadoop uses a file cache, so
>> every file is in fact a symlink.
>>
>> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
>> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
>> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
>> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
>> [root@master01 tmp]# ./main.py
>> Traceback (most recent call last):
>>   File "./main.py", line 3, in ?
>>     import lib
>> ImportError: No module named lib
>>
>> This may be a Python bug: when importing, it can't handle the symlink.
>>
>> You can try to use a directory containing lib.py and use -cacheArchive,
>> so the symlink actually links to a directory, python may handle this case
>> well.
>>
>> Thanks,
>> Binglin
>>
>>
>>
>> On Mon, Aug 12, 2013 at 2:50 PM, Andrei <[EMAIL PROTECTED]> wrote:
>>
>>> (cross-posted from StackOverflow:
>>> http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208 )
>>>
>>> I have a MapReduce job defined in file main.py, which imports module
>>> lib from file lib.py. I use Hadoop Streaming to submit this job to the
>>> Hadoop cluster as follows:
>>>
>>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>>
>>>     -files lib.py,main.py
>>>     -mapper "./main.py map" -reducer "./main.py reduce"
>>>     -input input -output output
>>>
>>> In my understanding, this should put both main.py and lib.py into the
>>> distributed cache folder on each computing machine and thus make module
>>> lib available to main. But it doesn't happen - from the log file I see
>>> that the files really are copied to the same directory, but main can't
>>> import lib, throwing ImportError.
>>>
>>> Adding current directory to the path didn't work:
>>>
>>> import sys, os
>>> sys.path.append(os.path.realpath(__file__))
>>> import lib  # ImportError
>>>
>>> though, loading module manually did the trick:
>>>
>>> import imp
>>> lib = imp.load_source('lib', 'lib.py')
>>>
>>> But that's not what I want. So why can the Python interpreter see other
>>> .py files in the same directory, but can't import them? Note, I have
>>> already tried adding an empty __init__.py file to the same directory,
>>> without effect.
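For what it's worth, the `imp.load_source` trick quoted above is tied to the (since deprecated) `imp` module; a rough modern equivalent using `importlib` would look like this. A sketch only - the 'lib'/'lib.py' names follow the thread, everything else is assumed:

```python
# Sketch: load a module from an explicit file path with importlib,
# equivalent in spirit to imp.load_source('lib', 'lib.py').
import importlib.util

def load_source(name, path):
    """Load the module at `path` and return it under the name `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module's code
    return module

# usage, assuming lib.py sits next to the running script:
# lib = load_source("lib", "lib.py")
```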
>>>
>>>
>>>
>>
>
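Binglin's suggestion - ship a directory so the cache symlink points at a directory rather than at a single .py file - can be checked locally. A minimal sketch (POSIX only; the app/lib.py layout follows the thread, the temp-dir scaffolding and the VALUE constant are assumptions):

```python
# Sketch: simulate what "-archives ...#app" produces on a node: a
# symlink "app" pointing at the unpacked archive directory containing
# lib.py, then check that "import app.lib" resolves through the symlink
# (on Python 3 the directory acts as a namespace package).
import os
import sys
import tempfile

work = tempfile.mkdtemp()
real = os.path.join(work, "unpacked")        # stands in for the unpacked archive
os.mkdir(real)
with open(os.path.join(real, "lib.py"), "w") as f:
    f.write("VALUE = 42\n")
os.symlink(real, os.path.join(work, "app"))  # the "#app" symlink

sys.path.insert(0, work)  # the task's working dir is on sys.path in practice
import app.lib

print(app.lib.VALUE)
```

If this works locally but the streaming job still fails, the problem is more likely the archive name / symlink name mismatch discussed above than the symlink itself.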