MapReduce, mail # user - How to import custom Python module in MapReduce job?


Re: How to import custom Python module in MapReduce job?
Binglin Chang 2013-08-12, 08:33
Hi,

The problem seems to be caused by symlinks. Hadoop uses a file cache, so every
file in the task directory is in fact a symlink:

lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py -> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py -> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
[root@master01 tmp]# ./main.py
Traceback (most recent call last):
  File "./main.py", line 3, in ?
    import lib
ImportError: No module named lib

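This looks like a Python problem rather than a Hadoop one: import does not
cope with these symlinks, since main.py and lib.py resolve to different
filecache directories (12 and 13 above).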
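
A small local reproduction may make the cause clearer. As far as I know,
CPython puts the script's directory on sys.path only after resolving symlinks,
so when each file is linked from its own cache directory, lib.py is no longer
next to the resolved main.py. The sketch below imitates that layout in a
temporary directory; the names cache12, cache13 and work are made up for
illustration, and nothing here talks to Hadoop. It also tries one possible
workaround, adding the working directory (where the symlinks live) to sys.path.

```python
import os
import subprocess
import sys
import tempfile
import textwrap


def run_in_fake_filecache(main_source):
    """Run main.py via a symlink while lib.py is symlinked from another directory.

    Returns (exit_code, stdout, stderr) of the child interpreter.
    """
    root = tempfile.mkdtemp()
    # Hypothetical stand-ins for .../filecache/12, .../filecache/13 and the task dir.
    cache_main = os.path.join(root, "cache12")
    cache_lib = os.path.join(root, "cache13")
    work = os.path.join(root, "work")
    for directory in (cache_main, cache_lib, work):
        os.mkdir(directory)

    with open(os.path.join(cache_main, "main.py"), "w") as f:
        f.write(textwrap.dedent(main_source))
    with open(os.path.join(cache_lib, "lib.py"), "w") as f:
        f.write("VALUE = 42\n")

    # The working directory holds only symlinks, as in the listing above.
    os.symlink(os.path.join(cache_main, "main.py"), os.path.join(work, "main.py"))
    os.symlink(os.path.join(cache_lib, "lib.py"), os.path.join(work, "lib.py"))

    result = subprocess.run(
        [sys.executable, "main.py"],
        cwd=work, capture_output=True, text=True,
    )
    return result.returncode, result.stdout, result.stderr


# Plain import: expected to fail, because sys.path[0] is the resolved cache12 dir.
PLAIN_IMPORT = """
    import lib
    print(lib.VALUE)
"""

# Workaround sketch: put the directory holding the symlinks on sys.path first.
IMPORT_WITH_CWD = """
    import os, sys
    sys.path.insert(0, os.getcwd())
    import lib
    print(lib.VALUE)
"""

if __name__ == "__main__":
    print(run_in_fake_filecache(PLAIN_IMPORT))
    print(run_in_fake_filecache(IMPORT_WITH_CWD))
```

If the same holds on the cluster, adding the task's working directory to
sys.path at the top of main.py might be enough, but I have not verified that
on a real job.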

You can try packing a directory containing lib.py into an archive and shipping
it with -cacheArchive, so that the symlink points to a directory rather than to
a single file. Python may handle that case better.
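
To illustrate the directory idea locally, here is a minimal sketch. It assumes
that main.py and lib.py are shipped together in the same directory (my
assumption; only lib.py is mentioned above), so that the symlink points at the
directory and both files are regular files inside it. The names unpacked and
app are invented for the example, and the script only fakes the layout on the
local disk without running Hadoop.

```python
import os
import subprocess
import sys
import tempfile


def run_from_linked_directory():
    """Run app/main.py where 'app' is a symlink to a directory holding both files.

    Returns (exit_code, stripped_stdout) of the child interpreter.
    """
    root = tempfile.mkdtemp()
    unpacked = os.path.join(root, "unpacked")  # stands in for the unpacked archive
    work = os.path.join(root, "work")          # stands in for the task directory
    os.mkdir(unpacked)
    os.mkdir(work)

    with open(os.path.join(unpacked, "lib.py"), "w") as f:
        f.write("VALUE = 42\n")
    with open(os.path.join(unpacked, "main.py"), "w") as f:
        f.write("import lib\nprint(lib.VALUE)\n")

    # Only the directory is a symlink; the .py files inside it are regular files.
    os.symlink(unpacked, os.path.join(work, "app"))

    result = subprocess.run(
        [sys.executable, os.path.join("app", "main.py")],
        cwd=work, capture_output=True, text=True,
    )
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    print(run_from_linked_directory())
```

Here the import works because the script's directory contains lib.py whether
or not the symlink is resolved.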

Thanks,
Binglin

On Mon, Aug 12, 2013 at 2:50 PM, Andrei <[EMAIL PROTECTED]> wrote:

> (cross-posted from StackOverflow: <http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>)
>
> I have a MapReduce job defined in file *main.py*, which imports module lib from
> file *lib.py*. I use Hadoop Streaming to submit this job to a Hadoop
> cluster as follows:
>
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
>     -files lib.py,main.py \
>     -mapper "./main.py map" -reducer "./main.py reduce" \
>     -input input -output output
>
> As I understand it, this should put both main.py and lib.py into the *distributed
> cache folder* on each computing machine and thus make module lib available
> to main. But that doesn't happen: from the log file I can see that the files *are
> really copied* to the same directory, but main can't import lib and throws
> *ImportError*.
>
> Adding current directory to the path didn't work:
>
> import os
> import sys
> sys.path.append(os.path.realpath(__file__))
> import lib
> # ImportError
>
> Loading the module manually did the trick, though:
>
> import imp
> lib = imp.load_source('lib', 'lib.py')
>
> But that's not what I want. So why can the Python interpreter see other .py files
> in the same directory, but not import them? Note that I have already tried
> adding an empty __init__.py file to the same directory, without effect.
>
>
>