MapReduce >> mail # user >> How to import custom Python module in MapReduce job?


Re: How to import custom Python module in MapReduce job?
Hi,

The problem seems to be caused by symlinks: Hadoop uses a file cache, so
every file is in fact a symlink.

lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py -> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py -> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
[root@master01 tmp]# ./main.py
Traceback (most recent call last):
  File "./main.py", line 3, in ?
    import lib
ImportError: No module named lib

This looks like a Python issue: when using import, it can't handle the symlinks.
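
One possible explanation, offered as a sketch rather than a confirmed diagnosis: CPython follows the script's symlink when it computes sys.path[0], so with the listing above the import search would start in the file cache directory of main.py, where lib.py is absent. The snippet below mimics that layout in a temporary directory so the behaviour can be checked locally. The function name, the directory names, and the file contents are hypothetical, and it assumes Python 3 on a platform that supports symlinks.

```python
import os
import subprocess
import sys
import tempfile


def reproduce_symlink_import():
    """Mimic the file cache layout: each file lives in its own directory
    and the working directory holds only symlinks to those files."""
    root = tempfile.mkdtemp()
    work = os.path.join(root, "work")
    os.makedirs(work)
    # Hypothetical stand-ins for the real lib.py and main.py.
    sources = {
        "lib.py": "VALUE = 42\n",
        "main.py": "import sys\nprint(sys.path[0])\nimport lib\n",
    }
    for index, name in enumerate(sorted(sources)):
        cache_dir = os.path.join(root, "filecache", str(index))
        os.makedirs(cache_dir)
        target = os.path.join(cache_dir, name)
        with open(target, "w") as handle:
            handle.write(sources[name])
        os.symlink(target, os.path.join(work, name))
    # Run main.py through its symlink from the directory of symlinks.
    proc = subprocess.run(
        [sys.executable, "main.py"],
        cwd=work,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return work, proc


if __name__ == "__main__":
    work_dir, result = reproduce_symlink_import()
    print("working directory:", work_dir)
    print("sys.path[0] seen by main.py:", result.stdout.strip())
    print("exit code:", result.returncode)
```

If the printed sys.path[0] differs from the working directory, the import fails for the reason sketched above and not because lib.py is missing.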

You can try putting lib.py in a directory and shipping it with -cacheArchive,
so that the symlink points to a directory. Python may handle that case better.
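
Whether or not the archive route is taken, a complementary sketch is to put the directory that holds the symlinks on sys.path before importing. The helper name add_script_dir_to_path is hypothetical, and it assumes the task runs main.py from the directory that contains the links. It has not been tested on a Hadoop cluster.

```python
import os
import sys


def add_script_dir_to_path(script_path=None):
    """Prepend the directory that holds the script's symlink to sys.path.

    os.path.abspath() normalises the path without following symlinks,
    unlike os.path.realpath(), so the result is the directory containing
    the links, not the file cache directory of the link target.
    """
    if script_path is None:
        script_path = sys.argv[0]
    link_dir = os.path.dirname(os.path.abspath(script_path))
    if link_dir not in sys.path:
        sys.path.insert(0, link_dir)
    return link_dir


# Hypothetical use at the top of main.py:
#     add_script_dir_to_path()
#     import lib
```

The choice of abspath over realpath is deliberate here: realpath would resolve the link and point back into the per-file cache directory.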

Thanks,
Binglin

On Mon, Aug 12, 2013 at 2:50 PM, Andrei <[EMAIL PROTECTED]> wrote:

> (cross-posted from StackOverflow <http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>)
>
> I have a MapReduce job defined in the file *main.py*, which imports the module
> lib from the file *lib.py*. I use Hadoop Streaming to submit this job to a
> Hadoop cluster as follows:
>
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
>     -files lib.py,main.py \
>     -mapper "./main.py map" -reducer "./main.py reduce" \
>     -input input -output output
>
> As I understand it, this should put both main.py and lib.py into the
> *distributed cache folder* on each computing machine and thus make the module
> lib available to main. But that doesn't happen: from the log file I see that
> the files *are really copied* to the same directory, but main can't import
> lib and throws *ImportError*.
>
> Adding the current directory to the path didn't work:
>
> import os
> import sys
> sys.path.append(os.path.realpath(__file__))
> import lib  # ImportError
>
> Loading the module manually did the trick, though:
>
> import imp
> lib = imp.load_source('lib', 'lib.py')
>
> But that's not what I want. So why can the Python interpreter see other .py
> files in the same directory, but not import them? Note that I have already
> tried adding an empty __init__.py file to the same directory, without effect.
>
>
>