Andrei 2013-08-12, 06:50
Binglin Chang 2013-08-12, 08:33
Re: How to import custom Python module in MapReduce job?
Binglin Chang 2013-08-12, 10:12
Maybe you didn't specify a symlink name on your command line, so the symlink name
will be just lib.jar, and I am not sure how you import the lib module.
Please try this:
Put main.py and lib.py in the same archive file, e.g. app.zip:
-archives hdfs://hdfs-namenode/user/me/app.zip#app -mapper "app/main.py map" -reducer "app/main.py reduce"
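A minimal sketch of how the archive for the command above could be built, assuming main.py and lib.py sit in the current directory. The helper name build_app_archive is hypothetical. The files are stored flat (no leading directory), so that after unpacking they should end up side by side under the app symlink:

```python
import os
import zipfile


def build_app_archive(archive_path, sources):
    """Zip the given files flat, so they unpack next to each other."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for src in sources:
            # arcname drops any directory prefix, giving app/main.py and
            # app/lib.py once the archive is linked as "#app".
            zf.write(src, arcname=os.path.basename(src))
    return archive_path


# Hypothetical usage, run from the directory that holds both files:
# build_app_archive("app.zip", ["main.py", "lib.py"])
```

The resulting app.zip still has to be uploaded to HDFS before it can be passed to -archives. Whether the unpacked main.py keeps its executable bit is something I have not verified; if it does not, invoking it as "python app/main.py map" may be a workaround.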
On Mon, Aug 12, 2013 at 6:01 PM, Andrei <[EMAIL PROTECTED]> wrote:
> Hi Binglin,
> thanks for your explanation, now it makes sense. However, I'm not sure how
> to implement the suggested method.
> First of all, I found out that the `-cacheArchive` option is deprecated, so I
> had to use `-archives` instead. I put my `lib.py` into a directory `lib` and
> then zipped it to `lib.zip`. After that I uploaded the archive to HDFS and
> linked it in the call to the Streaming API as follows:
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files
> main.py -archives hdfs://hdfs-namenode/user/me/lib.jar -mapper
> "./main.py map" -reducer "./main.py reduce" -combiner "./main.py combine"
> -input input -output output
> But the script failed, and from the logs I see that lib.jar hasn't been unpacked.
> What am I missing?
> On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang <[EMAIL PROTECTED]> wrote:
>> The problem seems to be caused by symlinks: Hadoop uses a file cache, so every
>> file is in fact a symlink.
>> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
>> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
>> [root@master01 tmp]# ./main.py
>> Traceback (most recent call last):
>> File "./main.py", line 3, in ?
>> import lib
>> ImportError: No module named lib
>> This looks like a Python bug: when using import, it can't handle the symlink.
>> You can try to use a directory containing lib.py and use -cacheArchive,
>> so the symlink actually links to a directory; Python may handle this case.
>> On Mon, Aug 12, 2013 at 2:50 PM, Andrei <[EMAIL PROTECTED]> wrote:
>>> (cross-posted from StackOverflow <http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>)
>>> I have a MapReduce job defined in file *main.py*, which imports module
>>> lib from file *lib.py*. I use Hadoop Streaming to submit this job to
>>> a Hadoop cluster as follows:
>>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>> -files lib.py,main.py
>>> -mapper "./main.py map" -reducer "./main.py reduce"
>>> -input input -output output
>>> In my understanding, this should put both main.py and lib.py into the *distributed
>>> cache folder* on each computing machine and thus make module lib available
>>> to main. But that doesn't happen: from the log file I see that the files *are
>>> really copied* to the same directory, but main can't import lib.
>>> Adding the current directory to the path didn't work:
>>> import sys
>>> sys.path.append(os.path.realpath(__file__))
>>> import lib  # ImportError
>>> Loading the module manually, though, did the trick:
>>> import imp
>>> lib = imp.load_source('lib', 'lib.py')
>>> But that's not what I want. So why can the Python interpreter see other .py files
>>> in the same directory, but not import them? Note, I have already tried
>>> adding an empty __init__.py file to the same directory, without effect.
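On the sys.path attempt in the original question: the snippet appends the path of the script file itself, whereas sys.path entries need to be directories. Also, when a script is started through a symlink, Python resolves the link and puts the real file's directory on sys.path, which would be the cache location rather than the task's working directory. Below is a hedged sketch of an alternative, assuming lib.py is symlinked into the working directory of the task as the logs in this thread suggest. The helper name add_working_dir_to_path is hypothetical:

```python
import os
import sys


def add_working_dir_to_path(path_list=None):
    """Put the current working directory first on the module search path."""
    if path_list is None:
        path_list = sys.path
    cwd = os.getcwd()
    if cwd not in path_list:
        # A directory entry is what the import machinery searches;
        # the path of a .py file itself would be ignored.
        path_list.insert(0, cwd)
    return path_list


add_working_dir_to_path()
# import lib  # assumption: lib.py (or a symlink to it) is in the working directory
```

This has not been tested on a Hadoop cluster; it only illustrates the difference between adding a file path and adding a directory to the search path.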