Julian Bui 2013-02-07, 01:13
Re: Creating files through the hadoop streaming interface
Harsh J 2013-02-07, 05:18
The raw streaming interface has many issues of this kind. Moreover, Python's
open(…, 'w') calls won't open files on HDFS; they operate on the local
filesystem of the node running the task. Perhaps, since
you wish to use Python for its various advantages, check out this
detailed comparison guide of various Python-based Hadoop frameworks
(including the raw streaming we offer as part of Apache Hadoop) at
by Uri? Many of these provide Python extensions to HDFS/etc., letting
you do much more than plain streaming.
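
If plain streaming has to stay, one common workaround is to write the file
locally on the task node and then copy it into HDFS with the hadoop
command-line client. The following is only a minimal sketch under
assumptions: it assumes the "hadoop" binary is on the PATH of the task
nodes, and the helper names (hdfs_put_command, write_to_hdfs, main) are
made up for illustration. Note that "hadoop fs -put" fails if the
destination already exists, which matters when a task is retried.

```python
import os
import subprocess
import tempfile


def hdfs_put_command(local_path, hdfs_path, hadoop_bin="hadoop"):
    """Build the argv that copies a local file into HDFS (hypothetical helper)."""
    return [hadoop_bin, "fs", "-put", local_path, hdfs_path]


def write_to_hdfs(data, hdfs_path, run=subprocess.check_call):
    """Write bytes to a local temp file, then copy that file into HDFS.

    `run` is injectable so the function can be exercised without a cluster.
    """
    fd, local_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Assumption: the hadoop CLI is installed and on PATH on the task node.
        run(hdfs_put_command(local_path, hdfs_path))
    finally:
        # Always clean up the local scratch file, even if the copy fails.
        os.remove(local_path)


def main():
    # Same environment variable the original mapper reads.
    outdir = os.environ["mapred_output_dir"]
    write_to_hdfs(b"", outdir + "/testfile.txt")
```

A mapper script would call main() at the end. For an empty file,
"hadoop fs -touchz <path>" does the same job without the temp file.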
On Thu, Feb 7, 2013 at 6:43 AM, Julian Bui <[EMAIL PROTECTED]> wrote:
> Hi hadoop users,
> I am trying to use the streaming interface with my Python script mapper to
> create some files, but I am running into difficulties actually creating files
> on HDFS.
> I have a python script mapper with no reducers. Currently, it doesn't even
> read the input and instead reads in the env variable for the output dir
> (outdir = os.environ['mapred_output_dir']) and attempts to create an empty
> file at that location. However, that appears to fail with the [vague] error
> message appended to this email.
> I am using the streaming interface because the python file examples seem so
> much cleaner and abstract a lot of the details away for me but if I instead
> need to use the java bindings (and create a mapper and reducer class) then
> please let me know. I'm still learning hadoop. As I understand it, I
> should be able to create files in hadoop but perhaps there is limited
> ability while using the streaming i/o interface.
> Further questions: If my mapper absolutely must send my output to stdout, is
> there a way to rename the file after it has been created?
> Please help.
> Python mapper code:
> import os
> outdir = os.environ['mapred_output_dir']
> f = open(outdir + "/testfile.txt", "wb")
> f.close()
> 13/02/06 17:07:55 INFO streaming.StreamJob: map 100% reduce 100%
> 13/02/06 17:07:55 INFO streaming.StreamJob: To kill this job, run:
> 13/02/06 17:07:55 INFO streaming.StreamJob:
> /opt/hadoop/libexec/../bin/hadoop job
> -Dmapred.job.tracker=gcn-13-88.ibnet0:54311 -kill job_201302061706_0001
> 13/02/06 17:07:55 INFO streaming.StreamJob: Tracking URL:
> 13/02/06 17:07:55 ERROR streaming.StreamJob: Job not successful. Error: # of
> failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
> 13/02/06 17:07:55 INFO streaming.StreamJob: killJob...
> Streaming Command Failed!
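
On the renaming question: if the mapper only writes to stdout, the output
can be renamed after the job finishes with "hadoop fs -mv", run from the
machine that submitted the job. A rough sketch, assuming a map-only job
whose single output file is named part-00000 (check the actual name in your
output directory); rename_output and hdfs_mv_command are illustrative names,
not part of Hadoop.

```python
import subprocess


def hdfs_mv_command(src, dst, hadoop_bin="hadoop"):
    """Build the argv that renames (moves) a path inside HDFS."""
    return [hadoop_bin, "fs", "-mv", src, dst]


def rename_output(outdir, new_name, part="part-00000",
                  run=subprocess.check_call):
    """Rename one part file in the job output directory.

    `part` is an assumed default file name; `run` is injectable so the
    function can be tried without a cluster.
    """
    base = outdir.rstrip("/")
    src = base + "/" + part
    dst = base + "/" + new_name
    run(hdfs_mv_command(src, dst))
    return dst
```

For example, rename_output("/user/me/out", "testfile.txt") would issue
"hadoop fs -mv /user/me/out/part-00000 /user/me/out/testfile.txt".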
Simone Leo 2013-02-07, 15:39