Creating files through the hadoop streaming interface
Julian Bui 2013-02-07, 01:13
Harsh J 2013-02-07, 05:18
Re: Creating files through the hadoop streaming interface
Hello,

The lack of an HDFS API is just one of the drawbacks that motivated us
to abandon Streaming and develop Pydoop.  Unfortunately, in the blog
post cited by Harsh J, Pydoop is only mentioned briefly, because the
author failed to build and install it.

Here is how you solve your problem in Pydoop (for details on how to run
programs, see the docs at http://pydoop.sourceforge.net/docs):

import pydoop.pipes as pp
import pydoop.hdfs as hdfs

class Mapper(pp.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        jc = context.getJobConf()
        # Build a per-task file name under the job's output directory;
        # mapred.task.id identifies this particular task attempt.
        fname = "%s/%s" % (jc.get("mapred.output.dir"),
                           jc.get("mapred.task.id"))
        # Create the file, then reopen it in append mode for writing.
        self.fo = hdfs.open(fname, "w", user="simleo")
        self.fo.close()
        self.fo = hdfs.open(fname, "a", user="simleo")

    def map(self, context):
        # Write the length of each input value, one per line.
        l = len(context.getInputValue())
        self.fo.write("%d\n" % l)

    def close(self):
        self.fo.close()

class Reducer(pp.Reducer):
    pass

if __name__ == "__main__":
    pp.runTask(pp.Factory(Mapper, Reducer))
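
For concreteness, here is roughly what the file name built in __init__
looks like (the job and attempt IDs below are made-up values; the real
ones come from the job configuration at runtime):

out_dir = "/user/simleo/out"                       # mapred.output.dir
task_id = "attempt_201302061706_0001_m_000000_0"   # mapred.task.id
fname = "%s/%s" % (out_dir, task_id)
# -> /user/simleo/out/attempt_201302061706_0001_m_000000_0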

Note that I'm embedding the task attempt ID in the file name to avoid
clashes caused by different mappers trying to write to the same file at
the same time.
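
Once the job completes, you can read the per-task files back with the
same pydoop.hdfs module; a minimal sketch, assuming the made-up file
name from the example above:

import pydoop.hdfs as hdfs

# Assumed path: one of the per-task files written by the mapper above.
path = "/user/simleo/out/attempt_201302061706_0001_m_000000_0"
f = hdfs.open(path, "r", user="simleo")  # read mode
print f.read()  # one record length per line, as written by map()
f.close()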

Simone

On 02/07/2013 06:18 AM, Harsh J wrote:
> The raw streaming interface has many issues of this kind. Moreover, the
> Python open(…, 'w') calls won't open files on HDFS. Since you wish to
> use Python for its various advantages, perhaps check out this detailed
> comparison guide of various Python-based Hadoop frameworks (including
> the raw streaming we offer as part of Apache Hadoop) at
> http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
> by Uri. Many of these provide Python extensions to HDFS etc., letting
> you do much more than plain streaming.
>
> On Thu, Feb 7, 2013 at 6:43 AM, Julian Bui <[EMAIL PROTECTED]> wrote:
>> Hi hadoop users,
>>
>> I am trying to use the streaming interface with my Python script mapper
>> to create some files, but am running into difficulties actually creating
>> files on HDFS.
>>
>> I have a Python script mapper with no reducers.  Currently, it doesn't
>> even read the input; instead, it reads the env variable for the output
>> dir (outdir = os.environ['mapred_output_dir']) and attempts to create an
>> empty file at that location.  However, that appears to fail with the
>> [vague] error message appended to this email.
>>
>> I am using the streaming interface because the Python file examples seem
>> so much cleaner and abstract a lot of the details away for me, but if I
>> instead need to use the Java bindings (and create a mapper and reducer
>> class) then please let me know.  I'm still learning Hadoop.  As I
>> understand it, I should be able to create files in Hadoop, but perhaps
>> this ability is limited when using the streaming I/O interface.
>>
>> Further question: if my mapper absolutely must send its output to
>> stdout, is there a way to rename the file after it has been created?
>>
>> Please help.
>>
>> Thanks,
>> -Julian
>>
>> Python mapper code:
>> import os
>> outdir = os.environ['mapred_output_dir']
>> f = open(outdir + "/testfile.txt", "wb")
>> f.close()
>>
>>
>> 13/02/06 17:07:55 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/02/06 17:07:55 INFO streaming.StreamJob: To kill this job, run:
>> 13/02/06 17:07:55 INFO streaming.StreamJob:
>> /opt/hadoop/libexec/../bin/hadoop job
>> -Dmapred.job.tracker=gcn-13-88.ibnet0:54311 -kill job_201302061706_0001
>> 13/02/06 17:07:55 INFO streaming.StreamJob: Tracking URL:
>> http://gcn-13-88.ibnet0:50030/jobdetails.jsp?jobid=job_201302061706_0001
>> 13/02/06 17:07:55 ERROR streaming.StreamJob: Job not successful. Error: # of
>> failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
>> task_201302061706_0001_m_000000
>> 13/02/06 17:07:55 INFO streaming.StreamJob: killJob...
>> Streaming Command Failed!
>>
>
>
>
> --
> Harsh J
>

--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: [EMAIL PROTECTED]
http://www.crs4.it