python streaming error (Hadoop dev mailing list)


Thread:
  springring       2013-01-12, 08:30
  Nitin Pawar      2013-01-12, 08:34
  springring       2013-01-12, 08:55
  Nitin Pawar      2013-01-12, 08:58
  springring       2013-01-14, 01:27
  springring       2013-01-14, 01:53
  Andy Isaacson    2013-01-14, 22:19
  Andy Isaacson    2013-01-14, 22:24

Re: python streaming error
Hello,

you can use the Pydoop HDFS API to work with HDFS files:

 >>> import pydoop.hdfs as hdfs
 >>> with hdfs.open('hdfs://localhost:8020/user/myuser/filename') as f:
...     for line in f:
...             do_something(line)

As you can see, the API is very similar to that of ordinary Python file
objects.  Check out the following tutorial for more details:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html
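
For instance, the streaming mapper from the original post could be rewritten
to go through pydoop.hdfs instead of the built-in open().  This is only a
rough sketch: the NameNode address is copied from the interactive example
above and the catalog3 path from the original mail below, so both would need
to be adapted:

    #!/usr/bin/env python

    # Sketch: streaming mapper that reads each referenced file via
    # pydoop.hdfs rather than the built-in open(), which cannot
    # handle hdfs:// URLs.
    import sys
    import pydoop.hdfs as hdfs

    for line in sys.stdin:
        offset, filename = line.rstrip("\n").split("\t")
        path = "hdfs://localhost:8020/user/hdfs/catalog3/" + filename
        with hdfs.open(path) as f:
            for record in f:
                sys.stdout.write(record)  # emit whatever your job needs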

Note that Pydoop also has a MapReduce API, so you can use it to rewrite
the whole program:

http://pydoop.sourceforge.net/docs/tutorial/mapred_api.html
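
To give a flavor of it, here is a minimal word-count-style sketch along the
lines of that tutorial; it assumes the older pipes-style API, and the exact
module and method names may differ between Pydoop versions:

    #!/usr/bin/env python

    # Rough sketch of a Pydoop MapReduce (pipes) job; see the MapReduce
    # API tutorial above for the authoritative version.
    import pydoop.pipes as pp

    class Mapper(pp.Mapper):
        def map(self, context):
            for word in context.getInputValue().split():
                context.emit(word, "1")

    class Reducer(pp.Reducer):
        def reduce(self, context):
            total = 0
            while context.nextValue():
                total += int(context.getInputValue())
            context.emit(context.getInputKey(), str(total))

    if __name__ == "__main__":
        pp.runTask(pp.Factory(Mapper, Reducer))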

It also has a more compact and easy-to-use scripting engine for simple
applications:

http://pydoop.sourceforge.net/docs/tutorial/pydoop_script.html
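
With the script engine, a simple job boils down to a couple of plain
functions in a module that you hand to the pydoop script command.  A minimal
sketch (check the tutorial for the exact function signatures):

    # wc.py -- sketch of a Pydoop Script module (word count)
    def mapper(key, value, writer):
        for word in value.split():
            writer.emit(word, 1)

    def reducer(word, counts, writer):
        writer.emit(word, sum(map(int, counts)))

which you would then launch with something like
"pydoop script wc.py input_dir output_dir".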

If you think Pydoop is right for you, read the installation guide:

http://pydoop.sourceforge.net/docs/installation.html

Simone

On 01/14/2013 11:24 PM, Andy Isaacson wrote:
> Oh, another link I should have included!
> http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
>
> -andy
>
> On Mon, Jan 14, 2013 at 2:19 PM, Andy Isaacson <[EMAIL PROTECTED]> wrote:
>> Hadoop Streaming does not magically teach Python open() how to read
>> from "hdfs://" URLs. You'll need to use a library or fork a "hdfs dfs
>> -cat" to read the file for you.
>>
>> A few links that may help:
>>
>> http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
>> http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs
>> https://bitbucket.org/turnaev/cyhdfs
>>
>> -andy
>>
>> On Sat, Jan 12, 2013 at 12:30 AM, springring <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>>       When I run the code below as a streaming job, the job fails with error N/A and gets killed.  Running it step by step, I find it errors at
>>> " file_obj = open(file) ".  When I run the same code outside of Hadoop, everything is OK.
>>>
>>> #!/bin/env python
>>>
>>> import sys
>>>
>>> for line in sys.stdin:
>>>     offset,filename = line.split("\t")
>>>     file = "hdfs://user/hdfs/catalog3/" + filename
>>>     print line
>>>     print filename
>>>     print file
>>>     file_obj = open(file)
>>> ..................................
>>>

--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: [EMAIL PROTECTED]
http://www.crs4.it