Re: python streaming error
Hello,

you can use the Pydoop HDFS API to work with HDFS files:

 >>> import pydoop.hdfs as hdfs
 >>> with hdfs.open('hdfs://localhost:8020/user/myuser/filename') as f:
 ...     for line in f:
 ...         do_something(line)

As you can see, the API is very similar to that of ordinary Python file
objects.  Check out the following tutorial for more details:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

Note that Pydoop also has a MapReduce API, so you can use it to rewrite
the whole program:

http://pydoop.sourceforge.net/docs/tutorial/mapred_api.html

It also has a more compact and easy-to-use scripting engine for simple
applications:

http://pydoop.sourceforge.net/docs/tutorial/pydoop_script.html
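For instance, a word count in the scripting style looks roughly like the
sketch below.  This is a hypothetical example based on the tutorial linked
above (the `(key, value, writer)` signature and `writer.emit` follow the
tutorial; it has not been run against a real cluster):

```python
# Hypothetical word-count mapper/reducer in the Pydoop Script style.
# Each function receives a key, a value, and a writer object; results
# are emitted as (key, value) pairs via writer.emit().

def mapper(_, text, writer):
    # Emit (word, 1) for every word in the input line.
    for word in text.split():
        writer.emit(word, 1)

def reducer(word, counts, writer):
    # Counts arrive as strings from the framework, so convert before summing.
    writer.emit(word, sum(int(c) for c in counts))
```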

If you think Pydoop is right for you, read the installation guide:

http://pydoop.sourceforge.net/docs/installation.html
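If installing Pydoop is not an option, you can instead fork an "hdfs dfs
-cat" process and read its output, as Andy suggested below.  A minimal
sketch (the `cmd` parameter is an assumption added here so the command can
be swapped out, e.g. for testing without a cluster):

```python
import subprocess

def hdfs_cat_lines(path, cmd=("hdfs", "dfs", "-cat")):
    # Fork the given command (by default "hdfs dfs -cat <path>") and
    # stream the file's lines from its stdout, one line at a time.
    proc = subprocess.Popen(list(cmd) + [path],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()
```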

Simone

On 01/14/2013 11:24 PM, Andy Isaacson wrote:
> Oh, another link I should have included!
> http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
>
> -andy
>
> On Mon, Jan 14, 2013 at 2:19 PM, Andy Isaacson <[EMAIL PROTECTED]> wrote:
>> Hadoop Streaming does not magically teach Python open() how to read
>> from "hdfs://" URLs. You'll need to use a library or fork an "hdfs dfs
>> -cat" process to read the file for you.
>>
>> A few links that may help:
>>
>> http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
>> http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs
>> https://bitbucket.org/turnaev/cyhdfs
>>
>> -andy
>>
>> On Sat, Jan 12, 2013 at 12:30 AM, springring <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>>       When I run the code below as a streaming job, it fails with error N/A and is killed.  Running it step by
>>> step, I found it fails at " file_obj = open(file) ".  When I run the same code outside of Hadoop, everything is ok.
>>>
>>> #!/bin/env python
>>>
>>> import sys
>>>
>>> for line in sys.stdin:
>>>     offset, filename = line.split("\t")
>>>     file = "hdfs://user/hdfs/catalog3/" + filename
>>>     print line
>>>     print filename
>>>     print file
>>>     file_obj = open(file)
>>> ..................................
>>>

--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: [EMAIL PROTECTED]
http://www.crs4.it