pydoop -- Python MapReduce and HDFS API for Hadoop
Simone Leo 2009-11-06, 17:20
Hello everybody,

we recently released pydoop, a Python MapReduce and HDFS API for Hadoop:


It is implemented as a Boost.Python wrapper around the C++ code (pipes
and libhdfs). It allows you to write complete MapReduce applications in
CPython, with the same capabilities as the C++ API. Here is a minimal
wordcount example:

from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

  def __init__(self, context):
    super(WordCountMapper, self).__init__(context)

  def map(self, context):
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")

class WordCountReducer(Reducer):

  def __init__(self, context):
    super(WordCountReducer, self).__init__(context)

  def reduce(self, context):
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))
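
If you want to check the map and reduce logic above before submitting a
job, a rough local dry run along these lines may help. It is only a
sketch: FakeMapContext, FakeReduceContext, map_line, reduce_key and
run_local are hypothetical names invented for illustration, not part of
pydoop, and the fake contexts mimic only the context methods used in
the example (getInputKey, getInputValue, nextValue, emit).

```python
# Local dry run of the word-count logic; needs neither Hadoop nor pydoop.
# All names below are hypothetical stand-ins, not pydoop classes.
from collections import defaultdict


class FakeMapContext(object):
    """Mimics the map-side context methods used in the example."""

    def __init__(self, value):
        self._value = value
        self.emitted = []

    def getInputValue(self):
        return self._value

    def emit(self, key, value):
        self.emitted.append((key, value))


class FakeReduceContext(object):
    """Mimics the reduce-side context methods used in the example."""

    def __init__(self, key, values):
        self._key = key
        self._values = iter(values)
        self._current = None
        self.emitted = []

    def getInputKey(self):
        return self._key

    def getInputValue(self):
        return self._current

    def nextValue(self):
        # Advance to the next value; return False when none are left.
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    def emit(self, key, value):
        self.emitted.append((key, value))


def map_line(context):
    # Same body as WordCountMapper.map
    for w in context.getInputValue().split():
        context.emit(w, "1")


def reduce_key(context):
    # Same body as WordCountReducer.reduce
    s = 0
    while context.nextValue():
        s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))


def run_local(lines):
    # Map every line, group emitted values by key, then reduce each key.
    grouped = defaultdict(list)
    for line in lines:
        ctx = FakeMapContext(line)
        map_line(ctx)
        for k, v in ctx.emitted:
            grouped[k].append(v)
    result = {}
    for k in sorted(grouped):
        ctx = FakeReduceContext(k, grouped[k])
        reduce_key(ctx)
        result.update(ctx.emitted)
    return result
```

Calling run_local(["a b a", "b c"]) groups the emitted "1" strings by
word and sums them, so the counts come back as strings, just as the
reducer above emits them.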

Any feedback would be greatly appreciated.

Simone Leo
