Basic hadoop MR question
Hi,
I have a quick question. I am trying to write MapReduce code in Python,
following the word count example:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

About the reducer: why can't I declare a dictionary (hashmap) in the reducer
whose key is the word and whose value is a list of counts (all 1's here)?

So something like:

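import sys
from collections import defaultdict

# key: word, value: list of counts (always 1 here)
data_dict = defaultdict(list)
for line in sys.stdin:
    tokens = line.strip().split("\t")
    data_dict[tokens[0]].append(1)

# tab-separated output, as in the tutorial's reducer
for k, v in data_dict.items():
    print('%s\t%s' % (k, sum(v)))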

Also, in the reducer code from the link, why are the following
lines needed?
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
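
For context, here is a minimal sketch of the pattern those lines seem to belong to. It assumes the reducer input arrives sorted by word, which is what Hadoop Streaming hands to a reducer. The function name reduce_sorted and the sample input are hypothetical and not taken from the tutorial.

```python
def reduce_sorted(lines):
    """Sum counts per word from tab-separated 'word, count' lines sorted by word."""
    current_word = None
    current_count = 0
    results = []
    for line in lines:
        word, count = line.strip().split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            # A new word starts, so the previous word's total is complete.
            if current_word is not None:
                results.append((current_word, current_count))
            current_word = word
            current_count = count
    # The loop only emits a word when the next word shows up, so the
    # final word is still pending here and has to be emitted separately.
    if current_word is not None:
        results.append((current_word, current_count))
    return results


# Hypothetical sample input, shaped like the mapper output in the tutorial.
print(reduce_sorted(["a\t1\n", "a\t1\n", "b\t1\n"]))
```

In this sketch, dropping the block after the loop would lose the count for "b", since no later word arrives to trigger its output. Unlike the dictionary approach, only one word and one running count are held in memory at a time.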

The code is well commented, though. :( My apologies for asking naive
questions.
Thanks