MapReduce >> mail # user >> Basic hadoop MR question


Basic hadoop MR question
Hi,
I have a quick question. I am trying to write MR code using Python.
In the word count example:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

About the reducer:
Why can't I declare a dictionary (hashmap) in the reducer whose key is the
word and whose value is a list of counts (1's here)?

So something like:

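import sys
from collections import defaultdict

# word -> list of 1's, one per occurrence seen on stdin
data_dict = defaultdict(list)
for line in sys.stdin:
    tokens = line.strip().split("\t")
    data_dict[tokens[0]].append(1)

# emit tab-separated word and total, as Hadoop streaming expects
for k, v in data_dict.items():
    print('%s\t%s' % (k, sum(v)))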

Also, in the reducer code mentioned in the link, why are the following
lines needed:
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

Though the code is well commented.. :( My apologies for asking naive
questions.
Thanks
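
For reference, the pattern behind the "do not forget to output the last word" lines can be sketched roughly as follows. This is a minimal illustration, not the tutorial's code: the function name reduce_sorted and the flush_last switch are made up here, and the sketch assumes the reducer input is already sorted by word, which is how Hadoop streaming delivers mapper output to a reducer. A total is only emitted inside the loop when the next, different word arrives, so the last word in the input would be dropped without the final step.

```python
def reduce_sorted(lines, flush_last=True):
    """Sum counts per word from tab-separated 'word<TAB>count' lines.

    Assumes the lines are already sorted by word (hypothetical helper,
    written only to illustrate the pattern).
    """
    results = []
    current_word = None
    current_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t", 1)
        try:
            count = int(count)
        except ValueError:
            continue  # skip lines whose count is not a number
        if word == current_word:
            current_count += count
        else:
            # The word changed, so the previous word's total is complete.
            if current_word is not None:
                results.append((current_word, current_count))
            current_word = word
            current_count = count
    # Nothing follows the last word, so the loop never emits it.
    # It has to be emitted here, after the input is exhausted.
    if flush_last and current_word is not None:
        results.append((current_word, current_count))
    return results


sample = ["bar\t1", "foo\t1", "foo\t1", "zoo\t1"]
with_flush = reduce_sorted(sample)
without_flush = reduce_sorted(sample, flush_last=False)
print(with_flush)
print(without_flush)
```

With flush_last=False the result simply lacks the final word. The dictionary version in the question does not need such a step because it prints only after reading all input, at the cost of keeping every word in memory at once.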