Finding mean and median python streaming
Very dumb question.
I have data like the following:
id1, value
1, 20.2
1,20.4
....

I want to find the mean and median of the values for each id1.
I am using Python with Hadoop streaming.
mapper.py
import os
import sys

for line in sys.stdin:
    try:
        # strip the trailing newline
        line = line.rstrip(os.linesep)
        tokens = line.split(",")
        # emit id,value (comma-separated, no tab)
        print('%s,%s' % (tokens[0], tokens[1]))
    except Exception:
        continue
reducer.py
import os
import sys
from collections import defaultdict

# values collected per id
data_dict = defaultdict(list)

def mean(data_list):
    return sum(data_list) / float(len(data_list)) if len(data_list) else 0

def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
    return sorts[length // 2]

for line in sys.stdin:
    try:
        line = line.rstrip(os.linesep)
        serial_id, duration = line.split(",")
        data_dict[serial_id].append(float(duration))
    except Exception:
        # skip the header row and malformed lines
        pass

# one output line per id seen by this reducer
for k, v in data_dict.items():
    print("%s,%s,%s" % (k, mean(v), median(v)))

I am expecting a single mean and median for each key,
but I see id1 duplicated with different means and medians.
Any suggestions?
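
One possible explanation, assuming the job runs with default streaming settings and more than one reducer: Hadoop streaming treats everything up to the first tab in a mapper output line as the key, and a line with no tab is taken as the key in its entirety. With comma-separated output, "1,20.2" and "1,20.4" are then different keys and can be sent to different reducers, each of which prints its own mean and median for id 1. The sketch below is one way to restructure the two steps so the id alone is the key. The function names map_lines and reduce_lines are made up for illustration, and it has not been run on a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Turn 'id,value' input lines into 'id<TAB>value' records."""
    for line in lines:
        tokens = line.strip().split(",")
        if len(tokens) != 2:
            continue  # skip malformed lines
        try:
            value = float(tokens[1])
        except ValueError:
            continue  # skip the header row and non-numeric values
        # The tab makes the id alone the streaming key.
        yield "%s\t%s" % (tokens[0].strip(), value)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2.0
    return ordered[mid]


def reduce_lines(lines):
    """Aggregate 'id<TAB>value' lines that arrive sorted by id.

    Assumes every line contains a tab, as produced by map_lines.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(records, key=lambda rec: rec[0]):
        values = [float(value) for _, value in group]
        yield "%s,%s,%s" % (key, sum(values) / len(values), median(values))


# Local stand-in for the shuffle: sorting puts equal ids next to each other.
sample = ["id1, value", "1, 20.2", "1,20.4", "2, 5.0"]
for result in reduce_lines(sorted(map_lines(sample))):
    print(result)
```

In a real job each function would sit in its own script reading sys.stdin and printing its results. If changing the mapper output is not an option, the streaming documentation also describes a stream.map.output.field.separator setting for choosing a separator other than tab.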