Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
MapReduce >> mail # user >> Finding mean and median python streaming


Copy link to this message
-
Finding mean and median python streaming
Very dumb question..
I have data as following
id1, value
1, 20.2
1,20.4
....

I want to find the mean and median of id1?
I am using python hadoop streaming..
mapper.py
for line in sys.stdin:
try:
# remove leading and trailing whitespace
line = line.rstrip(os.linesep)
tokens = line.split(",")
 print '%s,%s' % (tokens[0],tokens[1])
except Exception:
continue
reducer.py
def mean(data_list):
return sum(data_list)/float(len(data_list)) if len(data_list) else 0
def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
    return sorts[length / 2]
for line in sys.stdin:
try:
line = line.rstrip(os.linesep)
serial_id, duration = line.split(",")
data_dict[serial_id].append(float(duration))
except Exception:
pass
for k,v in data_dict.items():
print "%s,%s,%s" %(k, mean(v), median(v))

I am expecting a single mean,median to each key
But I see id1 duplicated with different mean and median..
Any suggestions?
+
jamal sasha 2013-04-04, 20:24
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB