HDFS >> mail # user >> Re: Finding mean and median python streaming


Re: Finding mean and median python streaming
data_dict is declared globally as
data_dict = defaultdict(list)
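
Putting the pieces together, here is a minimal self-contained sketch of that reducer with data_dict declared at module level (Python 3 syntax here; the `reduce_lines` helper is a name introduced for illustration, not from the original script):

```python
import sys
from collections import defaultdict

# Declared at module level ("globally"), as noted above.
data_dict = defaultdict(list)

def mean(data_list):
    # Arithmetic mean; 0 for an empty list.
    return sum(data_list) / len(data_list) if data_list else 0

def median(mylist):
    # Middle element of the sorted list; average of the
    # two middle elements when the length is even.
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
    return sorts[length // 2]

def reduce_lines(lines):
    # Collect every value under its key, then emit one
    # "key,mean,median" record per key.
    for line in lines:
        try:
            serial_id, duration = line.strip().split(",")
            data_dict[serial_id].append(float(duration))
        except ValueError:
            continue
    return ["%s,%s,%s" % (k, mean(v), median(v))
            for k, v in sorted(data_dict.items())]

if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Run this way, each key appears exactly once in the output of a single reducer process.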
On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Very dumb question..
> I have data like the following:
> id1, value
> 1,20.2
> 1,20.4
> ....
>
> I want to find the mean and median of id1.
> I am using Python Hadoop streaming.
> mapper.py
>
> import sys
> import os
>
> for line in sys.stdin:
>     try:
>         # strip the trailing newline
>         line = line.rstrip(os.linesep)
>         tokens = line.split(",")
>         print '%s,%s' % (tokens[0], tokens[1])
>     except Exception:
>         continue
>
>
> reducer.py
>
> import sys
> import os
>
> def mean(data_list):
>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>
> def median(mylist):
>     sorts = sorted(mylist)
>     length = len(sorts)
>     if not length % 2:
>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>     return sorts[length / 2]
>
> for line in sys.stdin:
>     try:
>         line = line.rstrip(os.linesep)
>         serial_id, duration = line.split(",")
>         data_dict[serial_id].append(float(duration))
>     except Exception:
>         pass
>
> for k, v in data_dict.items():
>     print "%s,%s,%s" % (k, mean(v), median(v))
>
>
>
> I am expecting a single mean and median for each key,
> but I see id1 duplicated with different means and medians.
> Any suggestions?
>
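
A streaming-idiomatic alternative worth considering (a sketch, under the assumption that Hadoop streaming delivers each reducer's input sorted by key, so all values for a key arrive contiguously): detect key boundaries and emit one record per key instead of buffering every key in a dict. The `stats_line` and `run` names are illustrative, not from the original scripts; Python 3 syntax.

```python
import sys

def stats_line(key, values):
    # Format one "key,mean,median" record.
    values = sorted(values)
    n = len(values)
    avg = sum(values) / n
    mid = n // 2
    med = values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2.0
    return "%s,%s,%s" % (key, avg, med)

def run(lines):
    # Assumes lines arrive sorted by key (as Hadoop streaming
    # guarantees for a reducer's input); emits a record each
    # time the key changes, plus one for the final key.
    out = []
    current_key, values = None, []
    for line in lines:
        try:
            key, value = line.strip().split(",")
        except ValueError:
            continue
        if current_key is not None and key != current_key:
            out.append(stats_line(current_key, values))
            values = []
        current_key = key
        values.append(float(value))
    if current_key is not None:
        out.append(stats_line(current_key, values))
    return out

if __name__ == "__main__":
    for record in run(sys.stdin):
        print(record)
```

This keeps only one key's values in memory at a time, and a given key can never be printed twice by the same reducer process.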