HDFS, mail # user - Re: Finding mean and median python streaming


Re: Finding mean and median python streaming
jamal sasha 2013-04-01, 21:27
data_dict is declared globally as
data_dict = defaultdict(list)
On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Very dumb question:
> I have data like the following:
> id1, value
> 1, 20.2
> 1,20.4
> ....
>
> I want to find the mean and median of the values for each id1.
> I am using Python with Hadoop Streaming.
> mapper.py
> import os
> import sys
>
> for line in sys.stdin:
>     try:
>         # strip the trailing line separator
>         line = line.rstrip(os.linesep)
>         tokens = line.split(",")
>         print('%s,%s' % (tokens[0], tokens[1]))
>     except Exception:
>         continue
>
>
> reducer.py
> import os
> import sys
> from collections import defaultdict
>
> data_dict = defaultdict(list)
>
> def mean(data_list):
>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>
> def median(mylist):
>     sorts = sorted(mylist)
>     length = len(sorts)
>     if not length % 2:
>         return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
>     return sorts[length // 2]
>
> for line in sys.stdin:
>     try:
>         line = line.rstrip(os.linesep)
>         serial_id, duration = line.split(",")
>         data_dict[serial_id].append(float(duration))
>     except Exception:
>         pass
>
> for k, v in data_dict.items():
>     print("%s,%s,%s" % (k, mean(v), median(v)))
>
>
>
> I am expecting a single mean and median for each key,
> but I see id1 duplicated with different means and medians.
> Any suggestions?
>
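
One possible cause, assuming the job runs with more than one reducer and the default streaming settings: Hadoop Streaming treats everything up to the first tab on a mapper output line as the key. With comma-separated output there is no tab, so the whole line becomes the key, and rows for the same id can be sent to different reducers, each of which prints its own mean and median. A minimal sketch of a tab-separated variant follows. The names map_lines and reduce_lines are made up for illustration, and the reducer assumes its input arrives sorted by key, as Hadoop delivers it.

```python
from itertools import groupby


def map_lines(lines):
    """Yield 'id<TAB>value' so the id alone is the streaming key."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # skip malformed rows
        key, value = parts[0].strip(), parts[1].strip()
        try:
            float(value)
        except ValueError:
            continue  # skip rows whose value is not numeric, e.g. a header
        yield "%s\t%s" % (key, value)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def reduce_lines(lines):
    """Yield 'id,mean,median' per key; assumes well-formed, key-sorted input."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        # All values for one key are held in memory to compute the median.
        values = [float(v) for _, v in group]
        yield "%s,%s,%s" % (key, sum(values) / len(values), median(values))
```

In mapper.py this would be driven by printing each item of map_lines(sys.stdin), and in reducer.py by printing each item of reduce_lines(sys.stdin). Grouping consecutive rows with groupby replaces the global dictionary, since every row for a given key then reaches the same reducer in one contiguous run.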