HDFS, mail # user - Re: Finding mean and median python streaming


jamal sasha 2013-04-01, 21:27
Re: Finding mean and median python streaming
jamal sasha 2013-04-01, 23:35
Pinging again.
Let me rephrase the question.
If my data looks like:
id, value

and I want to find the average "value" for each id, how can I do that using
Hadoop streaming?
I am sure it should be very straightforward, but apparently my understanding
of how code works in Hadoop streaming is not right.
I would really appreciate it if someone could help me with this query.
Thanks
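
For reference, here is a minimal sketch of one way to get a per-id average with streaming. It assumes the input lines look like "id, value", that the mapper writes "id<TAB>value" so the id alone becomes the streaming key, and that the reducer receives its lines sorted by key. The names map_lines and reduce_lines are made up for illustration, and sorted() stands in for the shuffle/sort step.

```python
from itertools import groupby


def map_lines(lines):
    """Turn 'id, value' lines into 'id<TAB>value' so the id alone is the key."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts[0].strip(), parts[1].strip()
        try:
            float(value)
        except ValueError:
            continue  # skip non-numeric rows such as an 'id, value' header
        yield "%s\t%s" % (key, value)


def reduce_lines(lines):
    """Average the values of each id; expects lines already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield "%s\t%s" % (key, sum(values) / len(values))


# Local dry run: sorted() plays the role of the shuffle/sort between the steps.
mapped = sorted(map_lines(["id, value", "1, 20.2", "1,20.4", "2, 5.0"]))
for out in reduce_lines(mapped):
    print(out)
```

In an actual job, each function would live in its own script and be fed sys.stdin instead of the in-memory list used here.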

On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> data_dict is declared globally as
> data_dict = defaultdict(list)
>
>
> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>
>> Very dumb question.
>> I have data like the following:
>> id1, value
>> 1, 20.2
>> 1,20.4
>> ....
>>
>> I want to find the mean and median of value for each id1.
>> I am using Python with Hadoop streaming.
>> mapper.py
>> import os
>> import sys
>>
>> for line in sys.stdin:
>>     try:
>>         # remove the trailing line separator
>>         line = line.rstrip(os.linesep)
>>         tokens = line.split(",")
>>         print('%s,%s' % (tokens[0], tokens[1]))
>>     except Exception:
>>         continue
>>
>>
>> reducer.py
>> import os
>> import sys
>> from collections import defaultdict
>>
>> # values collected per id
>> data_dict = defaultdict(list)
>>
>> def mean(data_list):
>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>
>> def median(mylist):
>>     sorts = sorted(mylist)
>>     length = len(sorts)
>>     if not length % 2:
>>         return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
>>     return sorts[length // 2]
>>
>> for line in sys.stdin:
>>     try:
>>         line = line.rstrip(os.linesep)
>>         serial_id, duration = line.split(",")
>>         data_dict[serial_id].append(float(duration))
>>     except Exception:
>>         pass
>>
>> for k, v in data_dict.items():
>>     print("%s,%s,%s" % (k, mean(v), median(v)))
>>
>>
>>
>> I am expecting a single mean and median for each key,
>> but I see id1 duplicated with different means and medians.
>> Any suggestions?
>>
>
>
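
One possible explanation for the duplicated id1 rows, assuming the job ran with more than one reducer (the thread does not say): by default, streaming treats the text before the first tab as the key, and a line with no tab is treated entirely as the key. The mapper above writes "id,value" with a comma, so each distinct line would be its own key, and lines for the same id could be sent to different reducers, each printing its own mean and median. The sketch below only illustrates that effect. Its partition function is a toy stand-in based on crc32, not Hadoop's actual partitioner, and reducers_per_id is a made-up helper.

```python
import zlib


def partition(key, num_reducers):
    """Toy stand-in for a hash partitioner; not Hadoop's actual implementation."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducers_per_id(records, num_reducers):
    """Map each id to the set of reducers that would receive its records."""
    seen = {}
    for record in records:
        key = record.split("\t", 1)[0]  # streaming key: text before the first tab
        record_id = key.split(",", 1)[0].strip()
        seen.setdefault(record_id, set()).add(partition(key, num_reducers))
    return seen


# No tab in the mapper output: the whole line acts as the key.
comma_output = ["1,20.2", "1,20.4", "1,20.6", "1,20.8"]
# Tab after the id: the id alone acts as the key.
tab_output = [r.replace(",", "\t", 1) for r in comma_output]

print(reducers_per_id(comma_output, 4))
print(reducers_per_id(tab_output, 4))
```

With tab-separated output, every record for an id maps to the same reducer in this simulation, so each id would be reported once. With comma-separated output, the same id may be spread over several reducers.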
Yanbo Liang 2013-04-02, 09:14