HDFS, mail # user - Re: Finding mean and median python streaming


Re: Finding mean and median python streaming
Yanbo Liang 2013-04-02, 09:14
How many Reducers did you start for this job?
If you start multiple Reducers, the job will produce multiple output files
named part-*****, and each part holds only the local mean and median values
for the keys that went to that particular Reducer's partition.

There are two kinds of solutions:
1. Call setNumReduceTasks(1) to set the Reducer count to 1; the job will then
produce a single output file, and each distinct key will yield exactly one
mean and median value.
2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code. It
processes all the output files produced by the multiple Reducers with a local
function and produces the final result, as in the sketch below.
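
For solution 1 with streaming, if I remember the options right, the Reducer
count can be set on the streaming command line, e.g. -numReduceTasks 1 or
-D mapred.reduce.tasks=1.

For solution 2, here is a rough local merge sketch (not the actual WordMedian
code), assuming the Reducers are left as identity reducers so every part-*
file still contains raw "id,value" lines; the script name merge_stats.py and
the job_output directory are just placeholders:

merge_stats.py
import glob
from collections import defaultdict

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    sorts = sorted(values)
    n = len(sorts)
    if n % 2 == 0:
        return (sorts[n // 2] + sorts[n // 2 - 1]) / 2.0
    return sorts[n // 2]

# pull every Reducer's partial output back together locally
values_by_id = defaultdict(list)
for path in glob.glob("job_output/part-*"):  # placeholder output directory
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split(",")
            values_by_id[key].append(float(value))

# one global mean and median per id
for key, values in values_by_id.items():
    print "%s,%s,%s" % (key, mean(values), median(values))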

BR
Yanbo
2013/4/2 jamal sasha <[EMAIL PROTECTED]>

> Pinging again.
> Let me rephrase the question.
> If my data is like:
> id, value
>
> And I want to find the average "value" for each id, how can I do that using
> hadoop streaming?
> I am sure it should be very straightforward, but apparently my
> understanding of how code works in hadoop streaming is not right.
> I would really appreciate it if someone could help me with this query.
> Thanks
>
>
>
> On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>
>> data_dict is declared globally as
>> data_dict = defaultdict(list)
>>
>>
>> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>
>>> Very dumb question..
>>> I have data as following
>>> id1, value
>>> 1, 20.2
>>> 1,20.4
>>> ....
>>>
>>> I want to find the mean and median of the values for id1.
>>> I am using python hadoop streaming..
>>> mapper.py
>>> import os
>>> import sys
>>>
>>> for line in sys.stdin:
>>>     try:
>>>         # strip the trailing newline and re-emit the line as id,value
>>>         line = line.rstrip(os.linesep)
>>>         tokens = line.split(",")
>>>         print '%s,%s' % (tokens[0], tokens[1])
>>>     except Exception:
>>>         continue
>>>
>>>
>>> reducer.py
>>> import os
>>> import sys
>>> from collections import defaultdict
>>>
>>> # data_dict is declared globally
>>> data_dict = defaultdict(list)
>>>
>>> def mean(data_list):
>>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>>
>>> def median(mylist):
>>>     sorts = sorted(mylist)
>>>     length = len(sorts)
>>>     if not length % 2:
>>>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>>>     return sorts[length / 2]
>>>
>>> # accumulate every value per id
>>> for line in sys.stdin:
>>>     try:
>>>         line = line.rstrip(os.linesep)
>>>         serial_id, duration = line.split(",")
>>>         data_dict[serial_id].append(float(duration))
>>>     except Exception:
>>>         pass
>>>
>>> # emit one mean and median per id after all input is read
>>> for k, v in data_dict.items():
>>>     print "%s,%s,%s" % (k, mean(v), median(v))
>>>
>>>
>>>
>>> I am expecting a single mean and median for each key,
>>> but I see id1 duplicated with different mean and median values.
>>> Any suggestions?
>>>
>>
>>
>
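
For reference, a rough sketch of an alternative streaming reducer for the
id,value question above (not from this thread; the file name alt_reducer.py
is just a placeholder). Instead of buffering a dict of all keys, it relies on
streaming handing the reducer its input sorted by key and groups consecutive
lines per id with itertools.groupby:

alt_reducer.py
# a sketch, not code from this thread
import sys
from itertools import groupby

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    sorts = sorted(values)
    n = len(sorts)
    if n % 2 == 0:
        return (sorts[n // 2] + sorts[n // 2 - 1]) / 2.0
    return sorts[n // 2]

def parse(stream):
    # streaming delivers reducer input already sorted by key
    for line in stream:
        key, value = line.rstrip("\n").split(",")
        yield key, float(value)

# group consecutive lines that share the same id and emit one line per id
for key, group in groupby(parse(sys.stdin), lambda kv: kv[0]):
    values = [v for _, v in group]
    print "%s,%s,%s" % (key, mean(values), median(values))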