
MapReduce >> mail # user >> Finding mean and median python streaming


Re: Finding mean and median python streaming
Hi,
  I have a question.
Why do I need to set the number of reducers to 1?
Since the keys are sorted, shouldn't all records with the same key go to the same reducer?
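For context on that question: by default Hadoop assigns each key to a reducer by hashing it, so all records with the same key do go to one reducer, but different keys are spread across all reducers (and hence across multiple part files). A rough Python sketch of the idea (the reducer count and keys below are made up for illustration; the real partitioner hashes the key bytes in Java):

```python
# Sketch of hash partitioning: same key -> same reducer,
# different keys spread across the available reducers.
num_reducers = 3

def partition(key, num_reducers):
    # Illustrative stand-in for Hadoop's default HashPartitioner.
    return hash(key) % num_reducers

for k in ["id1", "id2", "id3", "id1"]:
    print("%s -> reducer %d" % (k, partition(k, num_reducers)))
```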

On Tue, Apr 2, 2013 at 2:14 AM, Yanbo Liang <[EMAIL PROTECTED]> wrote:

> How many Reducers did you start for this job?
> If you start several Reducers, the job will produce multiple output files
> named part-*****, and each part holds only the mean and median for the
> keys in that Reducer's partition.
>
> Two kinds of solutions:
> 1. Call setNumReduceTasks(1) to set the Reducer count to 1; the job will
> then produce a single output file with one mean and median per distinct
> key.
> 2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code.
> It post-processes the output files produced by multiple Reducers with a
> local function and produces the final result.
>
> BR
> Yanbo
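A sketch of the second suggestion's idea (WordMedian itself is Java; the file names and helper below are assumptions for illustration): because each distinct key lands in exactly one partition, the per-key lines in the part files never overlap, so a local script only needs to read and sort them:

```python
import glob

def merge_parts(pattern="output/part-*"):
    """Merge reducer output files matching pattern.

    Each line is expected to be "key,mean,median". Keys do not repeat
    across part files because the partitioner sends every record for a
    given key to a single reducer.
    """
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    rows.append(tuple(line.split(",", 2)))
    return sorted(rows)

if __name__ == "__main__":
    for key, mean, median in merge_parts():
        print("%s,%s,%s" % (key, mean, median))
```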
>
>
> 2013/4/2 jamal sasha <[EMAIL PROTECTED]>
>
>> Pinging again.
>> Let me rephrase the question.
>> If my data is like:
>> id, value
>>
>> And I want to find the average "value" for each id, how can I do that
>> using Hadoop streaming?
>> I am sure it should be very straightforward, but apparently my
>> understanding of how code works in Hadoop streaming is not right.
>> I would really appreciate it if someone could help me with this query.
>> Thanks
>>
>>
>>
>> On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <[EMAIL PROTECTED]>wrote:
>>
>>> data_dict is declared globably as
>>> data_dict = defaultdict(list)
>>>
>>>
>>> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]>wrote:
>>>
>>>> Very dumb question..
>>>> I have data as follows:
>>>> id1, value
>>>> 1, 20.2
>>>> 1,20.4
>>>> ....
>>>>
>>>> I want to find the mean and median for each id.
>>>> I am using Python Hadoop streaming..
>>>> mapper.py
>>>> import os
>>>> import sys
>>>>
>>>> for line in sys.stdin:
>>>>     try:
>>>>         # remove the trailing newline
>>>>         line = line.rstrip(os.linesep)
>>>>         tokens = line.split(",")
>>>>         print '%s,%s' % (tokens[0], tokens[1])
>>>>     except Exception:
>>>>         continue
>>>>
>>>>
>>>> reducer.py
>>>> import os
>>>> import sys
>>>> from collections import defaultdict
>>>>
>>>> data_dict = defaultdict(list)
>>>>
>>>> def mean(data_list):
>>>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>>>
>>>> def median(mylist):
>>>>     sorts = sorted(mylist)
>>>>     length = len(sorts)
>>>>     if not length % 2:
>>>>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>>>>     return sorts[length / 2]
>>>>
>>>> for line in sys.stdin:
>>>>     try:
>>>>         line = line.rstrip(os.linesep)
>>>>         serial_id, duration = line.split(",")
>>>>         data_dict[serial_id].append(float(duration))
>>>>     except Exception:
>>>>         pass
>>>>
>>>> for k, v in data_dict.items():
>>>>     print "%s,%s,%s" % (k, mean(v), median(v))
>>>>
>>>>
>>>>
>>>> I am expecting a single mean and median per key,
>>>> but I see id1 duplicated with different means and medians..
>>>> Any suggestions?
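One streaming-idiomatic alternative (a sketch, not from the thread; the `reduce_lines` helper is a name introduced here): since Hadoop sorts each reducer's input by key, the reducer can group consecutive lines with `itertools.groupby` instead of buffering everything in a global dict, emitting exactly one mean/median line per key:

```python
import sys
from itertools import groupby

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 0:
        return (s[n // 2] + s[n // 2 - 1]) / 2.0
    return s[n // 2]

def reduce_lines(lines):
    """Yield 'key,mean,median' for each run of consecutive equal keys.

    Relies on the streaming framework's sort: all lines for a key
    arrive consecutively, so no global dict is needed.
    """
    parsed = (line.rstrip("\n").split(",", 1) for line in lines if "," in line)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield "%s,%s,%s" % (key, mean(values), median(values))

if __name__ == "__main__":
    for out in reduce_lines(sys.stdin):
        print(out)
```

This also avoids holding every key's values in memory at once, since each group is discarded after its line is emitted.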
>>>>
>>>
>>>
>>
>