Hadoop >> mail # user >> Re: Finding mean and median python streaming


Re: Finding mean and median python streaming
Because each Reducer computes only the local mean and median of the data
that are shuffled to it.
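
A quick numeric illustration of that point (not Hadoop code — just an assumed toy split of one key's values across two reducer partitions):

```python
# All values for a single id, and the split that two hypothetical
# reducers might each receive after the shuffle.
values = [20.2, 20.4, 30.0, 41.6]
part_a, part_b = values[:2], values[2:]

def mean(xs):
    return sum(xs) / float(len(xs))

local_means = (mean(part_a), mean(part_b))  # two different "local" answers
global_mean = mean(values)                  # the one answer a single reducer sees
# The local means differ from each other and from the global mean, which
# is why multiple part-* files each carry their own mean and median.
```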
2013/4/5 jamal sasha <[EMAIL PROTECTED]>

> Hi,
>   I have a question.
> Why do I need to set the number of reducers to 1?
> Since all the keys are sorted, shouldn't most of the keys go to the same reducer?
>
>
>
> On Tue, Apr 2, 2013 at 2:14 AM, Yanbo Liang <[EMAIL PROTECTED]> wrote:
>
>> How many Reducers did you start for this job?
>> If you start many Reducers, the job produces multiple output files
>> named part-*****, and each part contains only the local mean and median
>> values for that Reducer's partition.
>>
>> Two kinds of solutions:
>> 1. Call setNumReduceTasks(1) to set the Reducer count to 1; the job
>> then produces a single output file, with exactly one mean and median
>> per distinct key.
>> 2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code:
>> it post-processes the output files produced by multiple Reducers with a
>> local function to compute the final result.
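
A hedged Python sketch of the second approach, assuming the job's Reducers emit the raw `id,value` pairs unchanged (an identity pass), so a local script can read every part-***** file and compute the exact global statistics (`merge_parts` is a hypothetical helper name):

```python
from collections import defaultdict

def merge_parts(lines):
    """Aggregate raw "id,value" lines gathered from all part-* files
    and compute one exact mean and median per id."""
    data = defaultdict(list)
    for line in lines:
        key, value = line.strip().split(",")
        data[key].append(float(value))
    result = {}
    for key, values in data.items():
        s = sorted(values)
        n = len(s)
        # For odd n both indices are the middle element; for even n
        # this averages the two middle elements.
        med = (s[n // 2] + s[(n - 1) // 2]) / 2.0
        result[key] = (sum(s) / n, med)
    return result

# e.g. with lines read from every part-***** file in the output directory:
stats = merge_parts(["1,20.2", "1,20.4", "2,5.0"])
```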
>>
>> BR
>>  Yanbo
>>
>>
>> 2013/4/2 jamal sasha <[EMAIL PROTECTED]>
>>
>>> pinging again.
>>> Let me rephrase the question.
>>> If my data is like:
>>> id, value
>>>
>>> And I want to find the average "value" for each id; how can I do that
>>> using Hadoop streaming?
>>> I am sure it should be very straightforward, but apparently my
>>> understanding of how code works in Hadoop streaming is not right.
>>> I would really appreciate it if someone could help me with this query.
>>> Thanks
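
For reference, one common shape of a streaming reducer for exactly this task — a sketch under the assumption that the mapper emits tab-separated `id<TAB>value` lines (streaming treats the text before the first tab as the key, so all lines for one id reach the same reducer, already sorted by id):

```python
import sys

def averages(stream):
    """Yield (id, mean) for each run of consecutive equal keys,
    detecting key changes manually as lines stream in."""
    current_id, total, count = None, 0.0, 0
    for line in stream:
        key, value = line.strip().split("\t")
        if key != current_id and current_id is not None:
            yield current_id, total / count
            total, count = 0.0, 0
        current_id = key
        total += float(value)
        count += 1
    if current_id is not None:
        yield current_id, total / count  # flush the last key

if __name__ == "__main__":
    for key, avg in averages(sys.stdin):
        print("%s\t%s" % (key, avg))
```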
>>>
>>>
>>>
>>> On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>>
>>>> data_dict is declared globally as
>>>> data_dict = defaultdict(list)
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Very dumb question..
>>>>> I have data as following
>>>>> id1, value
>>>>> 1, 20.2
>>>>> 1,20.4
>>>>> ....
>>>>>
>>>>> I want to find the mean and median for each id.
>>>>> I am using python hadoop streaming..
>>>>> mapper.py
>>>>> import sys
>>>>> import os
>>>>>
>>>>> for line in sys.stdin:
>>>>>     try:
>>>>>         # remove leading and trailing whitespace
>>>>>         line = line.rstrip(os.linesep)
>>>>>         tokens = line.split(",")
>>>>>         print '%s,%s' % (tokens[0], tokens[1])
>>>>>     except Exception:
>>>>>         continue
>>>>>
>>>>>
>>>>> reducer.py
>>>>> import sys
>>>>> import os
>>>>> from collections import defaultdict
>>>>>
>>>>> data_dict = defaultdict(list)
>>>>>
>>>>> def mean(data_list):
>>>>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>>>>
>>>>> def median(mylist):
>>>>>     sorts = sorted(mylist)
>>>>>     length = len(sorts)
>>>>>     if not length % 2:
>>>>>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>>>>>     return sorts[length / 2]
>>>>>
>>>>> for line in sys.stdin:
>>>>>     try:
>>>>>         line = line.rstrip(os.linesep)
>>>>>         serial_id, duration = line.split(",")
>>>>>         data_dict[serial_id].append(float(duration))
>>>>>     except Exception:
>>>>>         pass
>>>>>
>>>>> for k, v in data_dict.items():
>>>>>     print "%s,%s,%s" % (k, mean(v), median(v))
>>>>>
>>>>>
>>>>>
>>>>> I am expecting a single mean and median for each key,
>>>>> but I see id1 duplicated with different means and medians.
>>>>> Any suggestions?
>>>>>
>>>>
>>>>
>>>
>>
>
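
A likely contributor to the duplicated ids, beyond the reducer count: Hadoop streaming splits each mapper output line into key and value at the first tab character. A line like `1,20.2` contains no tab, so the entire line becomes the key, and equal ids with different values become different keys that can be partitioned to different reducers. A hedged sketch of the corrected pair (function names are illustrative; each body would read `sys.stdin` and print in the real scripts):

```python
from itertools import groupby

def map_lines(stream):
    """mapper.py body: turn "id,value" into "id<TAB>value" so that
    streaming partitions and sorts on the id alone."""
    for line in stream:
        tokens = line.strip().split(",")
        if len(tokens) == 2:
            yield "%s\t%s" % (tokens[0], tokens[1])

def reduce_lines(stream):
    """reducer.py body: lines arrive sorted by key, so each id forms one
    consecutive group; emit "id,mean,median" once per id."""
    pairs = (line.strip().split("\t") for line in stream)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        s = sorted(float(v) for _, v in group)
        n = len(s)
        median = (s[n // 2] + s[(n - 1) // 2]) / 2.0
        yield "%s,%s,%s" % (key, sum(s) / n, median)

# Wiring inside the real jobs, e.g.:
#   import sys
#   for out in map_lines(sys.stdin):
#       print(out)
```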