MapReduce >> mail # user >> Finding mean and median python streaming


Re: Finding mean and median python streaming
Hi,
  I have a question.
Why do I need to set the number of reducers to 1?
Since all the keys are sorted, shouldn't all values for the same key end up on the same reducer?
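To make the partitioning concrete, here is a small sketch (not Hadoop's actual code; `simple_hash` is a deterministic stand-in for Java's `String.hashCode()`) of how the default HashPartitioner routes keys:

```python
def simple_hash(key):
    # deterministic stand-in for Java's String.hashCode(),
    # masked to keep the result non-negative
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0x7fffffff
    return h

def partition(key, num_reducers):
    # the default HashPartitioner sends a key to reducer
    # hash(key) mod numReduceTasks: every value for one key reaches
    # the same reducer, but different keys are spread across reducers,
    # and each reducer writes its own part-* output file
    return simple_hash(key) % num_reducers

for k in ["id1", "id2", "id3"]:
    print("%s -> reducer %d" % (k, partition(k, 4)))
```

So with many reducers each key still gets exactly one mean/median line, just spread over several part-* files; forcing a single reducer only serves to get one combined output file.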

On Tue, Apr 2, 2013 at 2:14 AM, Yanbo Liang <[EMAIL PROTECTED]> wrote:

> How many Reducers did you start for this job?
> If you start many Reducers for this job, it will produce multiple
> output files named part-*****.
> And each part holds only the local mean and median values for that
> specific Reducer's partition.
>
> Two kinds of solutions:
> 1. Call setNumReduceTasks(1) to set the Reducer number to 1; the job
> will then produce only one output file, and each distinct key will
> get only one mean and median value.
> 2. See org.apache.hadoop.examples.WordMedian in the Hadoop source
> code. It processes all the output files produced by the multiple
> Reducers with a local function, and that produces the final result.
>
> BR
> Yanbo
>
>
> 2013/4/2 jamal sasha <[EMAIL PROTECTED]>
>
>> Pinging again.
>> Let me rephrase the question.
>> If my data is like:
>> id, value
>>
>> And I want to find the average "value" for each id, how can I do that
>> using hadoop streaming?
>> I am sure it should be very straightforward, but apparently my
>> understanding of how code works in hadoop streaming is not right.
>> I would really appreciate it if someone could help me with this query.
>> Thanks
>>
>>
>>
>> On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>
>>> data_dict is declared globally as
>>> data_dict = defaultdict(list)
>>>
>>>
>>> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>>
>>>> Very dumb question..
>>>> I have data like the following:
>>>> id1, value
>>>> 1, 20.2
>>>> 1,20.4
>>>> ....
>>>>
>>>> I want to find the mean and median of the values for id1.
>>>> I am using python hadoop streaming..
>>>> mapper.py
>>>> import os
>>>> import sys
>>>>
>>>> for line in sys.stdin:
>>>>     try:
>>>>         # remove leading and trailing whitespace
>>>>         line = line.rstrip(os.linesep)
>>>>         tokens = line.split(",")
>>>>         # emit the id and the value
>>>>         print '%s,%s' % (tokens[0], tokens[1])
>>>>     except Exception:
>>>>         continue
>>>>
>>>>
>>>> reducer.py
>>>> import os
>>>> import sys
>>>> from collections import defaultdict
>>>>
>>>> # data_dict is declared globally
>>>> data_dict = defaultdict(list)
>>>>
>>>> def mean(data_list):
>>>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>>>
>>>> def median(mylist):
>>>>     sorts = sorted(mylist)
>>>>     length = len(sorts)
>>>>     if not length % 2:
>>>>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>>>>     return sorts[length / 2]
>>>>
>>>> for line in sys.stdin:
>>>>     try:
>>>>         line = line.rstrip(os.linesep)
>>>>         serial_id, duration = line.split(",")
>>>>         data_dict[serial_id].append(float(duration))
>>>>     except Exception:
>>>>         pass
>>>>
>>>> for k, v in data_dict.items():
>>>>     print "%s,%s,%s" % (k, mean(v), median(v))
>>>>
>>>>
>>>>
>>>> I am expecting a single mean and median for each key,
>>>> but I see id1 duplicated with different mean and median values..
>>>> Any suggestions?
>>>>
>>>
>>>
>>
>
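A streaming-idiomatic alternative (a sketch of my own, not code from the thread or from WordMedian; `parse` and `reduce_stream` are hypothetical names): if the mapper emits `id<TAB>value` rather than `id,value` — the tab is streaming's default key/value separator, so the framework then partitions and sorts on the id instead of on the whole comma-separated line — each reducer receives its keys grouped, and `itertools.groupby` can emit exactly one mean/median line per key without a global dict:

```python
import sys
from itertools import groupby

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 0:
        return (s[n // 2] + s[n // 2 - 1]) / 2.0
    return s[n // 2]

def parse(stream):
    # streaming hands each reducer its lines already sorted by key;
    # the default key/value separator is a tab
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, float(value)

def reduce_stream(stream, out=sys.stdout):
    # consecutive lines with the same key form one group, so only one
    # key's values are held in memory at a time
    for key, group in groupby(parse(stream), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        out.write("%s,%s,%s\n" % (key, mean(values), median(values)))

if __name__ == "__main__":
    reduce_stream(sys.stdin)
```

This works with any number of reducers: each key appears in exactly one part-* file, and merging the parts is just concatenation.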