Re: Naïve k-means using Hadoop
Of course, you should check out Mahout, or at least its documentation, even
if you really want to implement it yourself.
https://cwiki.apache.org/MAHOUT/k-means-clustering.html

Regards

Bertrand

On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:

> Actually, for the first step, the client could create a file with the
> centers, put it on HDFS, and use it with the distributed cache.
> A single reducer might be enough, and in that case its only responsibility
> is to create the file with the updated centers.
> You can then use this new file in the distributed cache instead of the
> first one.
>
> Your real input will always be your set of points.
>
> Regards
>
> Bertrand
>
> PS: One reducer should be enough because it only needs to aggregate the
> partial updates from each mapper. The volume of data sent to the reducer
> scales with the number of centers, not with the number of points.
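
To make that loop concrete, below is a minimal, untested driver sketch in the
Hadoop 1.x style (org.apache.hadoop.mapreduce API). The paths, the
one-vector-per-line file format, and the unchanged() convergence helper are
illustrative assumptions, not anything prescribed in this thread; the
KMeansMapper and KMeansReducer classes are sketched further down.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path points = new Path("/kmeans/points");      // fixed input: the set of vectors
    Path centers = new Path("/kmeans/centers-0");  // step 1: k random centers, written by the client

    for (int i = 1; i <= 50; i++) {                // hard cap, in case convergence is slow
      Path out = new Path("/kmeans/iter-" + i);
      Job job = new Job(conf, "k-means iteration " + i);  // Job.getInstance(conf) in later versions
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansMapper.class);
      job.setReducerClass(KMeansReducer.class);
      job.setNumReduceTasks(1);                    // one reducer -> one updated centers file
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      // distribute the current centers, read-only, to every mapper
      DistributedCache.addCacheFile(new URI(centers.toString()), job.getConfiguration());
      FileInputFormat.addInputPath(job, points);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("iteration " + i + " failed");
      }

      Path newCenters = new Path(out, "part-r-00000");
      if (unchanged(fs, centers, newCenters)) {    // step 4: stop when centers stop moving
        break;
      }
      centers = newCenters;                        // feed the reducer output back in
    }
  }

  // Crude convergence test: the centers file stopped changing byte-for-byte.
  // A real job would compare center coordinates against a tolerance instead.
  static boolean unchanged(FileSystem fs, Path a, Path b) throws IOException {
    if (fs.getFileStatus(a).getLen() != fs.getFileStatus(b).getLen()) {
      return false;
    }
    FSDataInputStream ia = fs.open(a);
    FSDataInputStream ib = fs.open(b);
    try {
      int x, y;
      do { x = ia.read(); y = ib.read(); } while (x == y && x != -1);
      return x == y;
    } finally {
      ia.close();
      ib.close();
    }
  }
}
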
>
>
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>> I'd like to implement k-means myself, in the following naive way.
>> Given a large set of vectors:
>>
>>    1. Generate k random centers from the set.
>>    2. Each mapper reads all the centers and a split of the vector set,
>>    and emits each vector keyed by its closest center.
>>    3. The reducer calculates the new centers and writes them out.
>>    4. Go to step 2 until the centers no longer change (steps 2 and 3 are
>>    sketched below).
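
A minimal, untested sketch of how steps 2 and 3 might look, under two
assumptions of my own (not from the thread): the input is one comma-separated
vector per line, and the centers file uses the same format and is distributed
via the cache as in Bertrand's reply above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private final List<double[]> centers = new ArrayList<double[]>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Step 2a: load the current centers from the distributed cache
    // (assumes the centers file is the only cached file).
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      centers.add(parse(line));
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Step 2b: emit each vector keyed by the index of its closest center.
    double[] v = parse(value.toString());
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = squaredDistance(v, centers.get(i));
      if (d < bestDist) { bestDist = d; best = i; }
    }
    ctx.write(new IntWritable(best), value);
  }

  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i].trim());
    }
    return v;
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }
}

// One file per class in practice. The NullWritable key makes TextOutputFormat
// write plain vector lines, so the output can be cached as-is next iteration.
class KMeansReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
  @Override
  protected void reduce(IntWritable centerId, Iterable<Text> vectors, Context ctx)
      throws IOException, InterruptedException {
    // Step 3: the new center is the component-wise mean of its vectors.
    double[] sum = null;
    long count = 0;
    for (Text t : vectors) {
      double[] v = KMeansMapper.parse(t.toString());
      if (sum == null) sum = new double[v.length];
      for (int i = 0; i < v.length; i++) sum[i] += v[i];
      count++;
    }
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(sum[i] / count);
    }
    ctx.write(NullWritable.get(), new Text(sb.toString()));
  }
}

One quirk of this naive scheme: a center that attracts no points produces no
output line, so k can shrink between iterations; a real implementation has to
decide what to do then (e.g., keep the old center or re-seed it).
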
>>
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use the distributed
>> cache, since it is read-only. I can't use context.write, since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is: how do I distribute data produced by a reducer to
>> all the mappers?
>>
>> Thanks.
>>
>
>
--
Bertrand Dechoux