HDFS >> mail # user >> Re: Naïve k-means using hadoop


Re: Naïve k-means using hadoop
Actually for the first step, the client could create a file with the
centers and then put it on hdfs and use it with distributed cache.
A single reducer might be enough, and in that case its only responsibility is
to create the file with the updated centers.
You can then use this new file again in the distributed cache instead of
the first.

Your real input will always be your set of points.
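
The loop described above can be sketched as follows. This is only a local
simulation in plain Python, not Hadoop code: local JSON files stand in for the
centers file on HDFS and in the distributed cache, and `run_job`,
`kmeans_driver`, the file names and the `tol` parameter are all hypothetical
names chosen for illustration.

```python
import json
import os
import tempfile


def write_centers(path, centers):
    # Stand-in for the client (or the single reducer) writing the centers file.
    with open(path, "w") as f:
        json.dump([list(c) for c in centers], f)


def read_centers(path):
    # Stand-in for a mapper loading the centers file from the cache.
    with open(path) as f:
        return [tuple(c) for c in json.load(f)]


def run_job(points, centers_path, out_path):
    # Stand-in for one full MapReduce pass: the input is always the points,
    # the centers are read from a side file, the new centers go to a new file.
    centers = read_centers(centers_path)
    sums = [[0.0] * len(c) for c in centers]
    counts = [0] * len(centers)
    for p in points:
        idx = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
        )
        counts[idx] += 1
        for d, x in enumerate(p):
            sums[idx][d] += x
    new_centers = []
    for i, c in enumerate(centers):
        if counts[i]:
            new_centers.append([s / counts[i] for s in sums[i]])
        else:
            new_centers.append(list(c))  # empty cluster: keep the old center
    write_centers(out_path, new_centers)


def kmeans_driver(points, initial_centers, workdir, max_iter=20, tol=1e-9):
    # Client-side loop: write the first centers file, then feed each job's
    # output file to the next job until the centers stop moving.
    current = os.path.join(workdir, "centers-0.json")
    write_centers(current, initial_centers)
    for i in range(1, max_iter + 1):
        nxt = os.path.join(workdir, "centers-%d.json" % i)
        run_job(points, current, nxt)
        old, new = read_centers(current), read_centers(nxt)
        current = nxt
        if all(abs(a - b) <= tol for o, n in zip(old, new) for a, b in zip(o, n)):
            break
    return read_centers(current)
```

The cache is never modified here: each iteration produces a new file, and the
next job is simply pointed at that file instead of the previous one.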

Regards

Bertrand

PS: One reducer should be enough because it only needs to aggregate the
partial updates from each mapper. The volume of data sent to the reducer will
change according to the number of centers but not the number of points.
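
To illustrate the PS, here is a minimal sketch in plain Python (again a
simulation, not Hadoop code). It assumes each mapper sums the points assigned
to each center locally and emits one (sum, count) pair per center; the names
`map_split` and `reduce_partials` are hypothetical.

```python
def nearest(point, centers):
    # Index of the center closest to the point (squared Euclidean distance).
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def map_split(split, centers):
    # One mapper: aggregate its split locally and return at most one
    # (vector_sum, count) pair per center, however many points it read.
    partial = {}
    for p in split:
        i = nearest(p, centers)
        s, n = partial.get(i, ([0.0] * len(p), 0))
        partial[i] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_partials(partials, centers):
    # The single reducer: merge the partial sums and compute the new centers.
    totals = {}
    for partial in partials:
        for i, (s, n) in partial.items():
            ts, tn = totals.get(i, ([0.0] * len(s), 0))
            totals[i] = ([a + b for a, b in zip(ts, s)], tn + n)
    new_centers = []
    for i, c in enumerate(centers):
        if i in totals:
            s, n = totals[i]
            new_centers.append(tuple(x / n for x in s))
        else:
            new_centers.append(tuple(c))  # no points assigned: keep old center
    return new_centers


centers = [(0, 0), (10, 10)]
splits = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
partials = [map_split(split, centers) for split in splits]
new_centers = reduce_partials(partials, centers)
```

With this scheme the reducer receives at most k pairs from each mapper, so
adding more points to a split does not increase what is sent to it.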
On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <[EMAIL PROTECTED]> wrote:

> Hi,
> I'd like to implement k-means by myself, in the following naive way:
> Given a large set of vectors:
>
>    1. Generate k random centers from the set.
>    2. Mapper reads all centers and a split of the vector set and emits
>    for each vector the closest center as a key.
>    3. Reducer calculates the new center and writes it.
>    4. Go to step 2 until the centers no longer change.
>
> My question is very basic: how do I distribute all the new centers
> (produced by the reducers) to all the mappers? I can't use distributed
> cache since it's read-only. I can't use context.write since it will
> create a file for each reduce task, and I need a single file. The more
> general issue here is how to distribute data produced by reducer to all the
> mappers?
>
> Thanks.
>