Of course, you should check out Mahout, at least the documentation, even if
you really want to implement it by yourself.
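
For what it's worth, the scheme quoted below (mappers read the current centers, a single reducer writes the updated ones, repeat until nothing changes) can be sketched in plain Python without any Hadoop API. This is only an illustration under some assumptions: the function names (`closest`, `map_split`, `reduce_partials`, `kmeans`) are made up for the example, squared Euclidean distance is assumed, the `centers` list stands in for the file shipped through the distributed cache, and each mapper combines its own points into one partial (sum, count) per center before emitting.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Partial = Tuple[List[float], int]  # (component-wise sum, point count)


def closest(point: Vector, centers: List[Vector]) -> int:
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_split(split: List[Vector], centers: List[Vector]) -> Dict[int, Partial]:
    """Mapper with local combining: at most one partial per center."""
    partials: Dict[int, Partial] = {}
    for point in split:
        idx = closest(point, centers)
        sums, count = partials.get(idx, ([0.0] * len(point), 0))
        partials[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partials


def reduce_partials(
    all_partials: List[Dict[int, Partial]], centers: List[Vector]
) -> List[Vector]:
    """Single reducer: merge the mappers' partials into updated centers."""
    merged: Dict[int, Partial] = {}
    for partials in all_partials:
        for idx, (sums, count) in partials.items():
            old_sums, old_count = merged.get(idx, ([0.0] * len(sums), 0))
            merged[idx] = (
                [a + b for a, b in zip(old_sums, sums)],
                old_count + count,
            )
    new_centers = list(centers)  # a center with no points keeps its old value
    for idx, (sums, count) in merged.items():
        new_centers[idx] = tuple(s / count for s in sums)
    return new_centers


def kmeans(
    splits: List[List[Vector]], centers: List[Vector], max_iter: int = 100
) -> List[Vector]:
    """Driver loop: one map/reduce pass per iteration, stop when stable."""
    for _ in range(max_iter):
        new_centers = reduce_partials(
            [map_split(split, centers) for split in splits], centers
        )
        if new_centers == centers:  # exact equality, i.e. "no change"
            break
        centers = new_centers  # plays the role of the new centers file
    return centers


splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
result = kmeans(splits, [(0.0, 0.0), (10.0, 0.0)])
```

Each call to `map_split` returns at most k entries, whatever the size of its split, which is why a single reducer can keep up. On a real cluster each pass of the loop would presumably be a separate job, with the driver swapping in the newly written centers file between passes.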
On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> Actually, for the first step, the client could create a file with the
> centers, put it on HDFS, and use it with the distributed cache.
> A single reducer might be enough, and in that case its only responsibility
> is to create the file with the updated centers.
> You can then use this new file in the distributed cache for the next
> iteration, instead of the first one.
> Your real input will always be your set of points.
> PS: One reducer should be enough because it only needs to aggregate the
> partial updates from each mapper. The volume of data sent to the reducer
> will change according to the number of centers but not the number of points.
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>> I'd like to implement k-means by myself, in the following naive way:
>> Given a large set of vectors:
>> 1. Generate k random centers from the set.
>> 2. Mapper reads all centers and a split of the vector set, and emits
>> for each vector the closest center as a key.
>> 3. Reducer calculates the new center and writes it.
>> 4. Go to step 2 until the centers no longer change.
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use distributed
>> cache since it's read-only. I can't use context.write since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is how to distribute data produced by reducer to all the