Re: Naïve k-means using hadoop
Of course, you should check out Mahout, at least the documentation, even if
you really want to implement it by yourself.



On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:

> Actually, for the first step, the client could create a file with the
> centers, put it on HDFS and use it with the distributed cache.
> A single reducer might be enough, and in that case its only responsibility
> is to create the file with the updated centers.
> You can then use this new file in the distributed cache instead of
> the first one.
> Your real input will always be your set of points.
> Regards
> Bertrand
> PS: One reducer should be enough because it only needs to aggregate the
> partial update from each mapper. The volume of data sent to the reducer
> depends on the number of centers, not on the number of points.
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>> Hi,
>> I'd like to implement k-means by myself, in the following naive way:
>> Given a large set of vectors:
>>    1. Generate k random centers from the set.
>>    2. Mapper reads all the centers and a split of the vector set, and
>>    emits, for each vector, the closest center as the key.
>>    3. Reducer calculates the new center and writes it.
>>    4. Go to step 2 until the centers no longer change.
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use the distributed
>> cache since it's read-only. I can't use context.write since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is: how do I distribute data produced by the reducers
>> to all the mappers?
>> Thanks.
Bertrand Dechoux
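
For illustration, the scheme described in the quoted reply can be sketched in plain Python, without Hadoop: each mapper emits one partial sum and count per center, a single reducer merges them into the next set of centers, and the driver repeats until the centers stop moving. This is only a local simulation under stated assumptions. The names `closest`, `map_split`, `reduce_partials` and `kmeans` are hypothetical, the distance is assumed to be squared Euclidean, a center with no assigned points is assumed to keep its old position, and the list passed between iterations stands in for the centers file that would go into the distributed cache.

```python
def closest(point, centers):
    """Index of the center nearest to point (assumed: squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_split(split, centers):
    """Mapper: one partial (sum vector, count) per center seen in this split."""
    partial = {}
    for point in split:
        i = closest(point, centers)
        total, count = partial.get(i, ([0.0] * len(point), 0))
        partial[i] = ([a + b for a, b in zip(total, point)], count + 1)
    return partial


def reduce_partials(partials, centers):
    """Single reducer: merge the partials and return the updated centers."""
    updated = []
    for i, old in enumerate(centers):
        total, count = [0.0] * len(old), 0
        for partial in partials:
            if i in partial:
                s, n = partial[i]
                total = [a + b for a, b in zip(total, s)]
                count += n
        # Assumption: an empty cluster keeps its previous center.
        updated.append([t / count for t in total] if count else list(old))
    return updated


def kmeans(splits, centers, max_iter=100, tol=1e-9):
    """Driver: rerun the map and reduce steps until no center moves more than tol."""
    for _ in range(max_iter):
        partials = [map_split(split, centers) for split in splits]
        updated = reduce_partials(partials, centers)
        moved = max(
            sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
            for old, new in zip(centers, updated)
        )
        centers = updated  # in Hadoop: the new centers file for the next job
        if moved <= tol:
            break
    return centers


if __name__ == "__main__":
    splits = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
    print(kmeans(splits, [[0.0, 0.0], [10.0, 10.0]]))
```

Because each mapper sends at most k (sum, count) pairs, the reducer's input grows with the number of centers and mappers, not with the number of points, which is why one reducer can be enough.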