And there is also Cascading ;) : http://www.cascading.org/
But like Crunch, this is still Hadoop. Both are 'only' higher-level APIs for MapReduce.
As for the number of reducers, you will have to do the math yourself, but
I highly doubt that more than one reducer is needed (imho). But you can
indeed distribute the work across reducers by partitioning on the center identifier.
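To make the "distribute by center identifier" idea concrete, here is a rough in-memory sketch in plain Python. It is not Hadoop code: the function names, the `cid % num_reducers` partitioning rule, and the dict standing in for the merged centers file are all assumptions made for illustration. The point is only that each reducer can own a subset of the centers, and the driver merges every reducer's output into one table before the next iteration.

```python
from collections import defaultdict

def closest_center(vector, centers):
    """Return the id of the center nearest to `vector` (squared Euclidean distance)."""
    def sq_dist(cid):
        return sum((a - b) ** 2 for a, b in zip(vector, centers[cid]))
    return min(centers, key=sq_dist)

def map_split(split, centers):
    """Mapper: emit (center_id, vector) for every vector in one input split."""
    for vector in split:
        yield closest_center(vector, centers), vector

def reduce_center(vectors):
    """Reducer: the new center is the component-wise mean of its vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def run_iteration(splits, centers, num_reducers):
    # Shuffle: partition by center id, so each reducer owns a subset of centers.
    # (Hypothetical partitioning rule; assumes integer center ids.)
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for cid, vector in map_split(split, centers):
            partitions[cid % num_reducers][cid].append(vector)
    # Each reducer produces its own output; the driver merges all of them
    # into one table that every mapper reads in the next iteration.
    new_centers = dict(centers)  # a center with no vectors keeps its old value
    for part in partitions:
        for cid, vectors in part.items():
            new_centers[cid] = reduce_center(vectors)
    return new_centers

def kmeans(splits, centers, num_reducers=2, max_iter=100):
    """Driver loop: repeat until the centers stop changing."""
    for _ in range(max_iter):
        new_centers = run_iteration(splits, centers, num_reducers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

In a real job the merged table would be a set of files written by the reducers that the driver hands to the next job, and the exact-equality stop test would probably become a tolerance check.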
On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
> *Bertrand*: I don't like the idea of using a single reducer. A better way
> for me is to write all the output of all the reducers to the same
> directory, and then distribute all the files.
> I know about Mahout of course, but I want to implement it myself. I will
> look at the documentation though.
> *Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll read
> the stuff you linked.
> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> If you're also a fan of doing things the better way, you can also
>> check out some Apache Crunch (http://crunch.apache.org) ways of doing
>> this via https://github.com/cloudera/ml (blog post:
>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> > I'd like to implement k-means by myself, in the following naive way:
>> > Given a large set of vectors:
>> > 1. Generate k random centers from the set.
>> > 2. Mapper reads all the centers and a split of the vector set, and emits
>> >    for each vector the closest center as the key.
>> > 3. Reducer calculates the new center and writes it.
>> > 4. Go to step 2 until there is no change in the centers.
>> > My question is very basic: how do I distribute all the new centers
>> > (produced by the reducers) to all the mappers? I can't use the distributed
>> > cache since it's read-only. I can't use context.write since it will create
>> > a file for each reduce task, and I need a single file. The more general
>> > issue here is how to distribute data produced by the reducers to all the
>> > mappers.
>> > Thanks.
>> Harsh J