Re: Naïve k-means using hadoop
And, of course, due credit should be given here.  The advanced clustering
algorithms in Crunch were lifted from the new stuff in Mahout pretty much
step for step.

The Mahout group would have loved to have contributions from the Cloudera
guys instead of re-implementation, but you can't legislate taste.

On Wed, Mar 27, 2013 at 1:46 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> If you're also a fan of doing things the better way, you can check out
> the Apache Crunch (http://crunch.apache.org) way of doing this via
> https://github.com/cloudera/ml (blog post:
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I'd like to implement k-means by myself, in the following naive way:
> > Given a large set of vectors:
> >
> > 1. Generate k random centers from the set.
> > 2. Mapper reads all the centers and a split of the vector set, and emits
> >    for each vector the closest center as a key.
> > 3. Reducer calculates the new center and writes it.
> > 4. Go to step 2 until there is no change in the centers.
> >
> > My question is very basic: how do I distribute all the new centers
> > (produced by the reducers) to all the mappers? I can't use the
> > distributed cache since it's read-only. I can't use context.write since
> > it will create a file for each reduce task, and I need a single file.
> > The more general issue here is how to distribute data produced by the
> > reducers to all the mappers?
> >
> > Thanks.
> --
> Harsh J
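
The naive scheme in the quoted question can be sketched as a driver loop in which every iteration is a separate job: the driver merges whatever the reducers wrote and hands the merged centers to the next job's mappers. The sketch below is plain Python with no Hadoop APIs, so it only illustrates the data flow under that assumption. All function names (`random_centers`, `map_phase`, `reduce_phase`, `run_iteration`, `kmeans`) are hypothetical, and in-memory lists stand in for input splits and reducer output files.

```python
import math
import random
from collections import defaultdict


def random_centers(splits, k, seed=0):
    """Step 1: pick k distinct vectors from the whole set as initial centers."""
    all_vectors = [v for split in splits for v in split]
    return random.Random(seed).sample(all_vectors, k)


def closest(centers, v):
    """Index of the center nearest to v (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], v)))


def map_phase(centers, split):
    """Step 2 (mapper): emit (closest center index, vector) for one split."""
    for v in split:
        yield closest(centers, v), v


def reduce_phase(key, vectors):
    """Step 3 (reducer): the new center is the mean of the assigned vectors."""
    n = len(vectors)
    return key, tuple(sum(col) / n for col in zip(*vectors))


def run_iteration(centers, splits):
    """One job: map every split, group by key, reduce, merge the results."""
    groups = defaultdict(list)
    for split in splits:
        for key, v in map_phase(centers, split):
            groups[key].append(v)
    # Each reduce_phase call stands in for one reducer's output; the driver
    # merges them into a single list of centers for the next job.
    new_centers = list(centers)  # a center with no vectors keeps its position
    for key, vectors in groups.items():
        _, new_centers[key] = reduce_phase(key, vectors)
    return new_centers


def kmeans(splits, centers, max_iter=100, tol=1e-9):
    """Step 4 (driver): rerun the job until no center moves more than tol."""
    for _ in range(max_iter):
        new_centers = run_iteration(centers, splits)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

In a real cluster, one common pattern for the merge step is for the driver to read every reducer output file from the previous job's output directory and make the combined centers available to the next job, for example by adding that file to the next job's distributed cache at submission time. The cache is read-only only from the point of view of the tasks inside a single job.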