Re: Naïve k-means using Hadoop
And there is also Cascading ;) : http://www.cascading.org/
But like Crunch, it is still Hadoop: both are 'only' higher-level APIs on top of MapReduce.

As for the number of reducers, you will have to do the math yourself, but
I highly doubt that more than one reducer is needed (IMHO). You can,
however, distribute the work by the center identifier, as in the sketch below.
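A minimal sketch of that routing, assuming the mapper emits the center
index as an IntWritable key and the serialized vector as a Text value
(both type choices are my assumption, not something settled in this thread):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sends every vector assigned to a given center to the same reducer,
    // so each reducer recomputes a disjoint subset of the centers.
    public class CenterPartitioner extends Partitioner<IntWritable, Text> {
      @Override
      public int getPartition(IntWritable centerId, Text vector, int numReducers) {
        return centerId.get() % numReducers; // center ids are assumed to be >= 0
      }
    }

It would be wired in with job.setPartitionerClass(CenterPartitioner.class),
but with only a handful of centers the default hash partitioner already
spreads the load about as evenly.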

Bertrand
On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:

> Thanks!
> *Bertrand*: I don't like the idea of using a single reducer. A better way
> for me is to have all the reducers write their output to the same
> directory, and then distribute all of those files (see the driver-loop
> sketch at the end of this thread).
> I know about Mahout, of course, but I want to implement it myself. I will
> look at the documentation, though.
> *Harsh*: I'd rather stick to plain Hadoop as much as I can, but thanks!
> I'll read the material you linked.
>
>
> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> If you're also a fan of doing things the better way, you can check out
>> the Apache Crunch (http://crunch.apache.org) way of doing this via
>> https://github.com/cloudera/ml (blog post:
>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>
>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[EMAIL PROTECTED]>
>> wrote:
>> > Hi,
>> > I'd like to implement k-means myself, in the following naive way.
>> > Given a large set of vectors:
>> >
>> > 1. Generate k random centers from the set.
>> > 2. Each mapper reads all the centers plus a split of the vector set,
>> > and emits, for each vector, the closest center as the key.
>> > 3. Each reducer calculates a new center and writes it out.
>> > 4. Go to step 2 until the centers no longer change.
>> >
>> > My question is very basic: how do I distribute all the new centers
>> > (produced by the reducers) to all the mappers? I can't use the
>> > distributed cache, since it is read-only. I can't use context.write,
>> > since it will create a file for each reduce task, and I need a single
>> > file. The more general issue here is: how do I distribute data
>> > produced by the reducers to all the mappers?
>> >
>> > Thanks.
>>
>>
>>
>> --
>> Harsh J
>>
>
>
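To make the answer to the quoted question concrete: the mappers can simply
re-read the previous iteration's reducer output directory from HDFS in
setup(), so neither a single output file nor the distributed cache is
needed. A minimal sketch, assuming vectors and centers are stored one per
line as comma-separated doubles, and that the driver passes the centers
directory through a configuration property (the name centers.path is my
invention):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Assigns each input vector to its nearest center. The centers are
    // re-read from HDFS at the start of every iteration, which is how the
    // reducers' output gets "distributed" back to all the mappers.
    public class NearestCenterMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

      private final List<double[]> centers = new ArrayList<double[]>();

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // centers.path points at the previous reducers' output directory.
        Path dir = new Path(conf.get("centers.path"));
        FileSystem fs = dir.getFileSystem(conf);
        for (FileStatus file : fs.listStatus(dir)) {
          String name = file.getPath().getName();
          if (name.startsWith("_") || name.startsWith(".")) {
            continue; // skip _SUCCESS, _logs and other hidden entries
          }
          BufferedReader in = new BufferedReader(
              new InputStreamReader(fs.open(file.getPath())));
          for (String line = in.readLine(); line != null; line = in.readLine()) {
            centers.add(parse(line));
          }
          in.close();
        }
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        double[] v = parse(line.toString());
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
          double d = 0;
          double[] c = centers.get(i);
          for (int j = 0; j < v.length; j++) {
            double diff = v[j] - c[j];
            d += diff * diff;
          }
          if (d < best) {
            best = d;
            nearest = i;
          }
        }
        context.write(new IntWritable(nearest), line);
      }

      private static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
          v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
      }
    }

The matching reducer would average the vectors it receives for each center
id and emit the new center as one comma-separated line under a NullWritable
key, so its part- files parse exactly like the seed centers.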
--
Bertrand Dechoux
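And for the "write everything to the same directory, then distribute the
files" plan: the directory the reducers fill in iteration i is exactly what
the mappers read in iteration i+1, so the driver only has to swap paths
between jobs. A hedged sketch along those lines, reusing the
NearestCenterMapper above (class and property names are mine, and the
convergence test is left as a comment):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KMeansDriver {

      // Averages the vectors assigned to one center and emits the new
      // center as a single comma-separated line (NullWritable key), so
      // the output files parse exactly like the seed centers.
      public static class RecomputeCenterReducer
          extends Reducer<IntWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(IntWritable centerId, Iterable<Text> vectors,
            Context context) throws IOException, InterruptedException {
          double[] sum = null;
          long n = 0;
          for (Text t : vectors) {
            String[] parts = t.toString().split(",");
            if (sum == null) {
              sum = new double[parts.length];
            }
            for (int j = 0; j < parts.length; j++) {
              sum[j] += Double.parseDouble(parts[j].trim());
            }
            n++;
          }
          StringBuilder center = new StringBuilder();
          for (int j = 0; j < sum.length; j++) {
            if (j > 0) {
              center.append(',');
            }
            center.append(sum[j] / n);
          }
          context.write(NullWritable.get(), new Text(center.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path vectors = new Path(args[0]);   // the fixed input vectors
        Path centers = new Path(args[1]);   // seeded with k random centers
        int maxIterations = 20;             // safety bound on the loop
        for (int i = 0; i < maxIterations; i++) {
          Job job = Job.getInstance(conf, "kmeans-" + i);
          // Tell the mappers where last round's centers live.
          job.getConfiguration().set("centers.path", centers.toString());
          job.setJarByClass(KMeansDriver.class);
          job.setMapperClass(NearestCenterMapper.class);
          job.setReducerClass(RecomputeCenterReducer.class);
          job.setMapOutputKeyClass(IntWritable.class);
          job.setMapOutputValueClass(Text.class);
          job.setOutputKeyClass(NullWritable.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, vectors);
          Path next = new Path(args[1] + "-iter" + (i + 1));
          FileOutputFormat.setOutputPath(job, next);
          if (!job.waitForCompletion(true)) {
            System.exit(1);
          }
          // Compare 'centers' with 'next' here and break once they stop
          // moving (an epsilon test is safer than exact equality).
          centers = next; // reducers' output dir feeds the next round
        }
      }
    }

Exact "no change" equality on doubles is brittle, which is why the loop
above also carries a fixed iteration cap.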