Naïve k-means using hadoop


Yaron Gonen 2013-03-27, 09:59
Re: Naïve k-means using hadoop
Spark would be an excellent choice for the iterative sort of k-means.

It could be good for sketch-based algorithms as well, but the difference
would be much less pronounced.
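
To make that concrete, here is a minimal sketch of iterative k-means on
Spark's Java API. It is only an illustration: the input path, k, the
iteration count, and all helper names are made up for the example, not
taken from this thread. The point is that the vectors are parsed and
cached once, and each iteration only moves the small list of centers
around.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkKMeansSketch {

  static final int K = 5;           // illustrative values
  static final int MAX_ITER = 20;

  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"));

    // Parse the vectors once and keep them in memory across iterations.
    JavaRDD<double[]> points = sc.textFile("hdfs:///kmeans/vectors")
        .map(line -> Arrays.stream(line.trim().split("\\s+"))
                           .mapToDouble(Double::parseDouble).toArray())
        .cache();

    List<double[]> centers = new ArrayList<>(points.takeSample(false, K));

    for (int iter = 0; iter < MAX_ITER; iter++) {  // or: until centers stop moving
      final List<double[]> current = centers;

      // centerId -> (sum of assigned vectors, count of assigned vectors)
      JavaPairRDD<Integer, Tuple2<double[], Long>> assigned =
          points.mapToPair(p ->
              new Tuple2<>(closest(p, current), new Tuple2<>(p, 1L)));
      Map<Integer, Tuple2<double[], Long>> stats =
          assigned.reduceByKey((a, b) -> new Tuple2<>(add(a._1, b._1), a._2 + b._2))
                  .collectAsMap();

      // Only the k new centers (mean of each cluster) come back to the driver.
      List<double[]> next = new ArrayList<>(current);
      stats.forEach((id, sumAndCount) ->
          next.set(id, scale(sumAndCount._1, 1.0 / sumAndCount._2)));
      centers = next;
    }
    sc.stop();
  }

  static int closest(double[] p, List<double[]> centers) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = 0;
      for (int j = 0; j < p.length; j++) {
        double diff = p[j] - centers.get(i)[j];
        d += diff * diff;
      }
      if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
  }

  static double[] add(double[] a, double[] b) {
    double[] r = new double[a.length];
    for (int i = 0; i < a.length; i++) r[i] = a[i] + b[i];
    return r;
  }

  static double[] scale(double[] a, double s) {
    double[] r = new double[a.length];
    for (int i = 0; i < a.length; i++) r[i] = a[i] * s;
    return r;
  }
}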

On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl <[EMAIL PROTECTED]> wrote:

> I would think that starting with the centers in some in-memory Hadoop
> platform like Spark would also be a valid approach.
> I think the Spark demo assumes that the whole data set is cached, not just
> the centers.
> C
>
> On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>
> And there is also Cascading ;) : http://www.cascading.org/
> But like Crunch, it is still Hadoop. Both are 'only' higher-level APIs for
> MapReduce.
>
> As for the number of reducers, you will have to do the math yourself, but
> I highly doubt that more than one reducer is needed (imho). But you can
> indeed distribute the work by the center identifier.
>
> Bertrand
>
>
> On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>
>> Thanks!
>> *Bertrand*: I don't like the idea of using a single reducer. A better
>> way for me is to have all the reducers write their output to the same
>> directory, and then distribute all the files.
>> I know about Mahout of course, but I want to implement it myself. I will
>> look at the documentation though.
>> *Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll
>> read the material you linked.
>>
>>
>> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> If you're a fan of doing things the better way, you can also check out
>>> some Apache Crunch (http://crunch.apache.org) ways of doing this via
>>> https://github.com/cloudera/ml (blog post:
>>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>>
>>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[EMAIL PROTECTED]>
>>> wrote:
>>> > Hi,
>>> > I'd like to implement k-means myself, in the following naive way.
>>> > Given a large set of vectors:
>>> >
>>> > 1. Generate k random centers from the set.
>>> > 2. Each mapper reads all the centers and a split of the vector set,
>>> >    and for each vector emits the closest center as the key.
>>> > 3. The reducer calculates the new center and writes it.
>>> > 4. Go to step 2 until there is no change in the centers.
>>> >
>>> > My question is very basic: how do I distribute all the new centers
>>> > (produced by the reducers) to all the mappers? I can't use the
>>> > distributed cache since it's read-only. I can't use context.write,
>>> > since it will create a file for each reduce task and I need a single
>>> > file. The more general issue here is: how do I distribute data
>>> > produced by the reducers to all the mappers?
>>> >
>>> > Thanks.
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
>
> --
> Bertrand Dechoux
>
>
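
To make the naive MapReduce formulation in the quoted question concrete:
there is no need for a single output file. Each iteration's reducers can
write their share of the new centers into one output directory, and each
of the next iteration's mappers can read every part-* file from that
directory in setup(). Below is a minimal sketch, assuming whitespace-
separated text vectors and an illustrative "kmeans.centers.dir"
configuration key; none of the class or property names come from the
thread.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  /** Assigns each input vector to its closest center; the center id is the key. */
  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final List<double[]> centers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Read every part-* file the previous iteration's reducers wrote.
      // "kmeans.centers.dir" is an illustrative key, not a Hadoop property.
      Configuration conf = context.getConfiguration();
      Path dir = new Path(conf.get("kmeans.centers.dir"));
      FileSystem fs = dir.getFileSystem(conf);
      for (FileStatus status : fs.listStatus(dir)) {
        if (!status.getPath().getName().startsWith("part-")) continue;
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(status.getPath())))) {
          String line;
          while ((line = in.readLine()) != null) {
            centers.add(parse(line.split("\t")[1]));  // line: "<id>\t<vector>"
          }
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] p = parse(value.toString());
      context.write(new IntWritable(closest(p)), value);
    }

    private int closest(double[] p) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centers.size(); i++) {
        double d = 0;
        double[] c = centers.get(i);
        for (int j = 0; j < p.length; j++) d += (p[j] - c[j]) * (p[j] - c[j]);
        if (d < bestDist) { bestDist = d; best = i; }
      }
      return best;
    }
  }

  /** Averages the vectors assigned to one center and writes the new center. */
  public static class RecomputeReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable centerId, Iterable<Text> vectors,
        Context context) throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text v : vectors) {
        double[] p = parse(v.toString());
        if (sum == null) sum = new double[p.length];
        for (int i = 0; i < p.length; i++) sum[i] += p[i];
        count++;
      }
      StringBuilder out = new StringBuilder();
      for (int i = 0; i < sum.length; i++) {
        if (i > 0) out.append(' ');
        out.append(sum[i] / count);
      }
      context.write(centerId, new Text(out.toString()));
    }
  }

  static double[] parse(String s) {
    String[] parts = s.trim().split("\\s+");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
    return v;
  }
}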
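
The "go to step 2 until there is no change in the centers" part then lives
in an ordinary driver loop: each round points the next job's centers
directory at the previous job's output, and the loop stops once the centers
stop moving. Again only a sketch, using the same illustrative names and
paths as above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path("/kmeans/vectors");         // illustrative paths
    Path centersDir = new Path("/kmeans/centers-0");  // seeded with k random
                                                      // centers in "<id>\t<vector>" format

    for (int iter = 1; iter <= 20; iter++) {
      Path nextCentersDir = new Path("/kmeans/centers-" + iter);

      Configuration conf = new Configuration();
      conf.set("kmeans.centers.dir", centersDir.toString());

      Job job = Job.getInstance(conf, "kmeans-iteration-" + iter);
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansIteration.AssignMapper.class);
      job.setReducerClass(KMeansIteration.RecomputeReducer.class);
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, nextCentersDir);

      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("iteration " + iter + " failed");
      }

      // Compare the old and new centers here (re-reading both directories,
      // as in AssignMapper.setup) and break once they stop changing.
      centersDir = nextCentersDir;
    }
  }
}

Checking convergence only touches the two small centers directories, so
doing it on the client between jobs is cheap compared with the jobs
themselves.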