MapReduce >> mail # user >> Re: best way to join?


Mirko Kämpf 2012-09-09, 09:55
Re: best way to join?
I don't mean that.

I mean that a k-means clustering with pretty large clusters is a useful
auxiliary data structure for finding nearest neighbors.  The basic outline
is that you find the nearest clusters and search those for near neighbors.
 The first riff is that you use a clever data structure for finding the
nearest clusters so that you can do that faster than linear search.  The
second riff is when you use another clever data structure to search each
cluster quickly.

There are fancier data structures available as well.
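The two-level search described above can be sketched in a few lines. This is a minimal single-machine illustration, not Mahout's API; the function and parameter names are made up, and both levels use plain linear scans (the post notes that cleverer data structures can speed up each level).

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def knn_via_clusters(query, centroids, clusters, k=10, probe=3):
    """Two-level nearest-neighbour search: probe the few nearest
    clusters, then scan only their members for the k closest points.
    clusters[i] holds the points assigned to centroids[i]."""
    # Level 1: find the `probe` closest cluster centroids.
    # (Linear scan here; a smarter index would make this sub-linear.)
    nearest = sorted(range(len(centroids)),
                     key=lambda i: dist(query, centroids[i]))[:probe]
    # Level 2: scan only the candidate clusters' members.
    candidates = [p for i in nearest for p in clusters[i]]
    return sorted(candidates, key=lambda p: dist(query, p))[:k]
```

Because only a handful of clusters is ever scanned per query, the per-query cost drops from O(n) to roughly O(n / num_clusters * probe) once the clustering is built.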

On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <[EMAIL PROTECTED]> wrote:

> Right, but if I understood your suggestion, you look at the end goal,
> which is:
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>
> for example, and you say: here we basically see a cluster, represented
> by the point [40.123,-50.432].
> Which points does this cluster contain? [[41.431,-
> 43.32],[...,...],...,[...]]
> Meaning: for every point I have in the dataset, you create a cluster.
> If you don't mean that, but instead mean to create clusters based on
> some random seed points or the like, then I'll have points (talking
> about the "end goal") that won't have enough points in their list.
>
> One of the criteria for a clustering is that for any clusters C_i and
> C_j (where i != j), C_i intersect C_j is empty.
>
> And again, how can I accomplish my task without running Mahout / a knn
> algorithm? Just by calculating distances between points? That would be
> a join of a file with itself.
>
> Thanks
>
> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>>
>>
>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> I understand your solution (I think); I didn't think of it in that
>>> particular way.
>>> I think that, say I have 1M data points, running knn with k=1M and
>>> n=10 (each point is a cluster that requires up to 10 points) is
>>> overkill.
>>>
>>
>> I am not sure I understand you.  n = number of points.  k = number of
>> clusters.  For searching 1 million points, I would recommend thousands of
>> clusters.
>>
>>
>>> How can I achieve the same result WITHOUT using Mahout, just by
>>> running over the dataset? I even think it'll have the same
>>> complexity (O(n^2)).
>>>
>>
>> Running with a good knn package will give you roughly O(n log n)
>> complexity.
>>
>>
>
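The "join of a file with itself" alternative raised in the thread amounts to a brute-force O(n^2) pairwise-distance pass. A minimal single-machine sketch of that baseline (hypothetical names, not a MapReduce job) makes the cost concrete:

```python
import math

def brute_force_knn_join(points, k=10):
    """Naive O(n^2) self-join: for every point, compute the distance to
    every other point and keep the k closest. This is the baseline the
    thread contrasts with the roughly O(n log n) clustered approach."""
    out = {}
    for i, p in enumerate(points):
        # Distance to every other point; sorted() makes this
        # O(n log n) per point, O(n^2 log n) overall.
        dists = sorted(
            (math.hypot(p[0] - q[0], p[1] - q[1]), q)
            for j, q in enumerate(points) if j != i
        )
        out[p] = [q for _, q in dists[:k]]
    return out
```

For 1M points this means on the order of 10^12 distance computations, which is why the cluster-based (or any good knn-index) approach is preferred at that scale.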
Other messages in this thread:
dexter morgan 2012-08-30, 09:21
Ted Dunning 2012-08-30, 20:05
dexter morgan 2012-08-31, 13:03
dexter morgan 2012-09-02, 16:26
Ted Dunning 2012-09-03, 19:47
dexter morgan 2012-08-27, 20:15
Björn-Elmar Macek 2012-09-04, 08:17
Ted Dunning 2012-08-27, 21:52
dexter morgan 2012-08-28, 13:48