MapReduce user mailing list: Re: best way to join?


Collapsed messages in this thread: Mirko Kämpf 2012-09-09, 09:55; Ted Dunning 2012-08-28, 20:07; dexter morgan 2012-08-30, 09:21

Re: best way to join?
I don't know off-hand.  I don't understand the importance of your
constraint either.

On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <[EMAIL PROTECTED]> wrote:

> Ok, but as I said before, how do I achieve the same result without
> clustering, just linearly? A join of the data-set with itself, basically,
> calculating the distances as I go.
>
> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> I don't mean that.
>>
>> I mean that a k-means clustering with pretty large clusters is a useful
>> auxiliary data structure for finding nearest neighbors.  The basic outline
>> is that you find the nearest clusters and search those for near neighbors.
>>  The first riff is that you use a clever data structure for finding the
>> nearest clusters so that you can do that faster than linear search.  The
>> second riff is that you use another clever data structure to search each
>> cluster quickly.
>>
>> There are fancier data structures available as well.
>>
>>
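
A minimal sketch of the approach described above, assuming plain numpy; the cluster count, the probe parameter and all function names are illustrative, not from the thread. Cluster the points with k-means, then answer each nearest-neighbour query by scanning only the few clusters whose centers are closest to the query:

import numpy as np

def kmeans(points, num_clusters, iters=20, seed=0):
    """Very small Lloyd's-algorithm k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), num_clusters, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its members
        for c in range(num_clusters):
            members = points[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers, labels

def nearest_neighbours(query, points, centers, labels, n=10, probe=3):
    """Search only the `probe` clusters whose centers are closest to `query`."""
    close_clusters = np.argsort(np.linalg.norm(centers - query, axis=1))[:probe]
    candidates = points[np.isin(labels, close_clusters)]
    order = np.argsort(np.linalg.norm(candidates - query, axis=1))
    return candidates[order[:n]]

if __name__ == "__main__":
    pts = np.random.default_rng(1).uniform(-90.0, 90.0, size=(20_000, 2))
    centers, labels = kmeans(pts, num_clusters=200)
    print(nearest_neighbours(pts[0], pts, centers, labels, n=10))

With thousands of clusters over a million points, each query touches only a small fraction of the data rather than scanning all of it, which is the point of the auxiliary structure.
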
>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <[EMAIL PROTECTED]
>> > wrote:
>>
>>> Right, but if I understood your suggestion, you look at the end goal,
>>> which is:
>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>
>>> for example, and you say: here we basically see a cluster, that cluster
>>> is represented by the point [40.123,-50.432], and the points it contains
>>> are [[41.431,-43.32],[...,...],...,[...]].
>>> Meaning: for every point I have in the dataset, you create a cluster.
>>> If you don't mean that, but rather mean to create clusters based on some
>>> random-seed points or whatnot, that would mean that I'll have points
>>> (talking about the "end goal") that won't have enough points in their list.
>>>
>>> One of the criteria for a clustering is that for any clusters C_i and
>>> C_j (where i != j), C_i intersect C_j is empty.
>>>
>>> And again, how can I accomplish my task without running Mahout / a knn
>>> algorithm? Just by calculating distances between points, a join of a
>>> file with itself?
>>>
>>> Thanks
>>>
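
For comparison, the plain O(n^2) "join of a file with itself" that the question asks about can be sketched as below; this is a toy, in-memory illustration rather than a MapReduce job, and the tab-separated output imitates the end-goal line quoted earlier:

import heapq, math

def self_join(points, n=10):
    """For every point, brute-force the n closest other points: O(n^2)."""
    result = {}
    for i, (x1, y1) in enumerate(points):
        dists = []
        for j, (x2, y2) in enumerate(points):
            if i != j:
                dists.append((math.hypot(x1 - x2, y1 - y2), (x2, y2)))
        result[(x1, y1)] = [p for _, p in heapq.nsmallest(n, dists)]
    return result

if __name__ == "__main__":
    pts = [(40.123, -50.432), (41.431, -43.32), (39.9, -50.0), (10.0, 10.0)]
    for point, neighbours in self_join(pts, n=2).items():
        print(f"{list(point)}\t{[list(q) for q in neighbours]}")
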
>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>>
>>>>> I understand your solution (I think); I just didn't think of it in that
>>>>> particular way.
>>>>> I think that, say I have 1M data-points and I run knn, then k=1M and
>>>>> n=10 (each point is a cluster that holds up to 10 points) is overkill.
>>>>>
>>>>
>>>> I am not sure I understand you.  n = number of points.  k = number of
>>>> clusters.  For searching 1 million points, I would recommend thousands of
>>>> clusters.
>>>>
>>>>
>>>>> How can I achieve the same result WITHOUT using Mahout, just running
>>>>> over the dataset? I even think it'll be the same complexity (O(n^2)).
>>>>>
>>>>
>>>> Running with a good knn package will give you roughly O(n log n)
>>>> complexity.
>>>>
>>>>
>>>
>>
>
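
As a rough illustration of the closing point about O(n log n): with low-dimensional points, a k-d tree builds in about O(n log n) and answers each 10-nearest-neighbour query in roughly O(log n), so producing every point's neighbour list stays near O(n log n) overall. The use of SciPy here is an assumption on my part; the thread itself only mentions Mahout.

import numpy as np
from scipy.spatial import cKDTree

pts = np.random.default_rng(0).uniform(-90.0, 90.0, size=(1_000_000, 2))
tree = cKDTree(pts)                # build: roughly O(n log n)
dist, idx = tree.query(pts, k=11)  # row i: point i itself plus its 10 nearest
neighbours = idx[:, 1:]            # drop the self-match in column 0
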
Collapsed messages in this thread: dexter morgan 2012-08-31, 13:03; dexter morgan 2012-09-02, 16:26; Ted Dunning 2012-09-03, 19:47; dexter morgan 2012-08-27, 20:15; Björn-Elmar Macek 2012-09-04, 08:17; Ted Dunning 2012-08-27, 21:52; dexter morgan 2012-08-28, 13:48