Re: best way to join?
I don't know off-hand.  I don't understand the importance of your
constraint either.

On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <[EMAIL PROTECTED]> wrote:

> OK, but as I said before, how do I achieve the same result without
> clustering, with just a linear pass? Basically a join of the data set with
> itself,
>
> calculating the distance as I go.
>
> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> I don't mean that.
>>
>> I mean that a k-means clustering with pretty large clusters is a useful
>> auxiliary data structure for finding nearest neighbors.  The basic outline
>> is that you find the nearest clusters and search those for near neighbors.
>>  The first riff is that you use a clever data structure for finding the
>> nearest clusters so that you can do that faster than linear search.  The
>> second riff is when you use another clever data structure to search each
>> cluster quickly.
>>
>> There are fancier data structures available as well.
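
The outline above (find the nearest clusters, then search only those for near neighbors) can be sketched roughly as follows. This is a minimal illustration, not the Mahout implementation: the function names, the `probes` parameter, the plain Euclidean distance, and the tiny Lloyd-style k-means are all assumptions made for the example, and the centroid lookup here is still a linear scan rather than the faster structure the message suggests.

```python
import heapq
import math
import random

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        c = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[c].append(p)
    return clusters

def kmeans(points, k, iterations=10, seed=0):
    """Very small Lloyd-style k-means (hypothetical helper, not Mahout's)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(xs) / len(ms) for xs in zip(*ms)) if ms else centroids[i]
            for i, ms in enumerate(clusters)
        ]
    return centroids, assign(points, centroids)

def approx_neighbors(query, centroids, clusters, n=10, probes=3):
    """Search only the `probes` clusters whose centroids are closest to `query`."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    # Skip points equal to the query itself.
    candidates = [p for i in order[:probes] for p in clusters[i] if p != query]
    return heapq.nsmallest(n, candidates, key=lambda p: math.dist(query, p))
```

The result is approximate: a true neighbor that sits in a cluster that was not probed is missed, so `probes` trades accuracy for speed. Clusters may overlap nothing at all and still serve this purpose, since a point's neighbor list is built from several clusters rather than from one.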
>>
>>
>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <[EMAIL PROTECTED]> wrote:
>>
>>> Right, but if I understood your suggestion, you look at the end goal,
>>> which is:
>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>
>>> for example, and you say: here we basically see a cluster, and that cluster
>>> is represented by the point [40.123,-50.432].
>>> Which points does this cluster contain?
>>> [[41.431,-43.32],[...,...],...,[...]]
>>> Meaning that you create a cluster for every point I have in the dataset.
>>> If you don't mean that, but instead mean creating clusters based on some
>>> random seed points or whatnot, that would mean
>>> that I'll have points (talking about the "end goal") that won't have
>>> enough points in their list.
>>>
>>> One of the criteria for a clustering is that for any two clusters C_i and
>>> C_j (where i != j), the intersection of C_i and C_j is empty.
>>>
>>> And again, how can I accomplish my task without running Mahout or a kNN
>>> algorithm, just by calculating the distance between points?
>>> A join of a file with itself.
>>>
>>> Thanks
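
For reference, the plain self-join asked about here (compare every point with every other point and keep the closest few) might look like the sketch below. It assumes the points fit in memory, uses Euclidean distance on the raw coordinates (an assumption; the thread does not say which distance is intended), and the names `nearest_neighbors` and `format_line` are hypothetical. It is the O(n^2) approach, shown on a single machine rather than as a MapReduce job.

```python
import heapq
import math

def nearest_neighbors(points, n=10):
    """Brute-force self-join: pair each point with its n closest other points."""
    result = []
    for i, p in enumerate(points):
        # Every other point in the data set is a candidate, so this is O(N^2) overall.
        others = (q for j, q in enumerate(points) if j != i)
        closest = heapq.nsmallest(n, others, key=lambda q: math.dist(p, q))
        result.append((p, closest))
    return result

def format_line(point, neighbors):
    """Render one record as point<TAB>list-of-neighbors, like the end goal in the thread."""
    fmt = lambda p: "[%s,%s]" % p
    return fmt(point) + "\t[" + ",".join(fmt(q) for q in neighbors) + "]"

if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
    for point, neighbors in nearest_neighbors(sample, n=2):
        print(format_line(point, neighbors))
```

With 1M points this performs on the order of 10^12 distance computations, which is the cost the clustering-based search is meant to avoid.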
>>>
>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <[EMAIL PROTECTED]> wrote:
>>>>
>>>>>
>>>>> I understand your solution (I think); I hadn't thought of it in that
>>>>> particular way.
>>>>> I think that if, let's say, I have 1M data points, running kNN with
>>>>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>>> is overkill.
>>>>>
>>>>
>>>> I am not sure I understand you.  n = number of points.  k = number of
>>>> clusters.  For searching 1 million points, I would recommend thousands of
>>>> clusters.
>>>>
>>>>
>>>>> How can I achieve the same result WITHOUT using Mahout, just running
>>>>> on the dataset? I even think it'll have the same complexity (O(n^2)).
>>>>>
>>>>
>>>> Running with a good knn package will give you roughly O(n log n)
>>>> complexity.
>>>>
>>>>
>>>
>>
>