

On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <[EMAIL PROTECTED]>wrote:
> > I understand your solution ( i think) , didn't think of that, in that > particular way. > I think that lets say i have 1M datapoints, and running knn , that the > k=1M and n=10 (each point is a cluster that requires up to 10 points) > is an overkill. >
I am not sure I understand you. n = number of points. k = number of clusters. For searching 1 million points, I would recommend thousands of clusters. > How can i achieve the same result WITHOUT using mahout, just running on > the dataset , i even think it'll be in the same complexity (o(n^2)) >
Running with a good knn package will give you roughly O(n log n) complexity.
Right, but if i understood your sugesstion, you look at the end goal , which is: 1[40.123,50.432]\t[[41.431,43.32],[...,...],...,[...]]
for example, and you say: here we see a cluster basically, that cluster is represented by the point: [40.123,50.432] which points does this cluster contains? [[41.431, 43.32],[...,...],...,[...]] meaning: that for every point i have in the dataset, you create a cluster. If you don't mean that, but you do mean to create clusters based on some randomseed points or what not, that would mean that i'll have points (talking about the "end goal") that won't have enough points in their list.
one of the criterions for a clustering is that for any clusters: C_i and C_j (where i != j), C_i intersect C_j is empty
and again, how can i accomplish my task with out running mahout / knn algo? just by calculating distance between points? join of a file with it self.
Thanks
On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > > On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <[EMAIL PROTECTED]>wrote: > >> >> I understand your solution ( i think) , didn't think of that, in that >> particular way. >> I think that lets say i have 1M datapoints, and running knn , that the >> k=1M and n=10 (each point is a cluster that requires up to 10 points) >> is an overkill. >> > > I am not sure I understand you. n = number of points. k = number of > clusters. For searching 1 million points, I would recommend thousands of > clusters. > > >> How can i achieve the same result WITHOUT using mahout, just running on >> the dataset , i even think it'll be in the same complexity (o(n^2)) >> > > Running with a good knn package will give you roughly O(n log n) > complexity. > >
Yes. I think you misunderstood.
My suggestion is that the few clusters near a point are very likely to contain the nearest points. If you scan these clusters computing the distance to your original point, you should be able to find a list of points that overlaps with your desired result. Keep in mind that I am suggesting that you use a LOT of clusters here, much more than is typical. Typically, I recommend sqrt(N)/m clusters for this kind of work where m is a small constant 1 <= m <= 30.
The basic idea for the method is that there are many points that are obviously far from your query. You don't have to compute the distance to these data points to your query since you can eliminate them from consideration. Many of them can be absolutely eliminated by appeal to the triangle inequality, but many more can be eliminated safely with reasonably high probability. It is a reasonable heuristic that the farthest clusters contain points that can be eliminated this way. Another way to state this is to say that you should only search through the points that are in clusters that are near your original cluster.
On Fri, Aug 31, 2012 at 9:03 AM, dexter morgan <[EMAIL PROTECTED]>wrote:
> Hi Ted, > > First of all, i'd like to know how to make a map/reduce job that does join > on the inputfile it self. > Second, maybe your clustering approach be usefull, i still think it's not > correct. > Reason: > Lets say i want to find the 10 closest points for a given point. Point: > [120,90] for example. > Clustering approach: which cluster has [120,90] as a node? answer: the > cluster at [300,200] > Now, if i understood you, i should get the 10 nearest neighbors of > [300,200] (again, you didn't elaborate much on this or i didn't understand > it) > > But i require the 10 nearest to [120,90] , not to [300,200]. Even if i > know the distances from [120,90] to [300,200] and to the 10 nearest points > to [300,200] it won't help me, because maybe the 10 nearest points to > [120,90] are actually starting from the 5000th nearest points to [300,200]. > > In the end my goal is to preprocess (as i wrote at the begining) this > list of N nearest points for every point in the file. Where N is a > parameter given to the job. Let's say 10 points. That's it. > No calculation afterwards, only querying that list. > > Thank you > > On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning <[EMAIL PROTECTED]>wrote: > >> I don't know offhand. I don't understand the importance of your >> constraint either. >> >> >> On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <[EMAIL PROTECTED]>wrote: >> >>> Ok, but as i said before, how do i achieve the same result with out >>> clustering , just linear. Join on the same dataset basically? >>> >>> and calculating the distance as i go >>> >>> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <[EMAIL PROTECTED]>wrote: >>> >>>> I don't mean that. >>>> >>>> I mean that a kmeans clustering with pretty large clusters is a useful >>>> auxiliary data structure for finding nearest neighbors. The basic outline >>>> is that you find the nearest clusters and search those for near neighbors. >>>> The first riff is that you use a clever data structure for finding the >>>> nearest clusters so that you can do that faster than linear search. The >>>> second riff is when you use another clever data structure to search each >>>> cluster quickly. >>>> >>>> There are fancier data structures available as well. >>>> >>>> >>>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan < >>>> [EMAIL PROTECTED]> wrote: >>>> >>>>> Right, but if i understood your sugesstion, you look at the end goal , >>>>> which is: >>>>> 1[40.123,50.432]\t[[41.431,43.32],[...,...],...,[...]] >>>>> >>>>> for example, and you say: here we see a cluster basically, that >>>>> cluster is represented by the point: [40.123,50.432] >>>>> which points does this cluster contains? [[41.431, >>>>> 43.32],[...,...],...,[...]] >>>>> meaning: that for every point i have in the dataset, you create a
Elmar, Right, thanks a lot for your help, If you'll read what Ted suggested its basically this. I'm interesting in knowing how to do this using JOIN (mapjoin + reducerjoin i suppose) as well... though i'll go with the mahout approach Best, Dex On Tue, Sep 4, 2012 at 11:17 AM, BjörnElmar Macek <[EMAIL PROTECTED]>wrote: > Hi Dexter, > > i think, what you want is a clustering of points based on the euclidian > distance or density based clustering ( http://en.wikipedia.org/wiki/**> Cluster_analysis < http://en.wikipedia.org/wiki/Cluster_analysis> ). I bet > there are some implemented quite well in Mahout already: afaik this is the > datamining framework based on Hadoop. > > Best luck! > Elmar > > > Am 27.08.2012 22:15, schrieb dexter morgan: > > Dear list, >> >> Lets say i have a file, like this: >> id \t at,tlng < structure >> >> 1\t40.123,50.432 >> 2\t41.431,43.32 >> ... >> ... >> lets call it: 'points.txt' >> I'm trying to build a mapreduce job that runs over this BIG points file >> and it should output >> a file, that will look like: >> id[lat,lng] \t [list of points in JSON standard] < structure >> >> 1[40.123,50.432]\t[[41.431,**43.32],[...,...],...,[...]] >> 2[41.431,43.32]\t[[40.123,**50.432],...[,]] >> ... >> >> Basically it should run on ITSELF, and grab for each point the N (it will >> be an argument for the job) CLOSEST points (the mappers should calculate >> the distance).. >> >> Distributed cache is not an option, what else? not sure if to classify >> it as a mapjoin , reducejoin or both? >> Would you do this in HIVE some how? >> Is it feasible in a single job? >> >> Would LOVE to hear your suggestions, code (if you think its not that >> hard) or what not. >> BTW using CDH3  rev 3 (20.23) >> >> Thanks! >> > >

