
+
Steve Lewis 20130814, 17:06

Re: How do I perform a scalable cartesian product
Hi Steve,
if the only problem is that the size of your zipcode squared (more accurately, n * (n1) if you order the pairs of persons, assuming that distance is symmetric) is too large, it might help to split the zipcode into buckets by some hash function and partition the allpairs computations over buckets. That is, if you have 10 buckets containing the users of a zipcode and its neighboring zipcodes, first compute all pairs of persons within bucket 1, then all pairs of persons in bucket 1 and 2, then 13, 14 etc. up to 1010. Obviously, you don't need to do 41 if you've already done 14 (symmetry, see above), so you'll end up doing n * (n+1) pairs of buckets (55 in this case). Basically, this means creating artificial, smaller zipcodes. Apart from that, I'd like to point out that there's been a lot of research on nearestneighbor search; perhaps some stateoftheart algorithm will be applicable to your problem. http://en.wikipedia.org/wiki/Nearest_neighbor_search Hope this helps, Christoph On 14.08.2013 19:06, Steve Lewis wrote: > I have the problem of performing a operation of a data set on itself. > > Assume, for example, that I have a list of people and their > addresses and for each person I want the ten closest members of the set. > (this is not the problem but illustrated critical aspects). I know that > the ten closest people will be in the same zipcode or a neighboring zip > code. This means unless the database is very large I can have the mapper > send every person out with keys representing their zipcode and also > keys representing the neighboring zip codes. In the reducer I can keep > all people in memory and compute distances between them (assume the > distance computation is slightly expensive). > The problem is that this approach will not scale  eventually the > number of people assigned to a zip code will exceed memory. In the > current problem the number of "people" is about 100 million and doubling > every 6 months. The size of a "zipcode" requires keeping about 100,000 > items in memory  doable today but marginal in terms of future growth. > Are there other ways to solve the problem. I considered keeping a > random subset, finding the closest in that subset and then repeating > with different random subsets. The solution of midifying the splitter to > generate all pairs > https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java will > not work for a dataset with 100 million items > Any bright ideas? > > > >  Christoph Schmitz SoftwareArchitekt Targeting Core Product 1&1 Internet AG  Brauerstraï¿½e 50  76135 Karlsruhe  Germany Phone: +49 721 913746733 EMail: [EMAIL PROTECTED]  Web: www.1und1.de Hauptsitz Montabaur, Amtsgericht Montabaur, HRB 6484 Vorstand: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann, Andreas Hofmann, Markus Huhn, HansHenning Kettler, Uwe Lamnek, Jan Oetjen, Christian Wï¿½rst Aufsichtsratsvorsitzender: Michael Scheeren Member of United Internet Diese EMail kann vertrauliche und/oder gesetzlich geschï¿½tzte Informationen enthalten. Wenn Sie nicht der bestimmungsgemï¿½ï¿½e Adressat sind oder diese EMail irrtï¿½mlich erhalten haben, unterrichten Sie bitte den Absender und vernichten Sie diese Email. Anderen als dem bestimmungsgemï¿½ï¿½en Adressaten ist untersagt, diese EMail zu speichern, weiterzuleiten oder ihren Inhalt auf welche Weise auch immer zu verwenden. This EMail may contain confidential and/or privileged information. If you are not the intended recipient of this EMail, you are hereby notified that saving, distribution or use of the content of this EMail in any way is prohibited. If you have received this EMail in error, please notify the sender and delete the EMail. 

