I have the problem of performing a operation of a data set on itself.
Assume, for example, that I have a list of people and their addresses
and for each person I want the ten closest members of the set. (this is not
the problem but illustrated critical aspects). I know that the ten closest
people will be in the same zipcode or a neighboring zip code. This means
unless the database is very large I can have the mapper send every person
out with keys representing their zipcode and also keys representing the
neighboring zip codes. In the reducer I can keep all people in memory and
compute distances between them (assume the distance computation is slightly
The problem is that this approach will not scale - eventually the number
of people assigned to a zip code will exceed memory. In the current problem
the number of "people" is about 100 million and doubling every 6 months.
The size of a "zipcode" requires keeping about 100,000 items in memory -
doable today but marginal in terms of future growth.
Are there other ways to solve the problem. I considered keeping a random
subset, finding the closest in that subset and then repeating with
different random subsets. The solution of midifying the splitter to
generate all pairs
not work for a dataset with 100 million items
Any bright ideas?