Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> How do I perform a scalable cartesian product


Copy link to this message
-
How do I perform a scalable cartesian product
 I have the problem of performing a operation of a data set on itself.

   Assume, for example, that I have a list of people and their addresses
and for each person I want the ten closest members of the set. (this is not
the problem but illustrated critical aspects). I know that the ten closest
people will be in the same zipcode or a neighboring zip code. This means
unless the database is very large I can have the mapper send every person
out with keys representing  their zipcode and also keys representing the
neighboring zip codes. In the reducer I can keep all people in memory and
compute distances between them (assume the distance computation is slightly
expensive).
   The problem is that this approach will not scale - eventually the number
of people assigned to a zip code will exceed memory. In the current problem
the number of "people" is about 100 million and doubling every 6 months.
The size of a "zipcode" requires keeping about 100,000 items in memory -
doable today but marginal in terms of future growth.
   Are there other ways to solve the problem. I considered keeping a random
subset, finding the closest in that subset and then repeating with
different random subsets. The solution of midifying the splitter to
generate all pairs
https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
will
not work for a dataset with 100 million items
   Any bright ideas?
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB