
Re: unsort algorithmus in map/reduce
On 25.10.2011 14:21, Niels Basjes wrote:
> Why not do something very simple: use the MD5 of the URL as the key
> you sort by.
> This scales very easily and gives a highly randomized order.
> It may not give the optimal maximum distance, but it is certainly a very
> good distribution and very easy to build.
I tried it, and the problem is that sites with a lot of URLs block the
queue. You can have a few sites with 5 million URLs each; they take up the
major portion of the queue, and small sites are not crawled.
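
To illustrate the effect, here is a minimal sketch in plain Python (not
Hadoop code). The host names, URL counts, and the helper names md5_key and
shuffle_by_md5 are made up for the example. It sorts URLs by their MD5 hash
and then counts which hosts end up at the head of the queue:

```python
import hashlib
from collections import Counter


def md5_key(url: str) -> str:
    """Return the hex MD5 digest of the URL, used only as a sort key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def shuffle_by_md5(urls):
    """Order URLs by MD5 hash: deterministic, but pseudo-random looking."""
    return sorted(urls, key=md5_key)


# Hypothetical input: one big site with 5000 URLs, ten small sites with 5 each.
urls = [f"http://big.example/page{i}" for i in range(5000)]
for s in range(10):
    urls += [f"http://small{s}.example/page{i}" for i in range(5)]

ordered = shuffle_by_md5(urls)

# Count hosts among the first 100 queue entries.
head = ordered[:100]
hosts = Counter(u.split("/")[2] for u in head)
print(hosts.most_common(3))
```

The order is well mixed, but each host's share of the queue head is about
the same as its share of all URLs, so the big site still fills nearly every
slot and the small sites wait.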