Re: unsort algorithmus in map/reduce
Radim Kolar 2011-10-27, 09:36
> If on the other hand, you want to guarantee that you don't swamp the
> servers on each domain and you are trying to throttle
> your fetchers, then you want to do something like re-write the urls
> to be backwards:
I didn't get why they have to be backwards: if we are interested in the
queue distance between URLs from the same origin server, then the distance is the same.
Or did you want to reverse them like
Then I am not sure this ordering is better than pure random or md5.
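
To make the comparison concrete, here is a minimal sketch in plain Python (not Hadoop code). The function names and sample URLs are made up for illustration, and the "reversed host labels" form is only an assumption about what "backwards" could mean. It shows that a reversed-host key keeps URLs of one site next to each other in sorted order, while an md5 key orders them without regard to the host.

```python
import hashlib
from urllib.parse import urlsplit


def reversed_host_key(url):
    """Hypothetical sort key: host labels reversed, followed by the path.

    Example: http://www.example.com/a -> com.example.www/a
    (the query string is ignored in this sketch).
    """
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


def md5_key(url):
    """Sort key that ignores host structure: hex MD5 digest of the whole URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Made-up sample URLs from two sites.
urls = [
    "http://www.example.com/a",
    "http://www.example.com/b",
    "http://www.example.org/a",
    "http://www.example.org/b",
]

# Sorting by the reversed-host key groups each site's URLs together.
by_reversed_host = sorted(urls, key=reversed_host_key)

# Sorting by the md5 key gives an order unrelated to the host.
by_md5 = sorted(urls, key=md5_key)
```

With the reversed-host key the distance between two URLs of the same site in the sorted queue stays small, which is the opposite of what an "unsort" is after.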
> and use a total ordering of the sort. (You'll need to sample the data
> to pick the cut points.) That will limit each site to one or
> occasionally two mappers and thus the maximum number of concurrent
> fetchers will be the number of threads in each mapper.
I need to spread each site across as many mappers as possible, because there
is a crawl delay between requests per site.
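
A rough sketch of what I mean by spreading, in plain Python rather than Hadoop code. The names `mapper_for` and `mappers_per_site`, the mapper count of 8, and the generated URLs are all hypothetical. Each URL is assigned to a mapper by hashing the whole URL, and then we count how many distinct mappers each site ends up on.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def mapper_for(url, num_mappers):
    """Pick a mapper index from the MD5 digest of the full URL."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_mappers


def mappers_per_site(urls, num_mappers):
    """Return {host: number of distinct mappers that receive its URLs}."""
    seen = defaultdict(set)
    for url in urls:
        seen[urlsplit(url).hostname].add(mapper_for(url, num_mappers))
    return {host: len(ids) for host, ids in seen.items()}


# Made-up input: 1000 URLs from a single site, 8 hypothetical mappers.
urls = ["http://www.example.com/page%d" % i for i in range(1000)]
spread = mappers_per_site(urls, 8)
```

This only illustrates the spreading. It does nothing to enforce the crawl delay itself.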