On Tue, Oct 25, 2011 at 8:35 AM, Radim Kolar <[EMAIL PROTECTED]> wrote:
> On 25.10.2011 14:21, Niels Basjes wrote:
>> Why not do something very simple: Use the MD5 of the URL as the key you do
>> the sorting by.
>> This scales very easily and gives a highly randomized order.
>> Maybe not the optimal maximum distance, but certainly a very good
>> distribution and very easy to build.
> I tried it and the problem is that sites with a lot of URLs block the
> queue. You can have a few sites with 5M URLs each, and they take up the
> major portion of the queue while small sites are not crawled.
If you are trying to spread out the workload for a given site, sorting the
URLs into MD5 order, as Niels said, is probably your best option. (Don't
forget to use the multi-threaded mapper!)
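Niels's MD5 idea can be sketched in plain Java outside Hadoop; the class and helper names below are mine, not from the thread. The point is that sorting by the hash interleaves URLs from the same host rather than keeping them adjacent:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class Md5SortKey {
    // Hex MD5 of the URL, used as the sort key.
    static String md5Key(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://big-site.com/page1",
            "http://big-site.com/page2",
            "http://small-site.org/index");
        // Sorting by the hash scatters same-host URLs across the order
        // instead of clustering them.
        urls.sort(Comparator.comparing(Md5SortKey::md5Key));
        for (String u : urls) {
            System.out.println(md5Key(u) + "  " + u);
        }
    }
}
```

As Radim notes above, this randomizes the order but does nothing to cap how much of the queue one large site occupies.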
If, on the other hand, you want to guarantee that you don't swamp the
servers on each domain and you are trying to throttle your fetchers, then
you want to do something like re-writing the URLs to be backwards and using
a total ordering for the sort. (You'll need to sample the data to pick the
cut points.) That will limit each site to one, or occasionally two, mappers,
and thus the maximum number of concurrent fetchers will be the number of
threads in each mapper.
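A minimal sketch of the URL-reversal step (the class and method names are mine; in Hadoop, keys like this would feed a TotalOrderPartitioner, with the sampled cut points deciding the partition boundaries):

```java
import java.net.URI;

public class ReversedUrlKey {
    // Reverse the host labels so every URL from one domain sorts together:
    // "http://www.example.com/page" -> "com.example.www/page".
    static String reverseHost(String url) {
        URI u = URI.create(url);
        String[] labels = u.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            if (key.length() > 0) key.append('.');
            key.append(labels[i]);
        }
        if (u.getPath() != null) key.append(u.getPath());
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseHost("http://www.example.com/page"));
        System.out.println(reverseHost("http://mail.example.com/inbox"));
    }
}
```

With a total order over such keys, each domain's URLs land in at most one or two adjacent partitions, which is what bounds the number of concurrent fetchers hitting any one site.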