|
|
-
unsort algorithmus in map/reduce
Radim Kolar 2011-10-25, 10:45
Hi, i am having problem implementing unsort for crawler in map/reduce. I have list of URLs waiting to fetch, they needs to be reordered for maximum distance between URLs from one domain. idea is to do map URL -> domain, URL test.com, http://www.test.com/page1.html test.com, http://www.test.com/page2.html test.com, http://www.test.com/page3.html test2.com, http://www.test2.com/page1.html test2.com, http://www.test2.com/page2.html test2.com, http://www.test2.com/page3.html reduce test.com, <list> -> priority, URL 10, http://www.test.com/page1.html 9, http://www.test.com/page2.html 8, http://www.test.com/page3.html10, http://www.test2.com/page1.html 9, http://www.test2.com/page2.html 8, http://www.test2.com/page3.htmlNow i need to order output by key 10, http://www.test.com/page1.html10, http://www.test2.com/page1.html 9, http://www.test.com/page2.html 9, http://www.test2.com/page2.html 8, http://www.test.com/page3.html 8, http://www.test2.com/page3.htmland write list of URLs in this order to output files. Like 50k urls to file1, next 50k to file2 and so on. Can you give me an idea how to sort using mapred and how to process sorted data and split them into files?
-
Re: unsort algorithmus in map/reduce
Niels Basjes 2011-10-25, 12:21
Why not do something very simple: Use the MD5 of the URL as the key you do the sorting by. This scales very easy and highly randomized order. Maybe not the optimal maximum distance, but certainly a very good distribution and very easy to built. Niels Basjes 2011/10/25 Radim Kolar <[EMAIL PROTECTED]> > Hi, i am having problem implementing unsort for crawler in map/reduce. > > I have list of URLs waiting to fetch, they needs to be reordered for > maximum distance between URLs from one domain. > > idea is to do > map URL -> domain, URL > > test.com, http://www.test.com/page1.html> test.com, http://www.test.com/page2.html> test.com, http://www.test.com/page3.html> test2.com, http://www.test2.com/page1.**html<http://www.test2.com/page1.html>> test2.com, http://www.test2.com/page2.**html<http://www.test2.com/page2.html>> test2.com, http://www.test2.com/page3.**html<http://www.test2.com/page3.html>> > reduce test.com, <list> -> priority, URL > > 10, http://www.test.com/page1.html> 9, http://www.test.com/page2.html> 8, http://www.test.com/page3.html> 10, http://www.test2.com/page1.**html < http://www.test2.com/page1.html>> 9, http://www.test2.com/page2.**html < http://www.test2.com/page2.html>> 8, http://www.test2.com/page3.**html < http://www.test2.com/page3.html>> > > Now i need to order output by key > > 10, http://www.test.com/page1.html> 10, http://www.test2.com/page1.**html < http://www.test2.com/page1.html>> 9, http://www.test.com/page2.html> 9, http://www.test2.com/page2.**html < http://www.test2.com/page2.html>> 8, http://www.test.com/page3.html> 8, http://www.test2.com/page3.**html < http://www.test2.com/page3.html>> > and write list of URLs in this order to output files. Like 50k urls to > file1, next 50k to file2 and so on. > > Can you give me an idea how to sort using mapred and how to process sorted > data and split them into files? > -- Best regards / Met vriendelijke groeten, Niels Basjes
-
Re: unsort algorithmus in map/reduce
Radim Kolar 2011-10-25, 15:35
Dne 25.10.2011 14:21, Niels Basjes napsal(a): > Why not do something very simple: Use the MD5 of the URL as the key > you do the sorting by. > This scales very easy and highly randomized order. > Maybe not the optimal maximum distance, but certainly a very good > distribution and very easy to built. I tried it and problem is that sites with lot of URLs block queue. You can have few sites with 5m urls and they take major portion of queue and small sites are not crawled.
-
Re: unsort algorithmus in map/reduce
Owen O'Malley 2011-10-25, 16:12
On Tue, Oct 25, 2011 at 8:35 AM, Radim Kolar <[EMAIL PROTECTED]> wrote:
> Dne 25.10.2011 14:21, Niels Basjes napsal(a): > > Why not do something very simple: Use the MD5 of the URL as the key you do >> the sorting by. >> This scales very easy and highly randomized order. >> Maybe not the optimal maximum distance, but certainly a very good >> distribution and very easy to built. >> > I tried it and problem is that sites with lot of URLs block queue. You can > have few sites with 5m urls and they take major portion of queue and small > sites are not crawled. >
If you are trying to spread out the workload for a given site, sorting the urls into md5 order as Niels said, is probably your best option. (Don't forget to use the multi-threaded mapper!)
If on the other hand, you want to guarantee that you don't swamp the servers on each domain and you are trying to throttle your fetchers, then you want to do something like re-write the urls to be backwards:
com.test.www/http/page1.html com.test.www/http/page2.html com.test.www/http/page3.html com.test2.www/http/page1.html com.test2.www/http/page2.html
and use a total ordering of the sort. (You'll need to sample the data to pick the cut points.) That will limit each site to one or occasionally two mappers and thus the maximum number of concurrent fetchers will be the number of threads in each mapper.
-- Owen
-
Re: unsort algorithmus in map/reduce
Radim Kolar 2011-10-27, 09:36
> If on the other hand, you want to guarantee that you don't swamp the servers on each domain and you are trying to throttle > your fetchers, then you want to do something like re-write the urls to be backwards: > > com.test.www/http/page1.html > com.test.www/http/page2.html > com.test.www/http/page3.html > com.test2.www/http/page1.html > com.test2.www/http/page2.html I didnt get why they have to be backwards because if we are interested in URL queue distance from same origin server then distance is same.
or you wanted to reverse them like
page1.html/com.test.www/http page1.html/com.test2.www/http
then i am not sure if this ordering is better then pure random or md5.
> and use a total ordering of the sort. (You'll need to sample the data > to pick the cut points.) That will limit each site to one or > occasionally two mappers and thus the maximum number of concurrent > fetchers will be the number of threads in each mapper. I need to spread site between as much mappers as possible because there is crawl delay between requests per site.
|
|