Re: unsort algorithmus in map/reduce
> If on the other hand, you want to guarantee that you don't swamp the
> servers on each domain and you are trying to throttle your fetchers,
> then you want to do something like re-write the urls to be backwards:
>
> com.test.www/http/page1.html
> com.test.www/http/page2.html
> com.test.www/http/page3.html
> com.test2.www/http/page1.html
> com.test2.www/http/page2.html
I didn't get why they have to be backwards. If what we care about is the
distance in the queue between URLs from the same origin server, then that
distance is the same whether the host name is reversed or not.

Or did you mean to reverse them like this:

page1.html/com.test.www/http
page1.html/com.test2.www/http

In that case I am not sure this ordering is any better than pure random or MD5.
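
To make the comparison concrete, here is a minimal sketch of the two sort
keys being discussed: the reversed-host key from the quoted example and an
MD5 key. It is plain Python, not Hadoop code, and the helper names
(reversed_host_key, md5_key) and the sample URLs are only assumptions for
illustration.

```python
import hashlib
from urllib.parse import urlsplit


def reversed_host_key(url):
    """Rewrite http://www.test.com/page1.html as com.test.www/http/page1.html."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return f"{host}/{parts.scheme}/{parts.path.lstrip('/')}"


def md5_key(url):
    """Hex MD5 digest of the URL; sorting on it gives a pseudo-random order."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical sample URLs matching the quoted example.
urls = [
    "http://www.test2.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test.com/page3.html",
]

# Sorting on the reversed-host key keeps all URLs of one site adjacent.
by_host = sorted(urls, key=reversed_host_key)

# Sorting on the MD5 key ignores the site, so sites end up interleaved.
by_md5 = sorted(urls, key=md5_key)

for url in by_host:
    print(reversed_host_key(url))
```

Sorting on the reversed-host key groups each site together, which is what
the quoted suggestion relies on. The MD5 order has no relation to the site
at all.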

> and use a total ordering of the sort. (You'll need to sample the data
> to pick the cut points.) That will limit each site to one or
> occasionally two mappers and thus the maximum number of concurrent
> fetchers will be the number of threads in each mapper.
I need to spread each site across as many mappers as possible, because
there is a crawl delay between requests to the same site.
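
The difference between the two approaches can be sketched as follows. This
is a plain-Python illustration, not Hadoop code: the function names, the
four sites, the 50 pages per site and the 4 partitions are all assumptions
made up for the example. Range partitioning on cut points (the total
ordering suggested in the quote) keeps a site in one or two partitions,
while hashing the full key spreads a site over many partitions.

```python
import bisect
import hashlib
from collections import defaultdict


def range_partition(key, cut_points):
    """Total-order style: partition index = number of cut points <= key."""
    return bisect.bisect_right(cut_points, key)


def hash_partition(key, num_partitions):
    """Hash the whole key (site and page) so pages of one site scatter."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partitions_per_site(keys, assign):
    """Count how many distinct partitions each site's keys land in."""
    seen = defaultdict(set)
    for key in keys:
        site = key.split("/", 1)[0]
        seen[site].add(assign(key))
    return {site: len(parts) for site, parts in seen.items()}


# Hypothetical data: 4 sites with 50 pages each, keys in reversed-host form.
sites = ["com.test.www", "com.test2.www", "com.test3.www", "com.test4.www"]
keys = [f"{site}/http/page{n}.html" for site in sites for n in range(50)]
num_partitions = 4

# Pick cut points from the sorted keys. A real job would sample the input
# instead of sorting everything; the full sort here keeps the sketch short.
ordered = sorted(keys)
step = len(ordered) // num_partitions
cut_points = [ordered[i * step] for i in range(1, num_partitions)]

by_range = partitions_per_site(keys, lambda k: range_partition(k, cut_points))
by_hash = partitions_per_site(keys, lambda k: hash_partition(k, num_partitions))

print("range partitioning:", by_range)
print("hash partitioning: ", by_hash)
```

With range partitioning each site is confined to very few partitions, which
is the throttling effect described in the quote. With hash partitioning each
site shows up in most partitions, which is the spreading I am after.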