> If, on the other hand, you want to guarantee that you don't swamp the servers on each domain and you are trying to throttle your fetchers, then you want to do something like re-write the urls to be backwards:

> com.test.www/http/page1.html
> com.test.www/http/page2.html
> com.test.www/http/page3.html
> com.test2.www/http/page1.html
> com.test2.www/http/page2.html
I didn't get why they have to be backwards: if what we care about is the distance in the URL queue between URLs from the same origin server, then that distance is the same either way.
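
One possible reading of the reversed form: with the top-level domain first, a plain lexicographic sort keeps all hosts of one domain next to each other, including sibling subdomains such as www.test.com and mail.test.com. For a single host the grouping is indeed the same either way. A minimal Python sketch of that reading, where `reversed_key` is a hypothetical helper and not part of Hadoop or any crawler:

```python
from urllib.parse import urlsplit


def reversed_key(url: str) -> str:
    """Turn http://www.test.com/page1.html into com.test.www/http/page1.html."""
    parts = urlsplit(url)
    # Reverse the host labels so the most general label (the TLD) comes first.
    host = ".".join(reversed(parts.hostname.split(".")))
    path = parts.path.lstrip("/")
    return f"{host}/{parts.scheme}/{path}"


# Example URLs are made up for illustration.
urls = [
    "http://www.test2.com/page1.html",
    "http://www.test.com/page2.html",
    "http://mail.test.com/page1.html",
    "http://www.test.com/page1.html",
]

# A plain sort now groups mail.test.com and www.test.com together,
# ahead of anything under test2.com.
for key in sorted(reversed_key(u) for u in urls):
    print(key)
```

Sorting the unreversed host names would instead put mail.test.com, www.test.com and www.test2.com in an order driven by the leftmost label, so subdomains of one site could end up far apart.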

Or did you mean to reverse them like this:

page1.html/com.test.www/http
page1.html/com.test2.www/http

If so, I am not sure this ordering is any better than pure random or MD5.
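
To make the comparison concrete, here is a rough Python sketch of both options. It assumes 4 mappers and made-up URLs; `md5_partition` and `page_first_key` are hypothetical helpers. The page-first key interleaves hosts only where page names coincide, while the MD5 hash assigns a partition regardless of how pages are named.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit

NUM_MAPPERS = 4  # assumed number of mappers, for illustration only


def md5_partition(url: str, n: int = NUM_MAPPERS) -> int:
    """Pick a partition from the MD5 digest of the full URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % n


def page_first_key(url: str) -> str:
    """Turn http://www.test.com/page1.html into page1.html/com.test.www/http."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return f"{parts.path.lstrip('/')}/{host}/{parts.scheme}"


urls = [
    f"http://www.test{suffix}.com/page{i}.html"
    for suffix in ("", "2")
    for i in range(1, 51)
]

# Record which partitions each host lands in under MD5 partitioning.
spread = defaultdict(set)
for u in urls:
    spread[urlsplit(u).hostname].add(md5_partition(u))

for host, partitions in sorted(spread.items()):
    print(host, "->", sorted(partitions))

# Page-first keys sort by page name first, so hosts alternate
# only because both sites happen to use the same page names.
print(sorted(page_first_key(u) for u in urls)[:4])
```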

> and use a total ordering of the sort. (You'll need to sample the data to pick the cut points.) That will limit each site to one or occasionally two mappers and thus the maximum number of concurrent fetchers will be the number of threads in each mapper.
I need the opposite: each site should be spread across as many mappers as possible, because there is a crawl delay between requests to the same site.
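
For reference, the quoted total-ordering scheme can be sketched roughly as follows in standalone Python (this is not Hadoop's API). The site names, sample size and mapper count are assumptions, and `cut_points` and `partition` are hypothetical helpers. It shows why range partitioning on host-first keys confines a site to very few mappers, which is the behaviour I want to avoid.

```python
import bisect
import random
from collections import defaultdict

NUM_MAPPERS = 4  # assumed number of mappers, for illustration only


def cut_points(sample, n):
    """Pick n - 1 boundary keys at evenly spaced positions of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // n] for i in range(1, n)]


def partition(key, cuts):
    """Index of the key range that contains key (0 .. len(cuts))."""
    return bisect.bisect_right(cuts, key)


# Made-up host-first keys: 20 sites with 50 pages each.
keys = [
    f"com.site{s:02d}.www/http/page{p}.html"
    for s in range(20)
    for p in range(50)
]

random.seed(0)
sample = random.sample(keys, 100)  # sample the data to choose the cut points
cuts = cut_points(sample, NUM_MAPPERS)

# Record which partitions each site touches.
sites = defaultdict(set)
for k in keys:
    sites[k.split("/", 1)[0]].add(partition(k, cuts))

for site, partitions in sorted(sites.items()):
    print(site, "->", sorted(partitions))
```

Because each site occupies one contiguous key range, only the sites that contain a cut point can touch more than one partition. Hashing the full URL has no such contiguity, which is why it spreads a site out.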
