> If on the other hand, you want to guarantee that you don't swamp the
> servers on each domain and you are trying to throttle your fetchers,
> then you want to do something like re-write the urls to be backwards:
>
> com.test.www/http/page1.html
> com.test.www/http/page2.html
> com.test.www/http/page3.html
> com.test2.www/http/page1.html
> com.test2.www/http/page2.html
I didn't get why they have to be backwards. If what we care about is
the distance in the URL queue between URLs from the same origin server,
then that distance is the same whether or not the host name is reversed.
Or did you mean to reverse them like this:
page1.html/com.test.www/http
page1.html/com.test2.www/http
In that case I am not sure this ordering is any better than pure random
or md5.
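
To make the key format concrete, here is a minimal sketch of how the
host-first keys from the quoted example could be built. The function
name host_first_key is made up for illustration, and it assumes plain
URLs with a host name and no port or query string.

```python
from urllib.parse import urlsplit


def host_first_key(url):
    """Build a key such as 'com.test.www/http/page1.html' (hypothetical helper)."""
    parts = urlsplit(url)
    # Reverse the host labels: www.test.com -> com.test.www
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    path = parts.path.lstrip("/")
    return f"{reversed_host}/{parts.scheme}/{path}"


urls = [
    "http://www.test2.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test.com/page3.html",
]

# Sorting the keys puts all pages of one site next to each other.
keys = sorted(host_first_key(u) for u in urls)
for key in keys:
    print(key)
```

With these keys, a plain sort groups every page of a site into one
contiguous run, which is what a range-based split relies on.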
> and use a total ordering of the sort. (You'll need to sample the data
> to pick the cut points.) That will limit each site to one or
> occasionally two mappers and thus the maximum number of concurrent
> fetchers will be the number of threads in each mapper.
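
For reference, the quoted idea (sample the keys, pick cut points, then
assign each key to a range) could be sketched roughly like this. It is
only an illustration in plain Python, not the Hadoop API; the names
pick_cut_points and partition_for and the sample size are assumptions.

```python
import bisect
import random


def pick_cut_points(keys, num_partitions, sample_size=1000, seed=0):
    """Sample the keys and return num_partitions - 1 sorted cut points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced quantiles of the sorted sample.
    return [sample[i * len(sample) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, cut_points):
    """Return the index of the key range (partition) that contains key."""
    return bisect.bisect_right(cut_points, key)


# Hypothetical data: 4 sites with 50 pages each, keyed host-first.
keys = [f"com.site{s}.www/http/page{p:03d}.html"
        for s in range(4) for p in range(50)]
cut_points = pick_cut_points(keys, num_partitions=4)

# Collect the set of partitions each site ends up in.
partitions_per_site = {}
for key in keys:
    site = key.split("/")[0]
    partitions_per_site.setdefault(site, set()).add(partition_for(key, cut_points))
```

Because same-site keys are contiguous in sort order, a site can only
straddle a cut point, so it ends up in one partition or in two adjacent
ones.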
I need to spread each site across as many mappers as possible, because
there is a crawl delay between requests to the same site.
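
What I have in mind is closer to the following rough sketch, where the
mapper is chosen from an md5 hash of the full URL. The name mapper_for
and the figure of 8 mappers are just assumptions for the example.

```python
import hashlib


def mapper_for(url, num_mappers):
    """Pick a mapper index from the md5 hash of the whole URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_mappers


# Hypothetical site with 200 pages, spread over 8 mappers.
site_urls = [f"http://www.test.com/page{i}.html" for i in range(200)]
mappers_used = {mapper_for(u, 8) for u in site_urls}
print(sorted(mappers_used))
```

Hashing the whole URL ignores the host, so pages of one site scatter
over the mappers instead of staying together.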