Hi, this question is a little off topic, but since so many people here are probably experts in this field, I thought someone may know.
I'm experimenting with my own semantic search engine, and I want to test it against a large corpus of web pages. Ideally I'd like a list of the top 10M or top 100M URLs in the world, ranked by something like PageRank. Short of using Nutch to crawl the entire web and computing the ranking myself, are there other ways or resources for getting this (smaller) corpus of top web pages?
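To be concrete about what I'd do with such a list, here's a rough sketch: given a plain-text file of URLs (the file name top_urls.txt is just a placeholder), I'd fetch each page and save the HTML locally as my test corpus.

```python
import os
import hashlib
import urllib.request

CORPUS_DIR = "corpus"
os.makedirs(CORPUS_DIR, exist_ok=True)

# Assumes a plain-text file with one URL per line (hypothetical name).
with open("top_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Name each file by a hash of its URL to avoid filesystem issues
    # with slashes, query strings, and very long names.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(CORPUS_DIR, name)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read()
        with open(path, "wb") as out:
            out.write(html)
    except Exception as e:
        # Skip pages that time out, redirect badly, or refuse the connection.
        print(f"skipped {url}: {e}")
```

So really all I'm missing is the URL list itself.

Thanks, Ian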