Hi Ian,
On Sep 16, 2010, at 2:44pm, Ian Upright wrote:
> Hi, this question is a little off topic, but I thought that since so many
> people on this list are probably experts in this field, someone may know.
>
> I'm experimenting with my own semantic-based search engine, but I want to
> test it with a large corpus of web pages. Ideally I would like to have a
> list of the top 10M or top 100M page-ranked URLs in the world.
>
> Short of using Nutch to crawl the entire web and build this page-rank,
> are there any other ways? What other resources might be available for me
> to get this (smaller) corpus of top web pages?
The Public Terabyte Dataset project would be a good match for what you need:
http://bixolabs.com/datasets/public-terabyte-dataset-project/
Of course, that means we have to actually finish the crawl & finalize
the Avro format we use for the data :)
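If you want to get tooling ready in the meantime: assuming we end up
shipping standard Avro container files, a generic reader is all you'd
need, since the schema travels with the file. A minimal sketch with
Avro's generic Java API (field names like "url" are hypothetical, since
the schema isn't final):

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ReadPages {
        public static void main(String[] args) throws Exception {
            // Avro container files embed their own schema, so the
            // generic reader works without any compiled record classes.
            DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File(args[0]), new GenericDatumReader<GenericRecord>());

            while (reader.hasNext()) {
                GenericRecord page = reader.next();
                // "url" is a hypothetical field name; the real one is
                // whatever schema we end up publishing.
                System.out.println(page.get("url"));
            }

            reader.close();
        }
    }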
There are other free collections of data around, though none that I know
of targets top-ranked pages.
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g