Hi Ian,

On Sep 16, 2010, at 2:44pm, Ian Upright wrote:

Hi, this question is a little off topic, but since so many people on this
list are experts in the field, I thought someone might know.

I'm experimenting with my own semantic-based search engine, and I want to
test it with a large corpus of web pages. Ideally I would like to have a
list of the top 10M or top 100M URLs in the world, ranked by PageRank.

Short of using Nutch to crawl the entire web and computing the PageRank
myself, are there any other ways? What other approaches or resources might
be available for me to get this (smaller) corpus of top web pages?

The Public Terabyte Dataset Project would be a good match for what you need.

http://bixolabs.com/datasets/public-terabyte-dataset-project/

Of course, that means we have to actually finish the crawl & finalize the Avro format we use for the data :)
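If it helps for planning: once the dataset is published, reading it should be straightforward with the stock Avro APIs, since Avro container files carry their schema with them. Here's a minimal sketch in Java. Note the field names (e.g. "url") are hypothetical until we finalize the schema:

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class CrawlDataReader {
        public static void main(String[] args) throws Exception {
            File input = new File(args[0]); // a downloaded .avro part file

            // GenericDatumReader picks up the writer's schema embedded in the
            // container file, so this works without compiling schema classes.
            DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(
                    input, new GenericDatumReader<GenericRecord>());

            GenericRecord record = null;
            while (reader.hasNext()) {
                // Reuse the record object to reduce garbage collection churn.
                record = reader.next(record);

                // "url" is a hypothetical field name - check the published
                // schema once the dataset is finalized.
                System.out.println(record.get("url"));
            }
            reader.close();
        }
    }

You'd run that per-file over whatever slice of the dataset you download, which should be enough to feed a smaller test corpus into your engine.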

There are other free collections of crawl data around, though none that I know of specifically target top-ranked pages.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




