Hi Ian,
On Sep 16, 2010, at 2:44pm, Ian Upright wrote:
> Hi, this question is a little off topic, but I thought that since so many
> people on this list are probably experts in this field, someone may know.
>
> I'm experimenting with my own semantic-based search engine, but I want to
> test it with a large corpus of web pages. Ideally I would like to have a
> list of the top 10M or top 100M page-ranked URLs in the world.
>
> Short of using Nutch to crawl the entire web and build this page-rank,
> are there any other ways? What other resources might be available for me
> to get this (smaller) corpus of top web pages?
The Public Terabyte Dataset project would be a good match for what you need:
http://bixolabs.com/datasets/public-terabyte-dataset-project/
Of course, that means we have to actually finish the crawl & finalize
the Avro format we use for the data :)
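If you want to get tooling ready in the meantime: assuming we end up
shipping standard Avro container files, a generic reader is all you'd
need, since the schema travels with the file. A minimal sketch with
Avro's generic Java API (field names like "url" are hypothetical, since
the schema isn't final):

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ReadPages {
        public static void main(String[] args) throws Exception {
            // Avro container files embed their own schema, so the
            // generic reader works without any compiled record classes.
            DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File(args[0]), new GenericDatumReader<GenericRecord>());

            while (reader.hasNext()) {
                GenericRecord page = reader.next();
                // "url" is a hypothetical field name; the real one is
                // whatever schema we end up publishing.
                System.out.println(page.get("url"));
            }

            reader.close();
        }
    }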
There are other free collections of data around, though none that I know
of targets top-ranked pages.
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g