On Thu, 16 Sep 2010 15:31:02 -0700, you wrote:

>The public terabyte dataset project would be a good match for what you  
>need.
>
>http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
>Of course, that means we have to actually finish the crawl & finalize  
>the Avro format we use for the data :)
>
>There are other free collections of data around, though none that I  
>know of which target top-ranked pages.
>
>-- Ken

Hi Ken, this looks exactly like what I need.  There is the ClueWeb dataset,
http://boston.lti.cs.cmu.edu/Data/clueweb09/  However, one must buy it from
them, the crawl was done in 2009, and it comes on a number of hard drives
which are shipped to you.  Any crawl that would be available as an Amazon
Public Dataset would be totally perfect.

Ian
