On Thu, 16 Sep 2010 15:31:02 -0700, you wrote:
>The public terabyte dataset project would be a good match for what you
>need.
>
>http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
>Of course, that means we have to actually finish the crawl & finalize
>the Avro format we use for the data :)
>
>There are other free collections of data around, though none that I
>know of which target top-ranked pages.
>
>-- Ken

Hi Ken,

This looks like exactly what I need.

There is also the ClueWeb dataset:
http://boston.lti.cs.cmu.edu/Data/clueweb09/

However, one must buy it from them, the crawl was done in 2009, and it is delivered on a number of hard drives that are shipped to you.

Any crawl available as an Amazon Public Dataset would be perfect.

Ian