getting a list of top page-ranked webpages

2010-09-16 Thread Ian Upright
Hi, this question is a little off-topic, but since so many people on this list are probably experts in this field, I thought someone may know. I'm experimenting with my own semantic-based search engine, but I want to test it with a large corpus of web pages. Ideally I would like to have a list of the

Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc wrote:
>A slightly different route to take, but one that should help test/refine a
>semantic parser is wikipedia. They make available their entire corpus, or
>any subset you define. The whole thing is like 14 terabytes, but you can get
>smaller se

Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Thu, 16 Sep 2010 15:31:02 -0700, you wrote:
>The public terabyte dataset project would be a good match for what you
>need.
>
>http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
>Of course, that means we have to actually finish the crawl & finalize
>the Avro format we use for th

Solr rate limiting / DoS attacks

2010-09-29 Thread Ian Upright
Hi, I'm curious what approaches one would take to defend against users attacking a Solr service, especially if it is exposed to the internet rather than an intranet. I'm fairly new to Solr; is there anything built in? Is there anything in place to prevent the search engine from getting overwhel