That's pretty good stuff to know, thanks everybody. For my application, it's pretty hard to crawl pages and then universally assign the desired fields from the text that comes back.
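For illustration, here is a minimal sketch of the kind of per-site field extraction I'm talking about, assuming the Python requests and BeautifulSoup libraries and made-up selectors for an imaginary site:

    # Hypothetical per-site rules: each site needs its own selectors, which is
    # exactly why doing this universally is hard.
    import requests
    from bs4 import BeautifulSoup

    FIELD_SELECTORS = {
        "www.example.com": {"title": "h1", "price": ".price", "summary": "#summary"},
    }

    def extract_fields(url):
        """Fetch a page and pull the desired fields out with site-specific rules."""
        host = url.split("/")[2]
        rules = FIELD_SELECTORS.get(host, {})
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        doc = {"url": url}
        for field, selector in rules.items():
            node = soup.select_one(selector)
            if node is not None:
                doc[field] = node.get_text(strip=True)
        return doc

The resulting doc dict is what would get posted to Solr; the hard part is that something like FIELD_SELECTORS has to be hand-written and maintained for every site you crawl.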
However, I would WELCOME someone with that expertise into the company when it gets funded, to prove me wrong :-)

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life, otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Fri, 9/17/10, Ian Upright <i...@upright.net> wrote:

> From: Ian Upright <i...@upright.net>
> Subject: Re: getting a list of top page-ranked webpages
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 10:50 AM
>
> On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc <ken.fos...@realestate.com> wrote:
>
> > A slightly different route to take, but one that should help test/refine a
> > semantic parser is wikipedia. They make available their entire corpus, or
> > any subset you define. The whole thing is like 14 terabytes, but you can
> > get smaller sets.
>
> Actually, I do heavy analysis of the entire wikipedia, plus 1m top webpages
> from Alexa, and all of dmoz URLs, in order to build the semantic engine in
> the first place. However, an outside corpus is required to test its quality
> outside of this space.
>
> Cheers, Ian
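For anyone who wants to try the Wikipedia route kenf_nc describes, here is a minimal sketch of streaming articles out of a dump subset, assuming a locally downloaded pages-articles .xml.bz2 file and only the Python standard library; the dump filename and XML namespace below are examples and vary by dump version:

    import bz2
    import xml.etree.ElementTree as ET

    # Namespace used by recent pages-articles dumps; check your dump's <mediawiki> tag.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    def iter_articles(dump_path):
        """Stream (title, wikitext) pairs out of a bz2-compressed MediaWiki XML dump."""
        with bz2.open(dump_path, "rb") as f:
            for _, elem in ET.iterparse(f):
                if elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    text = elem.findtext(NS + "revision/" + NS + "text") or ""
                    yield title, text
                    elem.clear()  # keep memory flat while streaming

    if __name__ == "__main__":
        # Example filename; any pages-articles subset from dumps.wikimedia.org works.
        for title, text in iter_articles("enwiki-latest-pages-articles1.xml.bz2"):
            # This is where the semantic parser under test would run on `text`.
            print(title, len(text))

Streaming with iterparse and clearing each page element keeps the whole thing in constant memory, so even a large subset can be walked on an ordinary workstation.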