A bit off-topic, but how about having Nutch grab some content and index it in Solr?
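
On the hands-on question quoted below: the poster-board exercise could be mirrored on screen once the kids have done it with scissors. A minimal Python sketch, assuming a plain lowercase-and-split tokenizer; the function names and the sample sentences are made up for illustration, and this is not how Solr's own analysis chain is implemented:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and cut it into word tokens (the 'scissors' step)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_term_dictionary(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    # Sorting the terms yields the lexicographically ordered term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical "poster board" documents, one sentence each.
posters = {
    1: "The quick brown fox",
    2: "The lazy dog and the quick cat",
}

term_dictionary = build_term_dictionary(posters)
for term, doc_ids in term_dictionary.items():
    print(term, doc_ids)
```

Swapping in a different tokenize() (for example one that strips a trailing "s") would let the class compare how the dictionary changes, along the lines of Chris's suggestion.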

On Wednesday 24 March 2010 16:08:43 Christopher Laux wrote:
> Hi Erik,
>
> I'm working on Wikipedia search and use Solr. Afaik it can't easily be
> done. The Wikipedia XML dump only provided the page title and author
> in terms of data one would search for. The rest requires parsing the
> Mediawiki markup for which there is no good one freely available
> (still writing my own). If you are happy with individual pages you
> could go with the HTML parser.
>
> For the second part of your question, why don't you let them try
> competing tokenization strategies (with and w/o stemming etc.) and
> compare?
>
> -Chris
>
> On Wed, Mar 24, 2010 at 3:40 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> > I've got a couple of questions for the community...
> >
> > * what's the simplest way to get Solr up and running with a relatively
> > richly schema'd index of a Wikipedia dump?
> >
> > What I'm looking for is something as easy as something along these lines:
> >
> > java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar
> >
> > cat wikipedia.bz2 | wikipedia_solr_indexer
> >
> > My goal is to index wikipedia in order to demonstrate search to a class
> > of middle school kids that I've volunteered to teach for a couple of
> > hours. Which brings me to my next question...
> >
> > * anyone have ideas on some basic hands-on ways of teaching search
> > engine fundamentals?
> >
> > One idea I have is to bring some actual "documents", say a poster board
> > with a sentence written largely on it, have the students physically
> > *tokenize* the document by cutting it up and lexicographically building
> > the term dictionary. Thoughts on taking it further welcome!
> >
> > Thanks all.
> >
> > Erik
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350