A bit off-topic, but how about having Nutch grab some content and index it
in Solr?

On Wednesday 24 March 2010 16:08:43 Christopher Laux wrote:
> Hi Erik,
> 
> I'm working on Wikipedia search and use Solr. Afaik it can't easily be
> done. The Wikipedia XML dump only provides the page title and author
> as ready-made searchable fields. The rest requires parsing the
> MediaWiki markup, for which there is no good parser freely available
> (I'm still writing my own). If you are happy with individual pages you
> could go with the HTML parser.
> 
> For the second part of your question, why don't you let them try
> competing tokenization strategies (with and w/o stemming etc.) and
> compare?
> 
> -Chris
> 
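Comparing tokenization strategies, as Chris suggests, could be tried on a
tiny scale before touching Solr. The Python sketch below is only an
illustration: `tokenize`, `naive_stem` and `matches` are hypothetical
helpers, not Solr or Lucene APIs, and the "stemmer" is a toy suffix
stripper rather than a real algorithm such as Porter's. It shows the same
query missing a document with plain tokens and finding it with stemmed
tokens.

```python
import re


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Toy suffix stripper (an assumption for the demo, not Porter stemming)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def matches(query, doc, stem=False):
    """Return True if every query term occurs in the document."""
    norm = naive_stem if stem else (lambda t: t)
    doc_terms = {norm(t) for t in tokenize(doc)}
    return all(norm(t) in doc_terms for t in tokenize(query))


doc = "Dogs chasing cats"
print("without stemming:", matches("dog", doc))
print("with stemming:   ", matches("dog", doc, stem=True))
```

The class could then vote on which behaviour they expected from a search
for "dog".
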
> On Wed, Mar 24, 2010 at 3:40 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> > I've got a couple of questions for the community...
> >
> >  * what's the simplest way to get Solr up and running with a relatively
> > richly schema'd index of a Wikipedia dump?
> >
> > What I'm looking for is something about as easy as this:
> >
> >  java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar
> >
> >  cat wikipedia.bz2 | wikipedia_solr_indexer
> >
> > My goal is to index wikipedia in order to demonstrate search to a class
> > of middle school kids that I've volunteered to teach for a couple of
> > hours. Which brings me to my next question...
> >
> >  * anyone have ideas on some basic hands-on ways of teaching search
> > engine fundamentals?
> >
> > One idea I have is to bring some actual "documents", say a poster board
> > with a sentence written large on it, have the students physically
> > *tokenize* the document by cutting it up and lexicographically building
> > the term dictionary.  Thoughts on taking it further welcome!
> >
> > Thanks all.
> >
> >        Erik
> 
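The poster-board exercise Erik describes maps fairly directly onto a few
lines of code, which might help to show the kids afterwards. This is a
minimal Python sketch under my own assumptions: the two "posters", the
`tokenize` helper and `build_index` are made up for illustration and have
nothing to do with how Solr or Lucene store their index internally. It cuts
each sentence into terms and builds a lexicographically sorted term
dictionary that points back to the posters containing each term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Cut a sentence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorting the keys gives the lexicographic term dictionary.
    return {term: sorted(postings[term]) for term in sorted(postings)}


# Hypothetical poster boards, one sentence each.
posters = {1: "The quick brown fox", 2: "The lazy brown dog"}
index = build_index(posters)

for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a word is then just `index.get("brown", [])`, which is the step
the students would do by hand with their pile of paper slips.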

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
