Hi Erik,

I'm working on Wikipedia search and use Solr. AFAIK it can't easily
be done. In terms of data one would actually search on, the Wikipedia
XML dump only provides the page title and author. The rest requires
parsing the MediaWiki markup, for which there is no good parser
freely available (I'm still writing my own). If you are happy with
individual pages, you could go with an HTML parser instead.
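
If it helps, here's a rough sketch of streaming the dump with plain
JDK SAX and pulling out just the titles (it assumes you've already
bunzip2'd the file and pass its path as args[0]). The page body lives
in <text> as raw wikitext, and that's where the markup-parsing pain
starts:

  import java.io.FileInputStream;
  import javax.xml.parsers.SAXParser;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.Attributes;
  import org.xml.sax.helpers.DefaultHandler;

  // Streams a decompressed Wikipedia XML dump and prints each <title>.
  public class DumpTitles {
      public static void main(String[] args) throws Exception {
          SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
          parser.parse(new FileInputStream(args[0]), new DefaultHandler() {
              private boolean inTitle;
              private final StringBuilder buf = new StringBuilder();

              public void startElement(String uri, String local,
                                       String qName, Attributes atts) {
                  // Start buffering when a <title> element opens.
                  if ("title".equals(qName)) { inTitle = true; buf.setLength(0); }
              }

              public void characters(char[] ch, int start, int len) {
                  if (inTitle) buf.append(ch, start, len);
              }

              public void endElement(String uri, String local, String qName) {
                  if ("title".equals(qName)) { inTitle = false; System.out.println(buf); }
              }
          });
      }
  }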

For the second part of your question, why not let them try competing
tokenization strategies (with and without stemming, etc.) and compare
the results?
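
E.g. in schema.xml you could index the same source field two ways,
once stemmed and once not (the field names here are made up), and let
them fire the same query at both:

  <fieldType name="text_plain" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_stemmed" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="body_plain" type="text_plain" indexed="true" stored="false"/>
  <field name="body_stemmed" type="text_stemmed" indexed="true" stored="false"/>
  <copyField source="body" dest="body_plain"/>
  <copyField source="body" dest="body_stemmed"/>

The analysis page in the Solr admin UI is also nicely visual for
this: it shows step by step what each tokenizer and filter does to a
sentence the kids type in.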

-Chris


On Wed, Mar 24, 2010 at 3:40 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> I've got a couple of questions for the community...
>
>  * what's the simplest way to get Solr up and running with a relatively
> richly schema'd index of a Wikipedia dump?
>
> What I'm looking for is something along these lines:
>
>  java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar
>
>  cat wikipedia.bz2 | wikipedia_solr_indexer
>
> My goal is to index wikipedia in order to demonstrate search to a class of
> middle school kids that I've volunteered to teach for a couple of hours.
>  Which brings me to my next question...
>
>  * anyone have ideas on some basic hands-on ways of teaching search engine
> fundamentals?
>
> One idea I have is to bring some actual "documents", say a poster board with
> a sentence written large on it, and have the students physically *tokenize*
> the document by cutting it up and lexicographically building the term
> dictionary.  Thoughts on taking it further welcome!
>
> Thanks all.
>
>        Erik
>
>
