On 03/24/2010 10:40 AM, Erik Hatcher wrote:
I've got a couple of questions for the community...

* what's the simplest way to get Solr up and running with a relatively richly schema'd index of a Wikipedia dump?

What I'm looking for is something as easy as this:

  java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar

  cat wikipedia.bz2 | wikipedia_solr_indexer

My goal is to index wikipedia in order to demonstrate search to a class of middle school kids that I've volunteered to teach for a couple of hours. Which brings me to my next question...

* anyone have ideas on some basic hands-on ways of teaching search engine fundamentals?

One idea I have is to bring some actual "documents", say a poster board with a sentence written large on it, and have the students physically *tokenize* the document by cutting it up and lexicographically building the term dictionary. Thoughts on taking it further welcome!
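
For anyone who wants an on-screen counterpart to the poster-board exercise, here is a minimal sketch in Python. It is only an illustration of the idea, not how Lucene or Solr do it internally; the names tokenize, build_term_dictionary and posters are made up for this example, and the tokenizer simply lowercases and splits on anything that is not a letter or digit.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and cut it into word tokens (the 'scissors' step)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_term_dictionary(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    # Sort the terms lexicographically, as the students would on the table.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Two hypothetical "poster boards".
posters = {
    1: "The quick brown fox",
    2: "The lazy dog",
}

index = build_term_dictionary(posters)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a word then amounts to finding it in the sorted dictionary and reading off the poster numbers, which the students can do by hand with the paper slips.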

Thanks all.

    Erik


For what it's worth, this is what I use. It's probably one of the fastest methods out there.

It uses embedded Solr and multiple threads to process either an expanded wiki dump or a bz2-compressed dump.

Simply apply the following patch to Solr trunk: http://pastebin.com/raw.php?i=Q5PR261W

Then add the commons-compress jar to solr/lib: http://mirrors.axint.net/apache/commons/compress/binaries/commons-compress-1.0-bin.zip

Then run it with ant, specifying the wiki dump (such as the ones you can get here: http://download.wikimedia.org/enwiki/20100312/):

ant wikipedia -Dwiki-file=/home/mark/wikidumps/enwiki-latest-pages-articles.xml.bz2

Other properties you can pass:

-Dnum-docs=300 : defaults to 10000 - use max integer (or just something really high) to process the whole file
-Dnum-threads=2 : defaults to number of processors/cores - 1
-Dsolr.home={solrhomepath} : defaults to example/solr


This processes the wiki dump in the same manner as the Lucene benchmark contrib, so the schema is not super deep: text, title, date, and one or two other fields, I think. More could be added, though I don't think anything else is easy pickings.
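
To give a rough idea of what extracting those fields involves, here is a small Python sketch that streams a bz2-compressed dump and pulls out title, date and text for each page. It is not the patch above and does not talk to Solr; iter_pages, local_name and max_docs are names invented for this example, and the tiny SAMPLE dump is made up so the snippet runs on its own. It assumes the usual MediaWiki export layout, where each page element holds a title and a revision with a timestamp and text.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream, max_docs=10000):
    """Yield a dict with title, date and text for each <page> in the stream."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {"title": None, "date": None, "text": None}
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                doc["title"] = child.text
            elif name == "timestamp":
                doc["date"] = child.text
            elif name == "text":
                doc["text"] = child.text or ""
        yield doc
        elem.clear()  # release the parsed page so memory use stays flat
        count += 1
        if count >= max_docs:
            break


# A made-up two-page dump, compressed in memory to stand in for a real file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Apache Solr</title>
    <revision>
      <timestamp>2010-03-12T00:00:00Z</timestamp>
      <text>Solr is a search server.</text>
    </revision>
  </page>
  <page>
    <title>Apache Lucene</title>
    <revision>
      <timestamp>2010-03-11T00:00:00Z</timestamp>
      <text>Lucene is a search library.</text>
    </revision>
  </page>
</mediawiki>"""

stream = bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb")
docs = list(iter_pages(stream))
for doc in docs:
    print(doc["title"], doc["date"])
```

For a real dump, passing the file path to bz2.open(path, "rb") in place of the in-memory buffer should work the same way; each yielded dict could then be turned into a Solr document by whatever client you prefer.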

--
- Mark

http://www.lucidimagination.com

