On 03/24/2010 10:40 AM, Erik Hatcher wrote:
I've got a couple of questions for the community...
* what's the simplest way to get Solr up and running with a
relatively richly schema'd index of a Wikipedia dump?
What I'm looking for is something as easy as these two lines:
java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar
cat wikipedia.bz2 | wikipedia_solr_indexer
My goal is to index wikipedia in order to demonstrate search to a
class of middle school kids that I've volunteered to teach for a
couple of hours. Which brings me to my next question...
* anyone have ideas on some basic hands-on ways of teaching search
engine fundamentals?
One idea I have is to bring some actual "documents", say a poster
board with a sentence written on it in large letters, and have the
students physically *tokenize* the document by cutting it up, then
build the term dictionary by sorting the pieces lexicographically.
Thoughts on taking it further welcome!
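The poster-board exercise maps closely onto a few lines of code. What follows is only a minimal sketch, assuming a naive lowercase, letters-and-digits tokenizer; the poster sentences and the names `tokenize` and `build_index` are made up for illustration and are not how Lucene or Solr analyze text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and cut it into word tokens (the scissors step)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorting the terms yields the lexicographically ordered term dictionary.
    return {term: sorted(postings[term]) for term in sorted(postings)}


# Hypothetical poster boards, one sentence each.
posters = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick brown dogs",
}

index = build_index(posters)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a word then amounts to finding it in the sorted dictionary and reading off the poster numbers, which students can also do by hand with the cut-up pieces.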
Thanks all.
Erik
For what it's worth, this is what I use. It's probably one of the fastest
methods out there.
It uses embedded Solr and multiple threads to process either an expanded
wiki dump, or a bz2 compressed dump.
Simply apply the following patch to Solr trunk:
http://pastebin.com/raw.php?i=Q5PR261W
And add the commons-compress jar to solr/lib:
http://mirrors.axint.net/apache/commons/compress/binaries/commons-compress-1.0-bin.zip
Then run with ant by specifying the wikidump (like what you can get
here: http://download.wikimedia.org/enwiki/20100312/)
ant wikipedia -Dwiki-file=/home/mark/wikidumps/enwiki-latest-pages-articles.xml.bz2
Other properties you can pass:
-Dnum-docs=300 : defaults to 10000 - use max integer (or just something
really high) to process the whole file
-Dnum-threads=2 : defaults to number of processors/cores - 1
-Dsolr.home={solrhomepath} : defaults to example/solr
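If you want to script the run, the properties above can be assembled programmatically. This is just a rough sketch: the helper `build_ant_command` is hypothetical, it only builds the argument list, and it assumes "max integer" means Java's Integer.MAX_VALUE. Any property left out falls back to the default listed above.

```python
JAVA_MAX_INT = 2**31 - 1  # assumed meaning of "max integer": Java's Integer.MAX_VALUE


def build_ant_command(wiki_file, num_docs=None, num_threads=None, solr_home=None):
    """Assemble the argument list for the `ant wikipedia` target.

    Properties left as None are omitted, so the target's own defaults apply.
    """
    cmd = ["ant", "wikipedia", f"-Dwiki-file={wiki_file}"]
    if num_docs is not None:
        cmd.append(f"-Dnum-docs={num_docs}")
    if num_threads is not None:
        cmd.append(f"-Dnum-threads={num_threads}")
    if solr_home is not None:
        cmd.append(f"-Dsolr.home={solr_home}")
    return cmd


# Process the whole dump with two threads; solr.home keeps its default.
cmd = build_ant_command(
    "/home/mark/wikidumps/enwiki-latest-pages-articles.xml.bz2",
    num_docs=JAVA_MAX_INT,
    num_threads=2,
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` from a patched Solr trunk checkout.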
This processes the wiki dump in the same manner as the Lucene benchmark
contrib, so the extraction is not super deep: fields like text, title,
date, and one or two others, I think. More could be added, though I
don't think anything else is easy pickings.
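To give a feel for that kind of shallow extraction, here is a small sketch that streams pages out of a dump and keeps only title, date, and text. It is not the code from the patch or from the Lucene benchmark contrib; the function names and the sample XML are made up for illustration, and the element names (`page`, `title`, `timestamp`, `text`) are assumed from the MediaWiki export format.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict with title, date, and text for each <page> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {"title": None, "date": None, "text": None}
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                doc["title"] = node.text
            elif name == "timestamp":
                doc["date"] = node.text
            elif name == "text":
                doc["text"] = node.text or ""
        yield doc
        elem.clear()  # drop the page's children once it has been consumed


def open_dump(path):
    """Open a dump file, decompressing on the fly when it ends in .bz2."""
    return bz2.open(path, "rb") if path.endswith(".bz2") else open(path, "rb")


# Made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Apache Solr</title>
    <revision>
      <timestamp>2010-03-12T00:00:00Z</timestamp>
      <text>Solr is a search server.</text>
    </revision>
  </page>
</mediawiki>"""

docs = list(iter_pages(io.BytesIO(SAMPLE)))
print(docs)
```

For a real dump, `iter_pages(open_dump(path))` would stream pages one at a time, and each dict could then be turned into a Solr document.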
--
- Mark
http://www.lucidimagination.com