Internet Archive on Monday afternoon switched over to SOLR!
We converted from a badly deteriorating "home grown" server that
was made up of java + jetty ( + rsync for replication) + an older
version of lucene.
I make some comparisons of SOLR vs. "prior" using "[]" notes below.
I parsed 2 days worth of SOLR logs to determine:
Max queries/sec: 8.8
Avg queries/sec: 5.4
Number (re)indexed / day: 3372
Index size: 1.1gb [vs. 26gb]
Number of document fields searched on a quoted unqualified query:
5 [vs. 677] *
Horsepower:
one 4gb RAM dual core cpu
[vs. three 4gb RAM dual core cpu (readers) and one 8gb RAM 2 dual
core cpus (writer)]
Solr hardly touches our disks, load avg stays around 0.5, typically.
"sar" shows we average 85% idle!
Solr seems quicker to respond, overall, and much more stable.
We can reindex our entire set of 575K items in about 2 hours
(where we are limited more by the "crawling" of our 190 servers for
XML than Solr).
With our current configuration, we can show index changes on our live
site in < 15 minutes
(compared to our last SE which could take 4+ hours)
Related to above point, we commit every 15 minutes; we optimize
once/day late at night.
* To be fair, Michael StAck (our greatest help for prior SE "life
support")
has smartly pointed out that by making a smarter schema and strategy,
I could
reduce the number of fields searched from 677 to 5, with the same overall
functionality. 677 fields search on most queries was surely part of
bucket
of nails in the coffin of our prior SE.
[Some information and configuration]
We've done essentially no optimizing outside of focusing on a "smart"
schema.
We do query-time boosting (more on that follows).
We (presently) do not use replication.
We do (server-side) XSLT of output into our prior SE's XML format.
We don't use DisMax and (as of now) do not use faceting.
We override defaultOperator of "OR" to "AND".
We increased our commitLockTimeout to 5 minutes, and unlockOnStartup.
We useCompoundFile (for the index).
External to Solr, we use XSLT to transform our item XML into a
post-able form for Solr to (re)index.
And finally, the hardest part to convert to Solr.
I had to write a PHP front-end custom converter to take our query strings,
parse the clauses and lucene syntax into pieces, and "expand" clauses
where
they were not searching a specific field to expand it to our
query-time boosting.
Eg: if someone were to look for "tracey pooh" on our site, we expand
it to:
(title:"tracey pooh"^100 OR description:"tracey pooh"^15 OR
collection:"tracey pooh"^10 OR language:"tracey pooh"^10 OR text:"tracey
pooh"^1)
(but 'creator:"tracey pooh"' would pass to SOLR as is).
Lastly, a feelgood. All of Internet Archive's written code is opensource,
as is *all* the third-party code we use!
So go SOLR and thank you SO much for keeping it open, keeping it real,
and for *saving our site*!
Thanks for the great mail list and all the continual work, updating,
and thinking the Solr
team continues to do. We have all been greatly impressed by this
project and it has worked out
better than we had hoped!
--
* --Tracey Jaquith - http://www.archive.org/~tracey
<http://www.archive.org/%7Etracey> --*