I've been investigating Solr on and off as a (or even the) search
solution for my employer's content management solution. One of the
biggest questions in my mind at this point is which version to go
with. In general, 1.4 would seem the obvious choice, as it's the only
released version on that list. There's a commercially supported distro
from Lucid, and things should presumably be pretty stable.

What led me down the rabbit hole is that a) we generally have quite a
lot of business documents to index (Word and PDF, mostly), and b) the
"pull" approach implemented in the DataImportHandler is much more
attractive in our architecture than the "push" model we'd otherwise
have to construct. Unfortunately, the TikaEntityProcessor and the
binary data sources on which it depends were added after 1.4 was
released.

Back in early March, I was able to get things up and running with a
1.5 nightly (and Tika 0.7-snapshot), but since then the course of Solr
development has... changed significantly. The 1.5 branch has been
abandoned, and (to my uninformed eye) it seems that there's a lot of
upheaval in the trunk as things merge with Lucene. And it also appears
that the released Tika 0.7 might not be compatible with Solr? (Judging
by SOLR-1902, that is.)

What I'm looking for is some advice on what course to pursue:
- Plunge ahead with the trunk, and hope that things stabilize within
the next few months, when we'd be hoping to go live on one of our
biggest client sites.
- Go with the last 1.5 code, knowing that the features we want are in
there, and hope we don't run into anything majorly broken.
- Stick with 1.4, and just accept that we'll need to push content to
the HTTP interface.

I don't expect a definitive answer, of course, but I'd like to be
better informed about the risks and benefits.

Also: does anyone have a sense of whether it'd be possible to
back-port the TikaEntityProcessor stuff to 1.4?

Sixten
