I've been investigating Solr on and off as a (or even the) search solution for my employer's content management solution. One of the biggest questions in my mind at this point is which version to go with. In general, 1.4 would seem the obvious choice, as it's the only released version on that list. There's a commercially supported distro from Lucid, and things should presumably be pretty stable.
What led me down the rabbit hole is that a) we generally have quite a lot of business documents to index (Word and PDF, mostly), and b) the "pull" approach implemented in the DataImportHandler is much more attractive in our architecture than the "push" model we'd otherwise have to construct. Unfortunately, the TikaEntityProcessor and the binary data sources on which it depends were added after 1.4 was released.

Back in early March, I was able to get things up and running with a 1.5 nightly (and Tika 0.7-snapshot), but since then the course of Solr development has... changed significantly. The 1.5 branch has been abandoned, and (to my uninformed eye) it seems that there's a lot of upheaval in the trunk as things merge with Lucene. And it also appears that the released Tika 0.7 might not be compatible with Solr? (Judging by SOLR-1902, that is.)

What I'm looking for is some advice on what course to pursue:

- Plunge ahead with the trunk, and hope that things stabilize within the next few months, when we'd be hoping to go live on one of our biggest client sites.
- Go with the last 1.5 code, knowing that the features we want are in there, and hope we don't run into anything majorly broken.
- Stick with 1.4, and just accept having to push content to the HTTP interface.

I don't expect a definitive answer, of course, but I'd like to be better informed about the risks and benefits.

Also: does anyone have a sense of whether it'd be possible to back-port the TikaEntityProcessor stuff to 1.4?

Sixten