: > You could do it in Solr. The difficulty is that arbitrary binary data
: > is not easily transferred via xml. So you must specify that the input
: > is in base64 or some other encoding. Then you could decode it on the
: > fly using a custom Analyzer before passing it along.
:
: Why won't cdata work?

Because your binary data might contain the byte sequence 0x5D 0x5D 0x3E
("]]>"), which indicates the end of the CDATA section. CDATA is short for
"Character DATA" -- you can't put arbitrary binary data in it (or even
arbitrary text) and be sure that it will work.

: > It might be easier to do this outside of solr, but still in a
: > centralized manner. Write another webapp which accepts files. It
: > will decode them appropriately and pass them along to the solr
: > instance in the same container. Then your client don't even need to
: > know how to talk to solr.
:
: In that case there's little point in using Solr at all - the main
: benefit it gives me is that I don't have to write all the HTTP protocol
: bits. If I have to do that myself I might as well use raw Lucene - and
: in fact that's how the existing system works.

For your purposes, if you've got a system that works and does the
Document conversion for you, then you are probably right: Solr may not
be a useful addition to your architecture. Solr doesn't really attempt
to solve the problem of parsing different kinds of data streams into a
unified Document model -- it just tries to expose all of the Lucene
goodness through an easy to use, easy to configure HTTP interface.
Besides the configuration, Solr's other value adds are its IndexReader
management, its caching, and its plugin support for mixing and matching
request handlers, output writers, and field types as easily as you can
mix and match Analyzers.

There has been some discussion about adding plugin support for the
"update" side of things as well -- at a very simple level this could
allow for messages to be sent via JSON or CSV instead of just XML --
but there's no reason a more complex update plugin couldn't read in a
binary PDF file and parse it into its appropriate fields ... but we
aren't quite there yet. Feel free to bring this up on solr-dev if you'd
be interested in working on it.

-Hoss
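
As an illustration of the CDATA point in this thread, here is a minimal
Python sketch (not Solr code) that embeds a payload in XML two ways and
checks whether it survives a parse. The wrapper element and all function
names are assumptions made up for the example; a real Solr update message
has its own format.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical wrapper element, used only for this illustration.
OPEN_TAG = b'<field name="content">'
CLOSE_TAG = b"</field>"


def wrap_in_cdata(payload: bytes) -> bytes:
    """Embed the payload verbatim inside a CDATA section."""
    return OPEN_TAG + b"<![CDATA[" + payload + b"]]>" + CLOSE_TAG


def wrap_in_base64(payload: bytes) -> bytes:
    """Embed the payload as base64 text, which never contains ']' or '>'."""
    return OPEN_TAG + base64.b64encode(payload) + CLOSE_TAG


def cdata_round_trips(payload: bytes) -> bool:
    """True only if the XML parses and yields exactly the original text."""
    try:
        element = ET.fromstring(wrap_in_cdata(payload))
    except ET.ParseError:
        return False
    return (element.text or "") == payload.decode("latin-1")


def base64_round_trips(payload: bytes) -> bool:
    """True if decoding the parsed base64 text gives back the original bytes."""
    element = ET.fromstring(wrap_in_base64(payload))
    return base64.b64decode(element.text or "") == payload


if __name__ == "__main__":
    # A payload that happens to contain 0x5D 0x5D 0x3E, i.e. "]]>".
    tricky = b"PDF" + bytes([0x5D, 0x5D, 0x3E]) + b"data"
    print("CDATA survives: ", cdata_round_trips(tricky))
    print("base64 survives:", base64_round_trips(tricky))
```

The base64 variant costs roughly a third more bytes on the wire, but its
alphabet cannot collide with XML markup, so the receiving side can decode
it without guessing where the payload ends.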