Re: Handling disparate data sources in Solr

Chris Hostetter Sat, 23 Dec 2006 15:38:50 -0800

: > You could do it in Solr.  The difficulty is that arbitrary binary data
: > is not easily transferred via xml.  So you must specify that the input
: > is in base64 or some other encoding.  Then you could decode it on the
: > fly using a custom Analyzer before passing it along.
:
: Why won't cdata work?


Because your binary data might contain the byte sequence 0x5D 0x5D 0x3E
("]]>"), which marks the end of the CDATA section. CDATA is short for
"Character DATA" -- you can't put arbitrary binary data in it (or even
arbitrary text) and be sure that it will work.
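
To make that concrete, here is a minimal sketch (Python standard library
only; the element name "field" and both helper function names are made up
for illustration and are not part of Solr). It wraps some text in a CDATA
section and checks whether a parser gives back what went in, then does the
same with base64-encoded bytes:

```python
import base64
import xml.etree.ElementTree as ET


def cdata_round_trips(text: str) -> bool:
    """Wrap text in a CDATA section and report whether it survives parsing."""
    doc = "<field><![CDATA[" + text + "]]></field>"
    try:
        return ET.fromstring(doc).text == text
    except ET.ParseError:
        # The parser rejected the document outright.
        return False


def base64_round_trips(data: bytes) -> bool:
    """Base64-encode bytes, embed them as element text, and decode after parsing."""
    encoded = base64.b64encode(data).decode("ascii")
    doc = "<field>" + encoded + "</field>"
    parsed_text = ET.fromstring(doc).text or ""
    return base64.b64decode(parsed_text) == data


if __name__ == "__main__":
    print(cdata_round_trips("plain text"))    # survives
    print(cdata_round_trips("before]]>after"))  # does not survive
    # Arbitrary bytes, including 0x5D 0x5D 0x3E, are fine once base64-encoded.
    print(base64_round_trips(bytes([0x00, 0xFF, 0x5D, 0x5D, 0x3E, 0x10])))
```

The base64 alphabet contains no characters that are special to XML, which
is why the encoded form is safe to send, at the cost of roughly a third
more bytes on the wire.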

: > It might be easier to do this outside of solr, but still in a
: > centralized manner.  Write another webapp which accepts files.   It
: > will decode them appropriately and pass them along to the solr
: > instance in the same container.  Then your clients don't even need to
: > know how to talk to solr.
:
: In that case there's little point in using Solr at all - the main
: benefit it gives me is that I don't have to write all the HTTP protocol
: bits.  If I have to do that myself I might as well use raw Lucene - and
: in fact that's how the existing system works.

For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture.  Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams into a unified Document
model -- it just tries to expose all of the Lucene goodness through an
easy to use, easy to configure HTTP interface.  Besides the
configuration, Solr's other value-adds are its IndexReader management,
its caching, and its plugin support for mixing and matching request
handlers, output writers, and field types as easily as you can mix and
match Analyzers.

There has been some discussion about adding plugin support for the
"update" side of things as well -- at a very simple level this could allow
for messages to be sent via JSON or CSV instead of just XML -- and
there's no reason a more complex update plugin couldn't read in a binary
PDF file and parse it into its appropriate fields ... but we aren't
quite there yet.  Feel free to bring this up on solr-dev if you'd be
interested in working on it.
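
Until something like that exists, the conversion has to happen on the
client side.  As a rough sketch of the simple CSV case (Python standard
library only; the function name and the sample column names are
hypothetical, and this is not an existing Solr feature), a client could
turn CSV rows into the XML <add> message that Solr's update handler
already accepts:

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_solr_add_xml(csv_text: str) -> str:
    """Convert CSV text (header row = field names) into a Solr <add> message."""
    reader = csv.DictReader(io.StringIO(csv_text))
    add = ET.Element("add")
    for row in reader:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            # Skip cells from ragged rows (missing values or extra columns).
            if name is None or value is None:
                continue
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # "id" and "title" are example column names, assumed to exist in the schema.
    print(csv_to_solr_add_xml("id,title\n1,Hello & goodbye\n"))
```

The resulting string would then be POSTed to the update URL as usual; the
field names have to match whatever the schema defines.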


-Hoss
