Re: Handling disparate data sources in Solr

Alan Burlison Sun, 24 Dec 2006 05:42:02 -0800

Chris Hostetter wrote:

: Why won't cdata work?


because your binary data might contain the byte sequence: 0x5D 0x5D 0x3E --
indicating the end of the CDATA section. CDATA is short for "Character
DATA" -- you can't put arbitrary binary data in it (or even arbitrary text)
and be sure that it will work.

Ok, so I have to escape ]]> if it occurs. If I do that, why won't it work?
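
For illustration, here is a minimal Python sketch of the two halves of this question. It assumes a hypothetical field named "body" and uses only the standard library; the helper names cdata_wrap and field_xml are made up for the example. A "]]>" inside CDATA cannot be escaped as such, but it can be split across two adjacent CDATA sections. That still leaves bytes such as 0x01, which are not legal characters anywhere in an XML 1.0 document, so the sketch falls back to base64 for genuinely binary payloads.

```python
import base64
import xml.etree.ElementTree as ET


def cdata_wrap(text: str) -> str:
    """Wrap text in CDATA, splitting any ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def field_xml(name: str, payload: str) -> str:
    """Build a single <field> element (hypothetical helper for this sketch)."""
    return f'<field name="{name}">{payload}</field>'


# Plain text containing ']]>' survives once the sequence is split.
text = "a ]]> b"
parsed_text = ET.fromstring(field_xml("body", cdata_wrap(text))).text

# Binary data is a different problem: control bytes such as 0x01 are not
# valid XML characters, inside CDATA or out of it.
raw = b"\x01\x02]]>\xff"
try:
    ET.fromstring(field_xml("body", cdata_wrap(raw.decode("latin-1"))))
    raw_parsed = True
except ET.ParseError:
    raw_parsed = False

# Base64 keeps the payload inside the XML-safe ASCII range.
encoded = base64.b64encode(raw).decode("ascii")
roundtrip = base64.b64decode(ET.fromstring(field_xml("body", encoded)).text)
```

The receiving side has to know the field is base64 and decode it, so this only helps if whatever consumes the document is prepared for that.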

For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture.  Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams into a unified Document
module -- it just tries to expose all of the Lucene goodness through an
easy to use, easy to configure, HTTP interface.  Besides the
configuration, Solr's other means of being a value add is in its
IndexReader management, its caching, and its plugin support for mixing
and matching request handlers, output writers, and field types as easily
as you can mix and match Analyzers.


Yes, it's all the crunchy goodness that I'm interested in ;-)

There has been some discussion about adding plugin support for the
"update" side of things as well -- at a very simple level this could allow
for messages to be sent via JSON, or CSV instead of just XML -- but
there's no reason a more complex update plugin couldn't read in a binary PDF
file and parse it into its appropriate fields ... but we aren't
quite there yet.  Feel free to bring this up on solr-dev if you'd be
interested in working on it.
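
Until something like that exists on the server side, the same effect can be had on the client. The following is a rough Python sketch, not anything Solr ships: it assumes a CSV file whose header row names the fields, and the function name csv_to_add_xml is invented for the example. It turns each row into a <doc> inside Solr's XML <add> message.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_add_xml(csv_text: str) -> str:
    """Turn CSV rows (header row = field names) into a Solr <add> message."""
    add = ET.Element("add")
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if value:  # skip empty cells rather than sending empty fields
                field = ET.SubElement(doc, "field", name=name)
                field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample data; the field names must match your schema.
sample = "id,title\n1,Hello & goodbye\n2,Second doc\n"
message = csv_to_add_xml(sample)
```

The resulting string would then be POSTed to the update URL of your Solr instance. The URL and the field names depend on your installation and schema, so they are left out here.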

Hmm. That's a possibility. It all depends on the time tradeoff between fixing what we have already to make it reusable versus extending Solr.


--
Alan Burlison
--
