On 12/22/06, Alan Burlison <[EMAIL PROTECTED]> wrote:

At present the content of the Lucene index comes from many different
sources (web pages, documents, blog posts etc.) and can be in different
formats (plaintext, HTML, PDF etc.).  All the various content types are
rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say
"content") may have come from one of a number of source document types.
I'm having difficulty understanding how I might map this functionality
onto Solr.  I understand how (for example) I could use
HTMLStripStandardTokenizer to insert the contents of an HTML document
into a field called "content", but (assuming I'd written a PDF analyser)
how would I insert the content of a PDF document into the same "content"
field?

You could do it in Solr.  The difficulty is that arbitrary binary data
is not easily transferred via XML, so you would need to send the input
as base64 or some other text-safe encoding.  You could then decode it
on the fly using a custom Analyzer before passing it along.
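
Roughly, that Analyzer approach could look like the sketch below.  This
assumes the Lucene 2.x-era Analyzer API of the time (where tokenStream()
is the method you override) and uses java.util.Base64 purely for
illustration; the extractPlainText() helper is hypothetical and stands
in for whatever PDF/HTML-to-text conversion you plug in.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Base64;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch only: expects the field value to be a base64-encoded document,
// decodes it, converts it to plaintext, then tokenizes it normally.
public class Base64DecodingAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        try {
            // Read the whole base64-encoded field value.
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            byte[] raw = Base64.getMimeDecoder().decode(sb.toString());

            // Hypothetical hook: turn the raw bytes (PDF, HTML, ...) into plaintext.
            String text = extractPlainText(raw);

            // Hand the plaintext off to a stock analyzer for tokenizing.
            return delegate.tokenStream(fieldName, new StringReader(text));
        } catch (IOException e) {
            throw new RuntimeException("Could not read field value", e);
        }
    }

    // Placeholder: in practice you would call a PDF/HTML extraction library here.
    private String extractPlainText(byte[] raw) {
        return new String(raw);
    }
}

The update message still carries the base64 string as the field value,
so this only gets the binary data through the XML safely; the text
extraction work itself doesn't go away.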

It might be easier to do this outside of Solr, but still in a
centralized manner.  Write another webapp which accepts files.  It
would decode them appropriately and pass them along to the Solr
instance in the same container.  Then your clients don't even need to
know how to talk to Solr.
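
A minimal sketch of that kind of front-end webapp is below.  The update
URL (http://localhost:8983/solr/update) and the extractText() helper
are assumptions; the latter is a placeholder for your real
PDF/HTML-to-text code.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch only: accepts a raw document in the POST body, converts it to
// plaintext, and forwards it to Solr as an XML <add> request.
public class ExtractAndIndexServlet extends HttpServlet {

    private static final String SOLR_UPDATE_URL = "http://localhost:8983/solr/update";

    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Read the uploaded document (PDF, HTML, ...) from the request body.
        byte[] raw = readAll(req.getInputStream());

        // Hypothetical hook: convert the raw bytes to plaintext.
        String content = extractText(raw, req.getContentType());

        // Build a Solr XML update message and POST it to the update handler.
        String xml = "<add><doc>"
                + "<field name=\"id\">" + escape(req.getParameter("id")) + "</field>"
                + "<field name=\"content\">" + escape(content) + "</field>"
                + "</doc></add>";
        postToSolr(xml);
        resp.setStatus(HttpServletResponse.SC_OK);
    }

    private void postToSolr(String xml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(SOLR_UPDATE_URL).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        if (conn.getResponseCode() != 200) {
            throw new IOException("Solr update failed: " + conn.getResponseCode());
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    // Placeholder: plug in a real PDF/HTML extraction library here.
    private String extractText(byte[] raw, String contentType) {
        return new String(raw);
    }

    // Escape characters that are special in XML.
    private static String escape(String s) {
        return s == null ? "" : s.replace("&", "&amp;")
                                 .replace("<", "&lt;")
                                 .replace(">", "&gt;");
    }
}

You would still need to send a <commit/> to the update handler at some
point for the new documents to become searchable.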

-Mike
