Original problem statement:
----------
I'm considering using Solr to replace an existing bare-metal Lucene
deployment - the current Lucene setup is embedded inside an existing
monolithic webapp, and I want to factor out the search functionality
into a separate webapp so it can be reused more easily.
At present the content of the Lucene index comes from many different
sources (web pages, documents, blog posts etc.) and can be in different
formats (plaintext, HTML, PDF etc.). All the various content types are
rendered to plaintext before being inserted into the Lucene index.
The net result is that the data in one field in the index (say
"content") may have come from one of a number of source document types.
I'm having difficulty understanding how I might map this functionality
onto Solr. I understand how (for example) I could use
HTMLStripStandardTokenizer to insert the contents of an HTML document
into a field called "content", but (assuming I'd written a PDF analyser)
how would I insert the content of a PDF document into the same "content"
field?
I know I could do this by preprocessing the various document types to
plaintext in the various Solr clients before inserting the data into the
index, but that means that each client would need to know how to do the
document transformation. As well as centralising the index, I also want
to centralise the handling of the different document types.
----------
My initial suggestion, to get the discussion started, is to extend the
<doc> and <field> elements with the following attributes:

mime-type
    Mime type of the document, e.g. application/pdf, text/html and so on.

encoding
    Encoding of the document, with base64 being the standard implementation.

href
    The URL of any documents that can be accessed over HTTP, instead of
    embedding them in the indexing request. The indexer would fetch the
    document using the specified URL.

There would then be entries in the configuration file that map each MIME
type to a handler that is capable of dealing with that document type.
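
To make the mapping idea a bit more concrete, here is a rough Java sketch of
the dispatch step. None of this is existing Solr API: the class
MimeDispatchSketch, the ContentHandler interface and the register/extract
methods are all names I have made up for illustration, and the HTML handler is
a deliberately naive tag stripper standing in for a real one. In a real
implementation the register() calls would be driven by the configuration file
entries rather than hard-coded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps the proposed mime-type attribute to a handler
// that renders the document to plain text before it is indexed.
class MimeDispatchSketch {

    // Hypothetical handler contract: raw document bytes in, plain text out.
    interface ContentHandler {
        String toPlainText(byte[] raw);
    }

    private final Map<String, ContentHandler> handlers = new HashMap<>();

    // Stands in for one "MIME type -> handler" entry in the configuration file.
    void register(String mimeType, ContentHandler handler) {
        handlers.put(mimeType.toLowerCase(Locale.ROOT), handler);
    }

    // Decodes the field body according to the proposed encoding attribute,
    // then hands the bytes to whichever handler is registered for the type.
    String extract(String mimeType, String encoding, String body) {
        ContentHandler handler = handlers.get(mimeType.toLowerCase(Locale.ROOT));
        if (handler == null) {
            throw new IllegalArgumentException("No handler for " + mimeType);
        }
        byte[] raw = "base64".equalsIgnoreCase(encoding)
                ? Base64.getDecoder().decode(body)
                : body.getBytes(StandardCharsets.UTF_8);
        return handler.toPlainText(raw);
    }

    public static void main(String[] args) {
        MimeDispatchSketch dispatcher = new MimeDispatchSketch();
        dispatcher.register("text/plain",
                raw -> new String(raw, StandardCharsets.UTF_8));
        // Naive placeholder: strips anything that looks like a tag.
        dispatcher.register("text/html",
                raw -> new String(raw, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]*>", "").trim());

        String html = "<p>Hello <b>world</b></p>";
        String encoded = Base64.getEncoder()
                .encodeToString(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(dispatcher.extract("text/html", "base64", encoded));
    }
}
```

A PDF handler would be registered the same way under application/pdf, so the
clients only need to supply the MIME type and the (encoded) bytes. Fetching
via href is left out here; it would just be another way of obtaining the raw
bytes before the handler lookup.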
Thoughts?
--
Alan Burlison
--