Hi,
I'm considering using Solr to replace a bare-metal Lucene deployment.
The current Lucene code is embedded inside an existing monolithic
webapp, and I want to factor the search functionality out into a
separate webapp so it can be reused more easily.
At present the content of the Lucene index comes from many different
sources (web pages, documents, blog posts etc.) and can be in different
formats (plaintext, HTML, PDF etc.). All the various content types are
rendered to plaintext before being inserted into the Lucene index.
The net result is that the data in one field of the index (say
"content") may have come from any of a number of source document types.
I'm having difficulty understanding how I might map this functionality
onto Solr. I understand how (for example) I could use
HTMLStripStandardTokenizer to insert the contents of an HTML document
into a field called "content", but (assuming I'd written a PDF analyser)
how would I insert the content of a PDF document into that same
"content" field?
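To make that concrete, I'm picturing something along these lines in
schema.xml (the type and field names are just placeholders I've made
up):

    <fieldType name="html_text" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="content" type="html_text" indexed="true" stored="true"/>

That looks fine for HTML, but I don't see how a second, PDF-specific
analyser could feed the same "content" field.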
I know I could do this by preprocessing the various document types to
plaintext in each Solr client before inserting the data into the
index, but that would mean every client needs to know how to do the
document transformation. As well as centralising the index, I also want
to centralise the handling of the different document types.
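In other words, with that approach each client would extract the text
itself and post an update message along these lines to Solr (the id and
text are just illustrative), which is exactly the duplication I'm
trying to avoid:

    <add>
      <doc>
        <field name="id">blog-post-42</field>
        <field name="content">...plain text already extracted by the
        client...</field>
      </doc>
    </add>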
Another question:
What do "omitNorms" and "positionIncrementGap" mean in the schema.xml
file? The documentation is vague to say the least, and Google wasn't
much more helpful.
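For context, the entries I'm looking at are along these lines
(paraphrased from memory, not copied exactly from the example schema):

    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100">
      <!-- analyzer definition omitted -->
    </fieldType>

    <field name="title" type="text" indexed="true" stored="true"
           omitNorms="true"/>

and I can't tell from the comments what effect either attribute
actually has.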
Thanks,
--
Alan Burlison
--