Re: Handling disparate data sources in Solr

Otis Gospodnetic Fri, 22 Dec 2006 19:38:34 -0800

Alan,

omitNorms lets you skip field norms for certain fields when calculating the 
document matching score.  This can save you some RAM.  See 
http://issues.apache.org/jira/browse/LUCENE-448 .
For the position increment gap, have a look at 
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)
(it is described pretty well there).
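
To make the gap a bit more concrete, here is a minimal sketch in plain Java. It does not call Lucene or Solr at all; it only models how token positions could be assigned when a field holds several values and a gap is added between consecutive values. The class name, the method name and the whitespace tokenizing are all made up for illustration, and the exact position arithmetic inside Lucene may differ from this simplified model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model only: not Lucene code, just the position arithmetic.
public class PositionGapSketch {

    // Returns the position given to each whitespace-separated token when the
    // values of a multi-valued field are indexed one after another.
    static List<Integer> positions(List<String> values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;            // so that the very first token lands on position 0
        boolean first = true;
        for (String value : values) {
            if (!first) {
                pos += gap;      // extra distance inserted between two values
            }
            first = false;
            for (String tok : value.trim().split("\\s+")) {
                if (tok.isEmpty()) {
                    continue;    // ignore empty values
                }
                pos += 1;        // each token advances the position by one
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("red apple", "green pear");
        System.out.println(positions(values, 0));    // [0, 1, 2, 3]
        System.out.println(positions(values, 100));  // [0, 1, 102, 103]
    }
}
```

With a gap of 0, "apple" and "green" sit on adjacent positions, so a phrase query for "apple green" could match across the two values.  With a large gap the two values are far apart in position space, so such a phrase query would no longer match.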

I don't know the answer to your main question, though.

Otis

----- Original Message ----
From: Alan Burlison <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, December 22, 2006 7:48:47 PM
Subject: Handling disparate data sources in Solr

Hi,

I'm considering using Solr to replace an existing bare-metal Lucene 
deployment - the current Lucene setup is embedded inside an existing 
monolithic webapp, and I want to factor out the search functionality 
into a separate webapp so it can be reused more easily.

At present the content of the Lucene index comes from many different 
sources (web pages, documents, blog posts etc) and can be different 
formats (plaintext, HTML, PDF etc).  All the various content types are 
rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say 
"content") may have come from one of a number of source document types.  
I'm having difficulty understanding how I might map this functionality 
onto Solr.  I understand how (for example) I could use 
HTMLStripStandardTokenizer to insert the contents of an HTML document 
into a field called "content", but (assuming I'd written a PDF analyser) 
how would I insert the content of a PDF document into the same "content" 
field?

I know I could do this by preprocessing the various document types to 
plaintext in the various Solr clients before inserting the data into the 
index, but that means that each client would need to know how to do the 
document transformation.  As well as centralising the index, I also want 
to centralise the handling of the different document types.

Another question:

What do "omitNorms" and "positionIncrementGap" mean in the schema.xml 
file?  The documentation is vague to say the least, and Google wasn't 
much more helpful.

Thanks,

-- 
Alan Burlison