Hmmm, if I understand your question correctly, I think Lucene's
payloads are what you are after.
Lucene does support payloads (i.e., per-term storage in the index; see
the BoostingTermQuery in Lucene and the Token class's setPayload()
method). However, this doesn't do much for you in Solr as of yet
without some work on your own. I think Tricia Williams has been
working on payloads and Solr, but I don't know that anything has been
posted. The tricky part, I believe, is how to handle indexing;
integrating the BoostingTermQuery isn't all that hard, I don't
think. Also note, there isn't anything in Solr preventing the use of
payloads, but there is probably a decent amount of work to do to hook them in.
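To make the idea concrete, here is a rough sketch in plain Python (not Lucene's actual API) of what payloads enable: a payload is just a few bytes stored with each term occurrence, so a per-document weight can ride along in the postings and be folded into the score, roughly the way a BoostingTermQuery-style scorer would. The byte encoding and the data layout below are assumptions for illustration only.

```python
import struct

def encode_weight(w):
    """Pack a float weight into 4 payload bytes (assumed encoding)."""
    return struct.pack(">f", w)

def decode_weight(payload):
    """Unpack the stored payload bytes back into a float."""
    return struct.unpack(">f", payload)[0]

# term -> {doc_id: payload bytes}, i.e. postings carrying payloads
postings = {
    "cat": {"d1": encode_weight(0.99), "d2": encode_weight(0.02)},
    "dog": {"d1": encode_weight(0.42)},
}

def payload_score(query_weights):
    """Accumulate query_weight * decoded payload per matching document,
    roughly what a payload-aware term scorer would contribute."""
    scores = {}
    for term, qw in query_weights.items():
        for doc_id, payload in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + qw * decode_weight(payload)
    return scores

scores = payload_score({"cat": 0.99, "dog": 0.11})
```

With the weights from the question below, this ranks d1 well above d2, which is exactly the behavior per-term payloads buy you.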
HTH,
Grant
On Jun 5, 2008, at 4:52 PM, Andreas von Hessling wrote:
Hi there!
As a Solr newbie who has however worked with Lucene before, I have
an unusual question for the experts:
Question:
Can I, and if so, how do I perform index-time term boosting in
documents where each boost value is not the same for all documents
(no global boosting of a given term) but instead can be
per-document? In other words: I understand there's a way to specify
term boost values for search queries, but is that also possible for
indexed documents?
Here's what I'm fundamentally trying to do:
I want to index and search over documents that have a special,
associative-array-like property:
Each document has a list of unique words, and each word has a
numeric value between 0 and 1. These values express similarity in
the dimensions with this word/name. For example, "cat": 0.99 is
similar to "cat": 0.98, but not to "cat": 0.21. All documents have
the same set of words, and there are lots of them: about 1 million.
(If necessary, I can reduce the number of words to tens of
thousands, but then the documents would not share the same set of
words any more.) Most of the word values for a typical document are
0.00.
Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00
d2:
cat: 0.02
dog: 0.00
car: 0.00
Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)
The ideal result would be that q matches d1 much more than d2.
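Treating each document and the query as a sparse term-weight vector, the ranking described above amounts to a dot product over the query's terms. This is my reading of the example, not anything Solr does out of the box:

```python
q  = {"cat": 0.99, "dog": 0.11}               # car unspecified -> 0.00
d1 = {"cat": 0.99, "dog": 0.42, "car": 0.00}
d2 = {"cat": 0.02, "dog": 0.00, "car": 0.00}

def dot(query, doc):
    """Sum of query weight times document weight for each query term."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())

score_d1 = dot(q, d1)  # about 1.03
score_d2 = dot(q, d2)  # about 0.02
```

Here q matches d1 (score about 1.03) far more strongly than d2 (about 0.02), matching the desired outcome.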
Here's my analysis of my situation and potential solutions:
- Because I have so many words, I cannot use a separate field for
each word; this would overload Solr/Lucene. This is unfortunate,
because I know there is index-time boosting on a per-field basis
(reference: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f),
and because I could have used Function Queries (reference:
http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into pure
text: each numeric value would be translated into repetitions of its
word, e.g. "cat": 0.99 becomes the word "cat" repeated 99 times.
This would be done for all words in a given document, and the
resulting text would then be used for regular scoring in Solr. This
approach seems doable, but inefficient and far from elegant.
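The repeat-the-word workaround could be sketched as follows: scale each weight to a repetition count and emit the term that many times, so ordinary term-frequency scoring approximates the weights. The scale factor of 100 is an assumption matching the "cat" repeated 99 times example:

```python
def weights_to_text(weights, scale=100):
    """Turn {"cat": 0.99, ...} into "cat cat cat ..." with
    round(weight * scale) copies of each word."""
    words = []
    for term, weight in weights.items():
        words.extend([term] * int(round(weight * scale)))
    return " ".join(words)

doc_text = weights_to_text({"cat": 0.99, "dog": 0.42, "car": 0.00})
```

As the question notes, this blows up the index (each 0.99 weight becomes 99 tokens) and quantizes the weights to the chosen scale.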
Am I reinventing the wheel here, or is what I'm trying to do
something fundamentally different from what Solr and Lucene have to
offer?
Any comments highly appreciated. What can I do about this?
Thanks,
Andreas
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ