Re: An unusual question for the experts -- term boosting for individual documents?

Tricia Williams Fri, 06 Jun 2008 07:17:10 -0700

Payloads could be the answer, but I don't think there is any crossover into what I've been working on with payloads (https://issues.apache.org/jira/browse/SOLR-380 has what I last posted, which is pretty much what we're using now. I've also posted the related SOLR-532 and SOLR-522).

What you would have to do is write a custom Tokenizer or TokenFilter which takes your input, breaks it into tokens, and then adds the numeric value as a payload. Assuming your input is actually something like:

cat:0.99 dog:0.42 car:0.00

you could write a TokenFilter which builds on the WhitespaceTokenizer to break each token on ":", using the first part as the token value and the second part as the token's payload. I think the APIs are pretty clear if you are looking for help.
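The splitting step described above can be sketched outside of Lucene. The following is only an illustration in Python, not the Lucene TokenFilter API: the function names `parse_weighted_tokens` and `decode_payload` are hypothetical, and storing the value as a 4-byte big-endian float is an assumption about how one might choose to encode the payload bytes.

```python
import struct


def parse_weighted_tokens(text):
    """Split whitespace-separated 'term:weight' pairs into (term, payload) tuples.

    The payload is the weight packed as a 4-byte big-endian float
    (an assumed encoding, chosen here only for illustration).
    """
    tokens = []
    for raw in text.split():
        # Split on the last ':' so the weight is always the trailing part.
        term, sep, weight = raw.rpartition(":")
        if not sep or not term:
            raise ValueError("expected 'term:weight', got %r" % raw)
        payload = struct.pack(">f", float(weight))
        tokens.append((term, payload))
    return tokens


def decode_payload(payload):
    """Recover the float weight from a 4-byte payload."""
    return struct.unpack(">f", payload)[0]


if __name__ == "__main__":
    for term, payload in parse_weighted_tokens("cat:0.99 dog:0.42 car:0.00"):
        print(term, decode_payload(payload))
```

In a real TokenFilter the same two steps would happen per token: set the token text to the part before the ":" and attach the encoded bytes as the token's payload.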

I haven't looked at all at how you can query/boost using payloads, but if Grant says that integrating the BoostingTermQuery isn't all that hard, I would believe him.


Good Luck,
Tricia

Grant Ingersoll wrote:

Hmmm, if I understand your question correctly, I think Lucene's payloads are what you are after.
Lucene does support payloads (i.e. per-term storage in the index; see the BoostingTermQuery in Lucene and the Token class's setPayload() method). However, this doesn't do much for you in Solr as of yet without some work on your own. I think Tricia Williams has been working on payloads and Solr, but I don't know that anything has been posted. The tricky part, I believe, is how to handle indexing; integrating the BoostingTermQuery isn't all that hard, I don't think. Also note, there isn't anything in Solr preventing the use of payloads, but there probably is a decent amount to do to hook them in.
HTH,
Grant



On Jun 5, 2008, at 4:52 PM, Andreas von Hessling wrote:
Hi there!
As a Solr newbie who has, however, worked with Lucene before, I have an unusual question for the experts:
Question:
Can I, and if so, how do I perform index-time term boosting in documents where each boost value is not the same for all documents (no global boosting of a given term) but instead can be per-document? In other words: I understand there's a way to specify term boost values for search queries, but is that also possible for indexed documents?
Here's what I'm fundamentally trying to do:
I want to index and search over documents that have a special, associative-array-like property: Each document has a list of unique words, and each word has a numeric value between 0 and 1. These values express similarity in the dimensions with this word/name. For example, "cat": 0.99 is similar to "cat": 0.98, but not to "cat": 0.21. All documents have the same set of words, and there are lots of them: about 1 million. (If necessary, I can reduce the number of words to tens of thousands, but then the documents would not share the same set of words any more). Most of the word values for a typical document are 0.00.
Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00

d2:
cat: 0.02
dog: 0.00
car: 0.00

Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)

The ideal result would be that q matches d1 much more than d2.
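To make the desired ranking concrete, here is a small sketch in Python that scores the example documents against the query. It assumes a plain dot product over the shared words as the similarity measure. That choice is an assumption for illustration only: it is not Lucene's scoring formula, and it may not be the exact notion of similarity intended above. The function name `score` is hypothetical.

```python
def score(query, doc):
    """Dot product of query weights and document weights.

    Words missing from either side count as 0.0, which matches the
    'not specified in query' case in the example.
    """
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


d1 = {"cat": 0.99, "dog": 0.42, "car": 0.00}
d2 = {"cat": 0.02, "dog": 0.00, "car": 0.00}
q = {"cat": 0.99, "dog": 0.11}  # car is not specified in the query

if __name__ == "__main__":
    for name, doc in (("d1", d1), ("d2", d2)):
        print(name, score(q, doc))
```

Under this measure d1 scores about 1.03 and d2 about 0.02, so d1 ranks well ahead of d2.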


Here's my analysis of my situation and potential solutions:
- Because I have so many words, I cannot use a separate field for each word; this would overload Solr/Lucene. This is unfortunate, because I know there is index-time boosting on a per-field basis (reference: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f), and because I could have used Function Queries (reference: http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into pure text: the numeric values would be translated from e.g. "cat": 0.99 to repeating the word "cat" 99 times. This would be done for a particular document for all words, and the text would then be used for regular scoring in Solr. This approach seems doable, but inefficient and far from elegant.
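The repetition workaround from the second point can be sketched as follows. This is a minimal Python illustration, assuming the values are rounded to two decimal places (so a weight maps to 0 to 100 repetitions); the function name `weights_to_text` and the `scale` parameter are hypothetical.

```python
def weights_to_text(weights, scale=100):
    """Turn {'cat': 0.99, ...} into text that repeats each word round(weight * scale) times.

    Words with weight 0.00 produce no text at all, so they are simply absent
    from the indexed document.
    """
    words = []
    for term, weight in weights.items():
        words.extend([term] * round(weight * scale))
    return " ".join(words)


if __name__ == "__main__":
    text = weights_to_text({"cat": 0.99, "dog": 0.42, "car": 0.00})
    print(len(text.split()), "tokens")
```

The sketch also shows where the inefficiency comes from: a document with many non-zero words expands to up to 100 tokens per word.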
Am I reinventing the wheel here, or is what I'm trying to do something fundamentally different from what Solr and Lucene have to offer?
Any comments highly appreciated.  What can I do about this?


Thanks,

Andreas
--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
