Thanks to both of you.

I understand from your replies that setting the payloads for terms (per-document) is easy and the BoostingTermQuery can be used to set payloads on the query side. Getting this to work in Solr would require significant work though. I wish I had the time to do that, but for my purposes I'll go with the suboptimal workaround of repeating words.

But let me emphasize how great payloads in Solr would be: they would open up many new options (as Grant describes in http://lucene.grantingersoll.com/2007/03/18/payloads/ ). In particular, they would allow Solr to search not just over mere text documents, but over any object that can be described by numerical feature values, similar to a general-purpose classifier. That is, indexed items become training examples, each of which is described by a set of features (words) and their corresponding numerical values. Queries can then be seen as testing examples, which query the knowledge base in a case-based reasoning (CBR) manner. Solr would be an extremely scalable classifier with easy setup, convenient interfaces and simple ways to change the classification function (change the similarity/ranking function). From an AI perspective, this would be huge!

Just a thought.

Andreas





Tricia Williams wrote:
Payloads could be the answer, but I don't think there is any crossover into what I've been working on with payloads. (https://issues.apache.org/jira/browse/SOLR-380 has what I last posted, which is pretty much what we're using now; I've also posted the related SOLR-532 and SOLR-522.)

What you would have to do is write a custom Tokenizer or TokenFilter which takes your input, breaks it into tokens, and then adds the numeric value as a payload. Assuming your input is actually something like:
cat:0.99 dog:0.42 car:0.00
you could write a TokenFilter which builds on the WhitespaceTokenizer to split each token on ":", using the first part as the token value and the second part as the token's payload. I think the APIs are pretty clear if you are looking for help.
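To make the splitting and encoding concrete, here is a standalone sketch of what such a filter would compute (plain Java, not the actual Lucene TokenFilter subclass — a real implementation would do this work inside incrementToken()/next() and call Token.setPayload(); the 4-byte float encoding mirrors what Lucene's PayloadHelper does):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch of what a payload-producing TokenFilter would do:
// split each whitespace-delimited token on ":" and encode the numeric
// part as a 4-byte payload (big-endian float bits).
public class PayloadSketch {

    // Encode a float as 4 bytes (same idea as Lucene's PayloadHelper.encodeFloat).
    public static byte[] encode(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    // Decode 4 payload bytes back into a float.
    public static float decode(byte[] b) {
        int bits = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                 | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    // Turn "cat:0.99 dog:0.42 car:0.00" into a term -> payload-bytes map.
    public static Map<String, byte[]> parse(String input) {
        Map<String, byte[]> out = new LinkedHashMap<>();
        for (String tok : input.trim().split("\\s+")) {
            int i = tok.indexOf(':');
            out.put(tok.substring(0, i),
                    encode(Float.parseFloat(tok.substring(i + 1))));
        }
        return out;
    }
}
```

The real filter would emit one token per term ("cat", "dog", "car") and attach the encoded bytes as that token's payload.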

I haven't looked at all at how you can query/boost using payloads, but if Grant says that integrating the BoostingTermQuery isn't all that hard I would believe him.

Good Luck,
Tricia

Grant Ingersoll wrote:
Hmmm, if I understand your question correctly, I think Lucene's payloads are what you are after.

Lucene does support payloads (i.e., per-term storage in the index; see the BoostingTermQuery in Lucene and the Token class's setPayload() method). However, this doesn't do much for you in Solr as of yet without some work on your own. I think Tricia Williams has been working on payloads and Solr, but I don't know that anything has been posted. The tricky part, I believe, is how to handle indexing; integrating the BoostingTermQuery isn't all that hard, I don't think. Also note, there isn't anything in Solr preventing the use of payloads, but there probably is a decent amount to do to hook them in.

HTH,
Grant



On Jun 5, 2008, at 4:52 PM, Andreas von Hessling wrote:

Hi there!
As a Solr newbie who has however worked with Lucene before, I have an unusual question for the experts:

Question:

Can I, and if so, how do I, perform index-time term boosting in documents where the boost value is not the same for all documents (no global boosting of a given term) but can instead vary per document? In other words: I understand there's a way to specify term boost values for search queries, but is that also possible for indexed documents?


Here's what I'm fundamentally trying to do:

I want to index and search over documents that have a special, associative-array-like property: each document has a list of unique words, and each word has a numeric value between 0 and 1. These values express similarity along the dimension with that word's name. For example, "cat": 0.99 is similar to "cat": 0.98, but not to "cat": 0.21. All documents have the same set of words, and there are lots of them: about 1 million. (If necessary, I can reduce the number of words to tens of thousands, but then the documents would no longer share the same set of words.) Most of the word values for a typical document are 0.00.
Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00

d2:
cat: 0.02
dog: 0.00
car: 0.00

Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)

The ideal result would be that q matches d1 much more than d2.
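Numerically, the desired ranking behaves like a dot product between the query's term weights and each document's term weights; a tiny self-contained sketch (hypothetical names, not Solr scoring code) shows why q prefers d1:

```java
import java.util.Map;

// Sketch: the desired ranking is essentially a dot product between
// query term weights and document term weights.
public class DotProductDemo {
    public static double score(Map<String, Double> query, Map<String, Double> doc) {
        double s = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            s += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Double> q  = Map.of("cat", 0.99, "dog", 0.11);
        Map<String, Double> d1 = Map.of("cat", 0.99, "dog", 0.42, "car", 0.00);
        Map<String, Double> d2 = Map.of("cat", 0.02, "dog", 0.00, "car", 0.00);
        // q . d1 = 0.99*0.99 + 0.11*0.42, far larger than q . d2 = 0.99*0.02
        System.out.println(score(q, d1) > score(q, d2));
    }
}
```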


Here's my analysis of my situation and potential solutions:

- Because I have so many words, I cannot use a separate field for each word; this would overload Solr/Lucene. This is unfortunate, because I know there is index-time boosting on a per-field basis (reference: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f), and because I could have used Function Queries (reference: http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into pure text: the numeric values would be translated, e.g. "cat": 0.99 becomes the word "cat" repeated 99 times. This would be done for each document for all words, and the resulting text would then be used for regular scoring in Solr. This approach seems doable, but inefficient and far from elegant.
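The repetition workaround can be sketched in a few lines (a hypothetical helper, assuming each weight is rounded to a repeat count of weight * 100):

```java
import java.util.Map;

// Sketch of the repetition workaround: turn each weighted term into
// plain text by repeating the word round(weight * 100) times, so that
// ordinary term-frequency scoring roughly approximates the weights.
public class RepeatWorkaround {
    public static String toText(Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            int n = (int) Math.round(e.getValue() * 100);
            for (int i = 0; i < n; i++) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString().trim();
    }
}
```

With about a million words per document this blows up the index badly, which is part of why it is inelegant.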


Am I reinventing the wheel here, or is what I'm trying to do fundamentally different from what Solr and Lucene have to offer?

Any comments highly appreciated.  What can I do about this?


Thanks,

Andreas

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







