On Thu, Nov 12, 2009 at 3:00 PM, Stephen Duncan Jr <stephen.dun...@gmail.com
> wrote:

> On Thu, Nov 12, 2009 at 2:54 PM, Chris Hostetter <hossman_luc...@fucit.org
> > wrote:
>
>>
>> oh man, so you were parsing the Stored field values of every matching doc
>> at query time? ouch.
>>
>> Assuming i'm understanding your goal, the conventional way to solve this
>> type of problem is "payloads" ... you'll find lots of discussion on it in
>> the various Lucene mailing lists, and if you look online Michael Busch has
>> various slides that talk about using them.  they let you say things
>> like "in this document, at this position of field 'x' the word 'microsoft'
>> is worth 37.4, but at this other position (or in this other document)
>> 'microsoft' is only worth 17.2"
>>
>> The simplest way to use them in Solr (as i understand it) is to use
>> something like the DelimitedPayloadTokenFilterFactory when indexing, and
>> then write yourself a simple little custom QParser that generates a
>> BoostingTermQuery on your field.
>>
>> should be a lot simpler to implement than the Query you are describing,
>> and much faster.
>>
>>
>> -Hoss
>>
>>
> Thanks. I finally got around to looking at this again today and was looking
> at a similar path, so I appreciate the confirmation.
>
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com
>

For posterity, here's the rest of what I discovered trying to implement
this:

You'll need to write a PayloadSimilarity as described here:
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
(here's my updated version, since the method mentioned in that article has
been deprecated):

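    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.search.DefaultSimilarity;

    public class PayloadSimilarity extends DefaultSimilarity
    {
        @Override
        public float scorePayload(
            int docId,
            String fieldName,
            int start,
            int end,
            byte[] payload,
            int offset,
            int length)
        {
            // Each payload was indexed as a single float, so it is always
            // 4 bytes starting at offset; length can be ignored here.
            return PayloadHelper.decodeFloat(payload, offset);
        }
    }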
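For anyone wondering why the length can be ignored: a float payload is always 4 bytes. Here is a minimal JDK-only sketch of that round trip. FloatPayloadSketch is a hypothetical helper of mine, not part of Lucene, and it assumes the payload is a big-endian IEEE 754 float, which I believe matches what PayloadHelper does but is worth checking against your Lucene version.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: JDK-only illustration of a
// 4-byte float payload. Big-endian layout is an assumption.
final class FloatPayloadSketch {

    // Encode a float as 4 bytes (ByteBuffer defaults to big-endian order).
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    // Decode the 4 bytes starting at offset back into a float.
    static float decodeFloat(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, 4).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(0.8f);
        System.out.println(payload.length + " bytes, decoded: " + decodeFloat(payload, 0));
    }
}
```

The offset parameter matters because the byte array handed to the similarity may be a shared buffer in which the payload does not start at index 0.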

You'll need to register that similarity in your Solr schema.xml, using the
<similarity> element. This was hard to figure out: I didn't realize that the
similarity has to be applied globally to the writer/searcher, even though I
only care about payloads on one field, so I wasted time trying to figure out
how to plug the similarity into my query parser.

For the field itself, you'll want to use the "payloads" field type from the
example schema.xml, or something based on it.

The latest and greatest query type to use is PayloadTermQuery, which
replaces the now-deprecated BoostingTermQuery mentioned above.  I use it in
my custom query parser class, overriding getFieldQuery, checking for my
field name, and then:

    return new PayloadTermQuery(new Term(field, queryText),
        new AveragePayloadFunction());
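To make the choice of payload function a bit more concrete, here is a rough JDK-only sketch of what averaging per-occurrence payload scores amounts to. AveragePayloadSketch is a hypothetical class of mine, not the Lucene implementation, and the fallback of 1.0 when no payloads are seen is my assumption about a sensible neutral value.

```java
// Hypothetical JDK-only sketch; the real class is Lucene's
// AveragePayloadFunction, whose internals may differ.
final class AveragePayloadSketch {

    // Average the payload scores of every occurrence of the term in one
    // document. Returns a neutral 1.0 when there are none (an assumption).
    static float average(float[] payloadScores) {
        if (payloadScores.length == 0) {
            return 1.0f;
        }
        float sum = 0f;
        for (float score : payloadScores) {
            sum += score;
        }
        return sum / payloadScores.length;
    }

    public static void main(String[] args) {
        // A term that occurs twice in a document, with payloads 1.0 and 3.0.
        System.out.println(average(new float[] {1.0f, 3.0f}));
    }
}
```

If each term occurs only once per document in the field, as in my case, the average is simply that one payload value.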

Due to the global nature of the Similarity, I guess you'd have to modify it
to look at the field name and base behavior on that if you wanted different
kinds of payloads on different fields in one schema.
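I haven't tried this myself, but the field-name check could look roughly like the sketch below. It is JDK-only so it runs on its own; in real use the logic would sit inside the scorePayload override of the Similarity. The class name, both field names, and the payload encodings (4 big-endian bytes each) are made up for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical JDK-only sketch of per-field payload scoring. The field
// names and encodings are invented for this example.
final class FieldAwarePayloadSketch {

    static float scorePayload(String fieldName, byte[] payload, int offset, int length) {
        if ("weighted_terms".equals(fieldName)) {
            // Assumed: this field stores one float per token.
            return ByteBuffer.wrap(payload, offset, 4).getFloat();
        }
        if ("ranked_terms".equals(fieldName)) {
            // Assumed: this field stores one int per token.
            return ByteBuffer.wrap(payload, offset, 4).getInt();
        }
        // Any other field: neutral factor, so payloads leave the score alone.
        return 1.0f;
    }

    public static void main(String[] args) {
        byte[] asFloat = ByteBuffer.allocate(4).putFloat(0.8f).array();
        byte[] asInt = ByteBuffer.allocate(4).putInt(5).array();
        System.out.println(scorePayload("weighted_terms", asFloat, 0, 4));
        System.out.println(scorePayload("ranked_terms", asInt, 0, 4));
        System.out.println(scorePayload("title", asFloat, 0, 4));
    }
}
```

Returning 1.0 for unknown fields keeps the one global similarity safe to use on fields that carry no payloads at all.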

Also, in my original implementation I controlled the score completely, so
if I set a score of 0.8, the doc came back with a score of 0.8.  With this
technique the payload is just used as a boost/addition to the score, so my
scores came out higher than before.  Since they're still in the same
relative order, that still satisfied my needs, but it did require updating
my test cases.

-- 
Stephen Duncan Jr
www.stephenduncanjr.com
