Hello,

I'm working on using trigrams for similarity matching on some data, where 
there's a canonical name and lots of personalised variants, e.g.:

canonical: "My Wonderful Thing"
variant: "My Wonderful Thing (for Matt Patterson)"

Using the pg_trgm 
(http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1#Extensions) index 
type and the K-Nearest-Neighbour operator in Postgres 9.1 I get pretty good 
results. I want to do something similar using Solr, partly because it feels 
like there's a lot more room to tweak and optimise this than with Postgres. 
Being new to Solr, I'm a little unsure about exactly what to do. I've set up a 
test Solr instance using a configuration like this: 
https://gist.github.com/1391468.

This is working, inasmuch as it's returning results, but the data set I'm 
working with is somewhat polluted, and even with regular manual cleaning it 
will probably always be a bit polluted. So, we have names in the data like:

"My Wonderful Thing"
"My Wonderful Thing (for Somebody Else)"
"My Wonderful Thing (for Yet Another Person)"

I really want the canonical version to be returned first in the results list, 
and the setup I have now is returning results like:

* "My Wonderful Thing (for Somebody Else)"
* "My Wonderful Thing (for Yet Another Person)"
* "My Wonderful Thing"
* "Other name with Wonderful or Thing in it"

With the Postgres pg_trgm index and <-> K-NN operator I get results like:

* "My Wonderful Thing"
* "My Wonderful Thing (for Somebody Else)"
* "My Wonderful Thing (for Yet Another Person)"
* "Other name with Wonderful or Thing in it"

That ordering is better, and I guess the difference is down to the way the 
distance between the search term and each result is calculated.
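
To make that guess a bit more concrete, here is a rough Python sketch of how I 
understand pg_trgm's distance to work: each string is lowercased and split into 
words, each word is padded with two leading spaces and one trailing space, and 
the distance is 1 minus (shared trigrams / total distinct trigrams). The 
function names and the handling of empty input are my own assumptions for 
illustration, not pg_trgm's actual code, and this says nothing about how Solr 
scores the same strings.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, approximating pg_trgm's rules."""
    grams = set()
    # Lowercase, then treat any run of non-alphanumerics as a word separator.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Pad each word with two spaces in front and one behind.
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def distance(a, b):
    """1 - (shared trigrams / all distinct trigrams); 0.0 means identical sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        # Assumption: treat two strings with no trigrams as maximally distant.
        return 1.0
    return 1.0 - len(ta & tb) / len(union)


query = "My Wonderful Thing"
names = [
    "My Wonderful Thing (for Somebody Else)",
    "My Wonderful Thing (for Yet Another Person)",
    "My Wonderful Thing",
]

# Sort ascending by distance, as an ORDER BY on the <-> operator would.
ranked = sorted(names, key=lambda name: distance(query, name))
for name in ranked:
    print(round(distance(query, name), 3), name)
```

Under this measure the extra trigrams from "(for Somebody Else)" enlarge the 
union without adding to the overlap, so the longer variants end up further away 
than the exact name, which matches the ordering I see from Postgres.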

So, is there something I can do to change the way ranking is calculated? Also, 
is there a good place to start reading about this kind of similarity searching 
and Solr?  Everything I've looked at so far seems to cover this kind of n-gram 
approach very lightly at best.

Thanks,

Matt
