Hello, I'm working on using trigrams for similarity matching on some data, where there's a canonical name and lots of personalised variants, e.g.:
canonical: "My Wonderful Thing"
variant: "My Wonderful Thing (for Matt Patterson)"

Using the pg_trgm (http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1#Extensions) index type and the K-Nearest-Neighbour operator in Postgres 9.1 I get pretty good results, and I want to do something similar using Solr. For one thing, it feels like there's a lot more room to tweak and optimise this than with Postgres.

Being new to Solr, I'm a little unsure about exactly what to do. I've set up a test Solr instance using a configuration like this: https://gist.github.com/1391468.

This is working, inasmuch as it's returning results, but the data set I'm working with is somewhat polluted, and even with regular manual cleaning it will probably always be a bit polluted. So, we have names in the data like:

"My Wonderful Thing"
"My Wonderful Thing (for Somebody Else)"
"My Wonderful Thing (for Yet Another Person)"

I really want the canonical version to be returned first in the results list, and the setup I have now is returning results like:

* "My Wonderful Thing (for Somebody Else)"
* "My Wonderful Thing (for Yet Another Person)"
* "My Wonderful Thing"
* "Other name with Wonderful or Thing in it"

With the Postgres pg_trgm index and the <-> K-NN operator I get results like:

* "My Wonderful Thing"
* "My Wonderful Thing (for Somebody Else)"
* "My Wonderful Thing (for Yet Another Person)"
* "Other name with Wonderful or Thing in it"

That is better, and I guess the difference is to do with the way the distance between the search term and each result is calculated.

So, is there something I can do to change the way ranking is calculated?

Also, is there a good place to start reading about this kind of similarity searching and Solr? Everything I've looked at so far seems to cover this kind of n-gram approach very lightly at best.

Thanks,
Matt
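
P.S. To make the ranking question concrete, here is a rough Python sketch of how I understand pg_trgm to score things. It is my own approximation, not code from Postgres: the function names are made up for illustration, and I'm assuming that words are lowercased, split on non-alphanumeric characters, padded with two leading spaces and one trailing space, and that similarity is the number of shared trigrams divided by the number of trigrams in either string, with <-> being one minus that.

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction (an assumption, see above)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Pad each word with two leading spaces and one trailing space.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def distance(a, b):
    """Rough stand-in for the <-> operator: one minus the similarity."""
    return 1.0 - similarity(a, b)


# Names listed in the order my Solr setup currently returns them.
names = [
    "My Wonderful Thing (for Somebody Else)",
    "My Wonderful Thing (for Yet Another Person)",
    "My Wonderful Thing",
    "Other name with Wonderful or Thing in it",
]

query = "My Wonderful Thing"
ranked = sorted(names, key=lambda name: distance(query, name))

for name in ranked:
    print(name)
```

Because extra text in a variant adds trigrams to the union without adding any to the intersection, longer variants score lower in this sketch, so the canonical name sorts first. That length penalty is the behaviour I'd like to reproduce in Solr.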