: I'm working on using trigrams for similarity matching on some data,
: where there's a canonical name and lots of personalised variants, e.g.:
:
: canonical: "My Wonderful Thing"
: variant: "My Wonderful Thing (for Matt Patterson)"

I'm really not sure why you would need trigrams for something like this.
Just doing something basic like whitespace tokenization and using length
norms should allow any of these queries...

   q=My Wonderful Thing         ...basic 3 clause term query
   q="My Wonderful Thing"       ...strict phrase query
   q="My Wonderful Thing"~5     ...sloppy phrase query, allowing other words mixed in

...to match both docs, with the canonical version scoring higher because
of its shorter length.

: I really want the canonical version to be returned first in the results
: list, and the setup I have now is returning results like:
:
: * "My Wonderful Thing (for Somebody Else)"
: * "My Wonderful Thing (for Yet Another Person)"
: * "My Wonderful Thing"

What exactly does your query look like? Are you doing a quoted phrase
search? What does the debugQuery output tell you about those matches?

The order you are getting is probably because the variants are getting
additional matches for some of the trigrams in the various names of people.
I don't see any specific cases in those contrived examples, but for instance
"My Wonderful Thing (for Raymond Fuller)" will match a basic query for the
trigrams better than just "My Wonderful Thing" because "ond" and "ful" each
appear twice in the variant but only once in the canonical. (This is why
real examples are critical.)

-Hoss
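
For anyone wanting to check the trigram overlap by hand, here is a minimal
sketch. It assumes lowercased, per-word character trigrams with punctuation
dropped; the actual analyzer configuration isn't shown in this thread, so
treat the tokenization (and the helper name trigrams) as hypothetical.

```python
import re
from collections import Counter

def trigrams(text):
    """Count character trigrams inside each lowercased word of text.

    Assumption: words are runs of letters/digits and trigrams do not
    span word boundaries. A real analyzer may be configured differently.
    """
    counts = Counter()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts

canonical = trigrams("My Wonderful Thing")
variant = trigrams("My Wonderful Thing (for Raymond Fuller)")

# Trigrams from the canonical name that occur more often in the variant,
# shown as (count in variant, count in canonical).
repeated = {g: (variant[g], canonical[g])
            for g in canonical if variant[g] > canonical[g]}
print(repeated)  # {'ond': (2, 1), 'ful': (2, 1)}
```

Under these assumptions the extra occurrences come from the person's name
sharing trigrams with "Wonderful", which is the kind of thing that can push
a variant above the canonical entry for a plain trigram query.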