On 6/25/06, Eric Jain <[EMAIL PROTECTED]> wrote:
I'd like to have "PowerShot", "powershot" and "power-shot" match each other. Solr has a WordDelimiterFilter, which works quite well, except that "powershot" still won't match "PowerShot" (which is tokenized into "power (shot powershot)", so "power powershot" would match...). Any suggestions?
You mean that if the indexed text was "powershot" and the query text was "PowerShot", it wouldn't match (but the reverse case will)?

That is a problem... if one does both catenation and splitting on the query side, you end up with "Power" in the first position, and both "Shot" and "PowerShot" in the second. While this works fine on the indexing side, on the query side it's interpreted as a MultiPhraseQuery meaning "Power" followed by either "Shot" or "PowerShot".

Workarounds:

1) Write a new QueryParser smart enough to make a boolean query instead of a MultiPhraseQuery: "Power Shot" OR "PowerShot".

2) Index the field a second time via copyField, but have the query analyzer for that field catenate instead of split subwords. Then query across both fields.

3) Do more client-side processing... change "PowerShot" to "PowerShot" OR "powershot" (i.e. create a boolean query yourself, with the second option removing subword delimiters).

(1) is much harder to do in a generic way, but would be most useful. (2) is much easier and can be done now.

-Yonik
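
For what it's worth, workaround (3) could look roughly like the following on the client. This is only a sketch in Python: expand_subword_query is a hypothetical helper name (not part of Solr or Lucene), and it assumes the field's analyzer lowercases tokens and that the term contains no quote characters that would need escaping.

```python
import re


def expand_subword_query(term: str) -> str:
    """Build a query string matching the term as typed OR its catenated form.

    Hypothetical client-side helper; assumes the index lowercases tokens.
    """
    # Strip subword delimiters (hyphens, underscores, spaces, punctuation)
    # and lowercase, so "power-shot" and "PowerShot" both give "powershot".
    catenated = re.sub(r"[\W_]+", "", term).lower()

    # If the term is already in catenated form, there is nothing to add.
    if catenated == term:
        return f'"{term}"'

    return f'"{term}" OR "{catenated}"'


if __name__ == "__main__":
    for t in ("PowerShot", "power-shot", "powershot"):
        print(expand_subword_query(t))
```

The original term still goes through the field's query analyzer (so it is split into subwords there), while the catenated alternative covers documents where the text was indexed as a single word.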