I use the keyword tokenizer and then a pattern replace filter to turn multi-word terms into underscore-connected tokens. For instance, "Burger Joint" becomes "burger_joint", which is then looked up in my synonym filter against underscored synonyms. When it matches, I either replace the underscores with spaces or just hand the tokens over to the WordDelimiterFilterFactory before further processing.
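
Roughly, that chain could look like the sketch below. This is only an illustration, not my exact config: the field type name and the synonyms file name are made-up placeholders, and it assumes the synonyms file itself uses underscored entries (e.g. "burger_joint, hamburger_restaurant"). Because the keyword tokenizer keeps the whole value as one token, this only matches when the entire field value is the multi-word term, so it suits short fields rather than free text.

```xml
<!-- Hypothetical field type: the name and the synonyms file are placeholders, not taken from this thread -->
<fieldType name="text_underscore_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keep the whole field value as a single token, e.g. "Burger Joint" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Drop leading and trailing whitespace so it does not turn into stray underscores -->
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Join the words with underscores: "burger joint" becomes "burger_joint" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="_" replace="all"/>
    <!-- Look up the underscored token; the file would hold lines such as: burger_joint, hamburger_restaurant -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_underscore.txt" ignoreCase="true" expand="true"/>
    <!-- Turn underscores back into spaces inside each emitted token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" " replace="all"/>
  </analyzer>
</fieldType>
```

If you would rather end up with separate word tokens than one token containing spaces, the last filter could be swapped for a WordDelimiterFilterFactory with generateWordParts="1", since it treats the underscore as a delimiter.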

On Sun, Mar 26, 2017 at 11:53 AM Sanjana Sridhar <sanjana.srid...@wishabi.com> wrote:

> Hello,
>
> Does anyone have a good solution for working with multi word synonyms? I've
> been reading a lot about this online and haven't really found a great
> solution to it. I use the SynonymFilterFactory at index time, but words
> don't really get matched to the appropriate multi word synonyms, even
> though using the Analysis tool shows that it should be matched.
>
> Examples:
>
> coke, coca cola
>
> This is the configuration I have on text fields:
>
> <fieldType name="text_icu_english" class="solr.TextField"
>            positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <!-- The white space tokenizer splits on white space but preserves
>          the tokens so that it can be used by the next filter -->
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" ignoreCase="true"
>             expand="true" synonyms="synonyms.txt"/>
>     <!-- This filter splits a word on punctuation, preserves the
>          original, concatenates the split words and also stems english
>          possessive nouns -->
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             splitOnCaseChange="0" preserveOriginal="1" catenateWords="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="(.*[\*].*)" replacement=""/>
>     <filter class="solr.TrimFilterFactory"/>
>     <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>     <filter class="solr.ClassicFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <!-- The white space tokenizer splits on white space but preserves
>          the tokens so that it can be used by the next filter -->
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- This filter splits a word on punctuation, preserves the
>          original, concatenates the split words and also stems english
>          possessive nouns -->
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             splitOnCaseChange="0" preserveOriginal="1" catenateWords="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.ClassicFilterFactory"/>
>   </analyzer>
>   <similarity class="solr.BM25SimilarityFactory">
>     <float name="b">0.0</float>
>   </similarity>
> </fieldType>
>
> Greatly appreciate any help ya'll can offer.
>
> Thanks,
> Sanjana

--
*John Blythe*
Product Manager & Lead Developer
251.605.3071 | j...@curvolabs.com
www.curvolabs.com
58 Adams Ave
Evansville, IN 47713