: I have a field that I use for faceting. I do not tokenize this field. It
: has entries like:
:
: AWB artikel 2, lid 1
: AWB artikel 8:75
: Algemene Wet Bestuursrecht artikel 8:75

I assume those are names of laws, followed by page/paragraph numbers in
various formats? (And evidently "lid" is Dutch for "section"?)

: a facet for each law, instead of for each paragraph of the law. I tried to do
: this with a SynonymFilterFactory using rules like ...:
:
: But that doesn't work. And even if it would work, it would not be a good
: solution, since I will never be able to come up with a complete list, as
: long as I cannot use wildcards.

I don't know enough about your source data to know all the possible
permutations you have to deal with, but I would tackle this with
something like...

 * KeywordTokenizerFactory
 * PatternReplaceFilterFactory - regex to strip off any \d+:\d+ at the
   end of tokens
 * PatternReplaceFilterFactory - regex to strip off any
   \d+,\s+lid\s+\d+ at the end of tokens
 * PatternReplaceFilterFactory - regex to strip off "\s+artikel" from
   the end of tokens
 * TrimFilterFactory
 * SynonymFilterFactory - mapping things like "Algemene Wet
   Bestuursrecht" to "AWB"

-Hoss
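
PS: before wiring the regexes into an analyzer, it may help to try them
out against the sample values. Below is a rough Python sketch that only
emulates the chain above; it is not Solr code. The function name
facet_value and the SYNONYMS table are made up for illustration, and the
"artikel" pattern is loosened to allow trailing whitespace on the
assumption that stripping the numbers leaves a space behind.

```python
import re

# Hypothetical synonym table standing in for the SynonymFilterFactory step.
SYNONYMS = {"Algemene Wet Bestuursrecht": "AWB"}

# Applied in order, mirroring the three PatternReplaceFilterFactory steps.
STRIP_PATTERNS = [
    re.compile(r"\d+:\d+$"),           # trailing "8:75"
    re.compile(r"\d+,\s+lid\s+\d+$"),  # trailing "2, lid 1"
    re.compile(r"\s+artikel\s*$"),     # trailing "artikel" (assumed: may be followed by spaces)
]


def facet_value(raw: str) -> str:
    """Reduce a raw field value to the law name used as the facet."""
    value = raw  # KeywordTokenizer step: the whole field is one token
    for pattern in STRIP_PATTERNS:
        value = pattern.sub("", value)
    value = value.strip()  # TrimFilter step
    return SYNONYMS.get(value, value)  # synonym step: exact whole-value match


if __name__ == "__main__":
    for sample in (
        "AWB artikel 2, lid 1",
        "AWB artikel 8:75",
        "Algemene Wet Bestuursrecht artikel 8:75",
    ):
        print(sample, "->", facet_value(sample))
```

If other numbering formats show up in the real data, they would need
their own pattern added to the list in the same way.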