: I have a field that I use for faceting. I do not tokenize this field. It
: has entries like:
:
: AWB artikel 2, lid 1
: AWB artikel 8:75
: Algemene Wet Bestuursrecht artikel 8:75
I assume those are names of laws, followed by page/paragraph numbers in
various formats? (and evidently "lid" is Dutch for "section"?)
: a facet for each law, instead of for each paragraph of the law. I tried to do
: this with a SynonymFilterFactory using rules like
...:
: But that doesn't work. And even if it would work, it would not be a good
: solution, since I will never be able to come up with a complete list, as
: long as I cannot use wildcards.
I don't know enough about your source data to know all the possible
permutations you have to deal with, but I would tackle this with something
like...
* KeywordTokenizerFactory
* PatternReplaceFilterFactory
  - regex to strip off any \d+:\d+ at the end of tokens
* PatternReplaceFilterFactory
  - regex to strip off any \d+,\s+lid\s+\d+ at the end of tokens
* PatternReplaceFilterFactory
  - regex to strip off "\s+artikel" from the end of tokens
* TrimFilterFactory
* SynonymFilterFactory
  - mapping things like "Algemene Wet Bestuursrecht" to "AWB"
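
To sanity-check the regexes before wiring them into a field type, the chain above can be simulated outside Solr. This is only a rough Python sketch, not Solr configuration: the function name normalize_law and the SYNONYMS dict are made up for illustration, and the dict merely stands in for a synonyms file. One assumption worth noting: the third pattern allows trailing whitespace after "artikel", because stripping the numbers leaves a trailing space and the trim step only runs afterwards.

```python
import re

# Hypothetical stand-in for a synonyms file: full law name -> abbreviation.
SYNONYMS = {"Algemene Wet Bestuursrecht": "AWB"}

# Patterns applied in the same order as the filters listed above,
# each anchored at the end of the (single, untokenized) value.
STRIP_PATTERNS = [
    re.compile(r"\d+:\d+$"),           # e.g. "8:75"
    re.compile(r"\d+,\s+lid\s+\d+$"),  # e.g. "2, lid 1"
    # Assumption: tolerate the space left behind by the two steps above.
    re.compile(r"\s+artikel\s*$"),
]


def normalize_law(value: str) -> str:
    """Reduce a value like 'AWB artikel 8:75' to its law name."""
    token = value  # keyword tokenizer: the whole value is one token
    for pattern in STRIP_PATTERNS:
        token = pattern.sub("", token)
    token = token.strip()  # trim step
    return SYNONYMS.get(token, token)  # synonym step


if __name__ == "__main__":
    for raw in (
        "AWB artikel 2, lid 1",
        "AWB artikel 8:75",
        "Algemene Wet Bestuursrecht artikel 8:75",
    ):
        print(repr(raw), "->", repr(normalize_law(raw)))
```

All three sample entries from the question come out as "AWB" with these patterns. Values in other formats would pass through unchanged (apart from trimming), so they would still show up as separate facet entries until a pattern is added for them.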
-Hoss