: I have a field that I use for faceting.  I do not tokenize this field. It
: has entries like:
: 
: AWB artikel 2, lid 1
: AWB artikel 8:75
: Algemene Wet Bestuursrecht artikel 8:75

I assume those are names of laws, followed by page/paragraph numbers in 
various formats? (and evidently "lid" is Dutch for "section"?)

: a facet for each law, instead of for each paragraph of the law. I tried to do
: this with a SynonymFilterFactory using rules like
        ...: 
: But that doesn't work. And even if it would work, it would not be a good
: solution, since I will never be able to come up with a complete list, as
: long as I cannot use wildcards.

I don't know enough about your source data to know all the possible 
permutations you have to deal with, but I would tackle this with something 
like...

 * KeywordTokenizerFactory
 * PatternReplaceFilterFactory 
   - regex to strip off any \d+:\d+ at the end of tokens
 * PatternReplaceFilterFactory 
   - regex to strip off any \d+,\s+lid\s+\d+ at the end of tokens
 * PatternReplaceFilterFactory 
   - regex to strip off "\s+artikel" from the end of tokens
 * TrimFilterFactory
 * SynonymFilterFactory
   - mapping things like "Algemene Wet Bestuursrecht" to "AWB"
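
To sanity check the regexes before putting them in an analyzer, the chain above can be simulated outside Solr. This is only a rough Python sketch, not Solr config: the function name normalize_law and the one-entry synonym table are made up for illustration, and I have added a leading \s* to the first two patterns (an assumption on my part) so that "artikel" is left at the very end of the token for the third pattern to strip.

```python
import re

# Hypothetical synonym table; the real list depends on the source data.
SYNONYMS = {"Algemene Wet Bestuursrecht": "AWB"}

# Applied in order, mirroring the three pattern-replace steps in the list.
STRIP_PATTERNS = [
    re.compile(r"\s*\d+:\d+$"),           # trailing "8:75"
    re.compile(r"\s*\d+,\s+lid\s+\d+$"),  # trailing "2, lid 1"
    re.compile(r"\s+artikel$"),           # trailing " artikel"
]


def normalize_law(value: str) -> str:
    """Reduce a full citation to the name of the law it refers to."""
    token = value  # keyword tokenizer: the whole value is a single token
    for pattern in STRIP_PATTERNS:
        token = pattern.sub("", token)
    token = token.strip()  # trim step
    # Synonym step: collapse long names to their abbreviation if known.
    return SYNONYMS.get(token, token)


if __name__ == "__main__":
    for entry in (
        "AWB artikel 2, lid 1",
        "AWB artikel 8:75",
        "Algemene Wet Bestuursrecht artikel 8:75",
    ):
        print(entry, "->", normalize_law(entry))
```

With the three sample entries from the original message, each one should collapse to "AWB". Any citation format not covered by the patterns will pass through unchanged, which makes the missing cases easy to spot in the facet counts.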

-Hoss
