[GitHub] [lucene] rmuir commented on a change in pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

GitBox Wed, 14 Apr 2021 14:02:13 -0700


rmuir commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613583132




##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
  * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
  * ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
  *
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
  * @see ScandinavianFoldingFilter
  */
 public final class ScandinavianNormalizationFilter extends TokenFilter {
 
+  /**
+   * Create the filter with default folding rules, backward compatible with all earlier versions
+   *
+   * @param input the TokenStream
+   */
   public ScandinavianNormalizationFilter(TokenStream input) {
     super(input);
+    this.foldings = ALL_FOLDINGS;
   }
 
+  /**
+   * Create the filter using custom folding rules.
+   *
+   * @param input the TokenStream
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {

Review comment:
       I still don't like this API for the end user. End users may not know
which of these foldings are appropriate for each language. Please see what I
wrote on the JIRA issue. Exposing Norwegian/Swedish/Danish filters doesn't
break any API. You also don't have to remove the existing Scandinavian filter
that does all foldings, nor do you have to duplicate huge chunks of code!
   
   Personally, I would move the logic into a
`ScandinavianNormalizer(Set<Foldings>)` helper that gets used by:
   * the existing ScandinavianNormalizationFilter, which just creates
`new ScandinavianNormalizer(ALL)` and uses it
   * NorwegianNormalizationFilter, which creates
`new ScandinavianNormalizer(???)` and uses it
   * SwedishNormalizationFilter, which creates
`new ScandinavianNormalizer(???)` and uses it
   * DanishNormalizationFilter, which creates
`new ScandinavianNormalizer(???)` and uses it
   
   This way, all 4 filters and their factories are parameter-free. Nobody needs
to know anything about how these languages work in order to do the "right"
thing: if they have some Norwegian text, they just use the Norwegian filter,
even if they don't have a clue about Norwegian orthography.
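
   To make the suggested shape concrete, here is a rough sketch of a shared
helper plus parameter-free entry points. It is only an illustration under
several assumptions: it works on plain lowercase strings rather than Lucene's
`TokenFilter`/`CharTermAttribute`, it always applies the single-character
foldings (ä, ö), and `exampleLanguage()` with its folding subset is a
hypothetical placeholder, since the per-language sets are left open (`???`)
above. None of these names come from the actual patch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Sketch only: plain strings stand in for Lucene's token stream machinery. */
final class ScandinavianNormalizerSketch {

  /** The digraph foldings named in the PR javadoc: aa, ao, ae, oe, oo. */
  enum Foldings { AA, AO, AE, OE, OO }

  static final Set<Foldings> ALL_FOLDINGS = EnumSet.allOf(Foldings.class);

  /** Shared helper: the folding logic lives here exactly once. */
  static final class ScandinavianNormalizer {
    private final Set<Foldings> foldings;

    ScandinavianNormalizer(Set<Foldings> foldings) {
      // Defensive copy so later changes to the caller's set have no effect.
      this.foldings = EnumSet.noneOf(Foldings.class);
      this.foldings.addAll(foldings);
    }

    /** Simplified, lowercase-only normalization (an assumption of this sketch). */
    String normalize(String term) {
      StringBuilder out = new StringBuilder(term.length());
      for (int i = 0; i < term.length(); i++) {
        char c = term.charAt(i);
        char next = i + 1 < term.length() ? term.charAt(i + 1) : '\0';
        if (c == '\u00e4') {                 // a-umlaut -> ae ligature
          out.append('\u00e6');
        } else if (c == '\u00f6') {          // o-umlaut -> o with stroke
          out.append('\u00f8');
        } else if (c == 'a' && next == 'a' && foldings.contains(Foldings.AA)) {
          out.append('\u00e5'); i++;         // aa -> a with ring
        } else if (c == 'a' && next == 'o' && foldings.contains(Foldings.AO)) {
          out.append('\u00e5'); i++;         // ao -> a with ring
        } else if (c == 'a' && next == 'e' && foldings.contains(Foldings.AE)) {
          out.append('\u00e6'); i++;         // ae -> ae ligature
        } else if (c == 'o' && next == 'e' && foldings.contains(Foldings.OE)) {
          out.append('\u00f8'); i++;         // oe -> o with stroke
        } else if (c == 'o' && next == 'o' && foldings.contains(Foldings.OO)) {
          out.append('\u00f8'); i++;         // oo -> o with stroke
        } else {
          out.append(c);                     // everything else passes through
        }
      }
      return out.toString();
    }
  }

  /** Parameter-free entry point mirroring the existing filter: all foldings. */
  static ScandinavianNormalizer scandinavian() {
    return new ScandinavianNormalizer(ALL_FOLDINGS);
  }

  /**
   * HYPOTHETICAL per-language entry point. The subset below is an arbitrary
   * placeholder for illustration, not a claim about any real language.
   */
  static ScandinavianNormalizer exampleLanguage() {
    return new ScandinavianNormalizer(EnumSet.of(Foldings.AE, Foldings.OE));
  }
}
```

   With all foldings enabled, `raeksmoergaas` and `räksmörgås` both normalize
to `ræksmørgås`, while the placeholder subset leaves the trailing `aa`
untouched. The caller only picks an entry point by name and never sees the
`Set<Foldings>`.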



