rmuir commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613583132
##########
File path:
lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
* <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
* ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
*
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
* @see ScandinavianFoldingFilter
*/
public final class ScandinavianNormalizationFilter extends TokenFilter {
+ /**
+ * Create the filter with default folding rules, backward compatible with all earlier versions
+ *
+ * @param input the TokenStream
+ */
public ScandinavianNormalizationFilter(TokenStream input) {
super(input);
+ this.foldings = ALL_FOLDINGS;
}
+ /**
+ * Create the filter using custom folding rules.
+ *
+ * @param input the TokenStream
+ * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+ */
+ public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {
Review comment:
I still don't like this API for the end user. End users may not know which
of these foldings are appropriate for each language. Please see what I stated
on the JIRA issue. Exposing Norwegian/Swedish/Danish filters doesn't break any
API. You also don't have to remove the existing Scandinavian one that does
all foldings. Nor do you have to duplicate huge chunks of code!
Personally, I would move the logic into a `ScandinavianNormalizer(Set<Foldings>)`
helper that gets used by:
* the existing ScandinavianNormalizationFilter, which just creates `new
ScandinavianNormalizer(ALL)` and uses it
* NorwegianNormalizationFilter, which creates `new ScandinavianNormalizer(???)`
and uses it
* SwedishNormalizationFilter, which creates `new ScandinavianNormalizer(???)`
and uses it
* DanishNormalizationFilter, which creates `new ScandinavianNormalizer(???)`
and uses it
This way, all 4 filters and their factories are parameter-free. Nobody needs
to know anything about how these languages work in order to do the "right"
thing, e.g. if they have some Norwegian text, they just use the Norwegian one,
even if they don't have a clue about Norwegian orthography.
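
A rough sketch of the shape suggested above, under several assumptions: it works on plain lowercase Strings rather than Lucene's `TokenStream`/`CharTermAttribute`, the wrapper class and its `filter` method are hypothetical stand-ins, and the folding logic is a simplified illustration rather than the actual Lucene implementation. The `Foldings` names (AA, AO, AE, OE, OO) come from the PR's javadoc. The per-language sets are left out on purpose, since the comment leaves them as `???`.

```java
import java.util.EnumSet;
import java.util.Set;

public class ScandinavianNormalizerSketch {

  /** Digraph foldings named in the PR's javadoc. */
  enum Foldings { AA, AO, AE, OE, OO }

  /** Hypothetical shared helper: all folding logic lives here, once. */
  static final class ScandinavianNormalizer {
    static final Set<Foldings> ALL_FOLDINGS = EnumSet.allOf(Foldings.class);

    private final Set<Foldings> foldings;

    ScandinavianNormalizer(Set<Foldings> foldings) {
      // Defensive copy so later changes to the caller's set have no effect.
      EnumSet<Foldings> copy = EnumSet.noneOf(Foldings.class);
      copy.addAll(foldings);
      this.foldings = copy;
    }

    /** Simplified, lowercase-only normalization (an assumption of this sketch). */
    String normalize(String term) {
      StringBuilder out = new StringBuilder(term.length());
      for (int i = 0; i < term.length(); i++) {
        char c = term.charAt(i);
        char next = i + 1 < term.length() ? term.charAt(i + 1) : '\0';
        if (c == '\u00E4') {                      // a-umlaut -> ae-ligature
          out.append('\u00E6');
        } else if (c == '\u00F6') {               // o-umlaut -> o-slash
          out.append('\u00F8');
        } else if (c == 'a' && next == 'a' && foldings.contains(Foldings.AA)) {
          out.append('\u00E5'); i++;              // aa -> a-ring
        } else if (c == 'a' && next == 'o' && foldings.contains(Foldings.AO)) {
          out.append('\u00E5'); i++;              // ao -> a-ring
        } else if (c == 'a' && next == 'e' && foldings.contains(Foldings.AE)) {
          out.append('\u00E6'); i++;              // ae -> ae-ligature
        } else if (c == 'o' && next == 'e' && foldings.contains(Foldings.OE)) {
          out.append('\u00F8'); i++;              // oe -> o-slash
        } else if (c == 'o' && next == 'o' && foldings.contains(Foldings.OO)) {
          out.append('\u00F8'); i++;              // oo -> o-slash
        } else {
          out.append(c);
        }
      }
      return out.toString();
    }
  }

  /** Stand-in for the existing filter: parameter-free, applies all foldings. */
  static final class ScandinavianNormalizationFilter {
    private final ScandinavianNormalizer normalizer =
        new ScandinavianNormalizer(ScandinavianNormalizer.ALL_FOLDINGS);

    String filter(String term) {
      return normalizer.normalize(term);
    }
  }

  // Language-specific filters would look the same, each passing its own
  // fixed set to the helper; those sets are not specified in the comment.

  public static void main(String[] args) {
    ScandinavianNormalizationFilter f = new ScandinavianNormalizationFilter();
    String a = f.filter("r\u00E4ksm\u00F6rg\u00E5s"); // raksmorgas with a-umlaut, o-umlaut, a-ring
    String b = f.filter("raeksmoergaas");
    System.out.println(a.equals(b));                        // true
    System.out.println(a.equals(f.filter("raksmorgas")));   // false
  }
}
```

With this layout, the caller-facing classes take no folding parameter; only the helper knows about `Set<Foldings>`.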