[jira] [Commented] (LUCENE-10008) CommonGramsFilterFactory doesn't respect ignoreCase=true when default stopwords are used

Chris M. Hostetter (Jira) Thu, 17 Jun 2021 14:49:10 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365142#comment-17365142
 ]


Chris M. Hostetter commented on LUCENE-10008:
---------------------------------------------

{quote}Should we add a new base class common for {{Stop/KeepWord/CommonGrams}} 
to parse these args ...
{quote}
Yeah ... that's what that comment is suggesting: a new (abstract) base class 
injected into the hierarchy as a common parent of those 3 concrete classes, so 
the "words", "format" and "ignoreCase" args are parsed in one place.

Something like...
{code:java}
public abstract class AbstractWordsFileFilterFactory extends TokenFilterFactory
    implements ResourceLoaderAware {
  private CharArraySet words; // nocommit: also provide public accessor
  private final String wordFiles; // nocommit: also provide public accessor
  private final String format; // nocommit: also provide public accessor
  private final boolean ignoreCase; // nocommit: also provide public accessor

  // nocommit: jdocs
  public AbstractWordsFileFilterFactory(Map<String, String> args) {
    super(args);
    wordFiles = get(args, "words");
    format = get(args, "format");
    ignoreCase = getBoolean(args, "ignoreCase", false);
  }

  // nocommit: jdocs
  @Override
  public void inform(ResourceLoader loader) throws IOException {
    // nocommit: mostly verbatim from current StopFilterFactory
    // nocommit: but replace direct use of ENGLISH_STOP_WORDS_SET in "default"
    // nocommit: codepath with...
    // ... } else { ...; return createDefaultWords(); }
  }

  // nocommit: jdocs
  protected CharArraySet createDefaultWords() {
    // nocommit: KeepWordFilterFactory should override this method to return null
    return new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, ignoreCase);
  }
}{code}
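
For anyone who wants to poke at the shape of this without a Lucene checkout, here is a rough JDK-only sketch of the same idea: the parent owns the "no words configured" fallback, and a subclass can override the hook. None of this is Lucene API. `WordsFactorySketch`, `StopLike` and `KeepLike` are hypothetical names, a `TreeSet` ordered by `String.CASE_INSENSITIVE_ORDER` stands in for a `CharArraySet` built with ignoreCase=true, and the three-word default list is made up.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class WordsFactorySketch {
  // Hypothetical stand-in for the shared, case-sensitive default stop word set.
  static final Set<String> DEFAULT_WORDS = Set.of("a", "of", "the");

  // Stand-in for "new CharArraySet(source, ignoreCase)": copy into a set that honors the flag.
  static Set<String> copyOf(Collection<String> source, boolean ignoreCase) {
    if (!ignoreCase) {
      return new HashSet<>(source);
    }
    Set<String> copy = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    copy.addAll(source);
    return copy;
  }

  // Shared parent: holds the common options and the fallback used when no words are configured.
  public abstract static class Base {
    protected final boolean ignoreCase;
    private final List<String> configuredWords; // null when no "words" option was given
    private Set<String> words;

    public Base(List<String> configuredWords, boolean ignoreCase) {
      this.configuredWords = configuredWords;
      this.ignoreCase = ignoreCase;
    }

    // Plays the role of inform(ResourceLoader): resolve the word set exactly once.
    public void inform() {
      words = (configuredWords != null)
          ? copyOf(configuredWords, ignoreCase)
          : createDefaultWords();
    }

    public Set<String> getWords() {
      return words;
    }

    // Default hook: the defaults are copied so that ignoreCase is respected.
    protected Set<String> createDefaultWords() {
      return copyOf(DEFAULT_WORDS, ignoreCase);
    }
  }

  // Stop/CommonGrams-like factories simply inherit the default hook.
  public static final class StopLike extends Base {
    public StopLike(List<String> configuredWords, boolean ignoreCase) {
      super(configuredWords, ignoreCase);
    }
  }

  // A KeepWord-like factory has no sensible defaults, so it opts out by returning null.
  public static final class KeepLike extends Base {
    public KeepLike(List<String> configuredWords, boolean ignoreCase) {
      super(configuredWords, ignoreCase);
    }

    @Override
    protected Set<String> createDefaultWords() {
      return null;
    }
  }

  public static void main(String[] args) {
    Base stop = new StopLike(null, true);
    stop.inform();
    System.out.println("StopLike defaults match \"THE\": " + stop.getWords().contains("THE"));

    Base keep = new KeepLike(null, true);
    keep.inform();
    System.out.println("KeepLike defaults: " + keep.getWords());
  }
}
```

The point of the hook is that the case handling for the defaults lives in exactly one method, so a subclass either inherits it unchanged or replaces it wholesale.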
 

 

> CommonGramsFilterFactory doesn't respect ignoreCase=true when default 
> stopwords are used
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10008
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10008
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> CommonGramsFilterFactory's use of the "words" and "ignoreCase" config options 
> is inconsistent with how StopFilterFactory uses them - leading to 
> "ignoreCase=true" not being respected unless "words" is specified...
> StopFilterFactory...
> {code:java}
>   public void inform(ResourceLoader loader) throws IOException {
>     if (stopWordFiles != null) {
>       ...
>     } else {
>       ...
>       stopWords = new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, ignoreCase);
>     }
>   }
> {code}
> CommonGramsFilterFactory...
> {code:java}
>   @Override
>   public void inform(ResourceLoader loader) throws IOException {
>     if (commonWordFiles != null) {
>       ...
>     } else {
>       commonWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
>     }
>   }
> {code}
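
The difference between the two fallbacks quoted above can be reproduced with plain JDK collections. This is only an illustrative sketch, not Lucene code: `IgnoreCaseDefaultsSketch` and its methods are hypothetical names, the default word list is invented, and a `TreeSet` ordered by `String.CASE_INSENSITIVE_ORDER` stands in for a `CharArraySet` created with ignoreCase=true.

```java
import java.util.Set;
import java.util.TreeSet;

public class IgnoreCaseDefaultsSketch {
  // Hypothetical stand-in for the shared, case-sensitive default stop word set.
  static final Set<String> DEFAULT_WORDS = Set.of("a", "of", "the");

  // Mirrors the CommonGramsFilterFactory fallback: the shared set is used as-is,
  // so the ignoreCase argument never influences the result.
  static Set<String> defaultsUsedDirectly(boolean ignoreCase) {
    return DEFAULT_WORDS;
  }

  // Mirrors the StopFilterFactory fallback: the defaults are copied into a set
  // that honors ignoreCase.
  static Set<String> defaultsCopied(boolean ignoreCase) {
    if (!ignoreCase) {
      return DEFAULT_WORDS;
    }
    Set<String> copy = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    copy.addAll(DEFAULT_WORDS);
    return copy;
  }

  public static void main(String[] args) {
    // With ignoreCase=true, only the copied variant matches a capitalized token.
    System.out.println(defaultsUsedDirectly(true).contains("The")); // prints false
    System.out.println(defaultsCopied(true).contains("The"));       // prints true
  }
}
```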



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

