[ 
https://issues.apache.org/jira/browse/LUCENE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222651#comment-17222651
 ] 

Nguyen Minh Gia Huy commented on LUCENE-9588:
---------------------------------------------

My original statement, *_a Tokenizer invoke incrementToken on another 
tokenfilter_*, could be misleading. To be clear, it may invoke 
incrementToken on another *Tokenizer*.

The existing subclasses of SegmentingTokenizerBase handle word 
segmentation without having to deal with IOException, but that is not always 
the case. Word segmentation sometimes involves I/O-aware code, e.g. tokenizing 
a Japanese sentence using 
[JapaneseTokenizer|https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L526].

Additionally, the method 
[incrementSentence|https://github.com/apache/lucene-solr/blob/9ce4b98af2155ba9d6d41e12ff12017c557a9ea4/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L174-L195]
 is currently declared to throw IOException, but none of the statements inside 
it throws one. Isn't that a sign that either (1) the IOException is 
unnecessary for *incrementSentence* or (2) *setNextSentence* and 
*incrementWord* should throw IOException?
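
To illustrate the inheritance problem, here is a minimal sketch. It does not 
use the Lucene classes: SegmentingBaseSketch, WordSource and 
DelegatingSegmenter are hypothetical stand-ins that only mirror the shape of 
the two hooks (no throws clause) and of a delegate whose incrementToken may 
throw IOException. Wrapping in UncheckedIOException is one possible 
workaround, not what Lucene does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for another Tokenizer: incrementToken() may throw IOException.
interface WordSource {
    boolean incrementToken() throws IOException;
}

// Simplified stand-in for SegmentingTokenizerBase (assumption: only the two hooks
// and a cut-down incrementSentence are modelled here).
abstract class SegmentingBaseSketch {
    // Neither hook declares a checked exception.
    protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

    protected abstract boolean incrementWord();

    // Declared to throw IOException although nothing in the body can throw it.
    final boolean incrementSentence(int start, int end) throws IOException {
        setNextSentence(start, end);
        return incrementWord();
    }
}

// A subclass that delegates word segmentation to an I/O-aware source.
class DelegatingSegmenter extends SegmentingBaseSketch {
    private final WordSource delegate;

    DelegatingSegmenter(WordSource delegate) {
        this.delegate = delegate;
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        // A real implementation would hand the sentence text to the delegate here.
    }

    @Override
    protected boolean incrementWord() {
        try {
            return delegate.incrementToken();
        } catch (IOException e) {
            // The override cannot add "throws IOException", so the checked
            // exception has to be wrapped (or swallowed) at this point.
            throw new UncheckedIOException(e);
        }
    }
}
```

If the two hooks were declared with throws IOException, the try/catch in 
incrementWord could be dropped and the exception would propagate through 
incrementSentence, which already declares it.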

> Exceptions handling in methods of SegmentingTokenizerBase
> ---------------------------------------------------------
>
>                 Key: LUCENE-9588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9588
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 8.6.3
>            Reporter: Nguyen Minh Gia Huy
>            Priority: Minor
>
> The current signatures of the *setNextSentence* and *incrementWord* methods in 
> *SegmentingTokenizerBase* do not declare any checked exceptions, which makes 
> the class troublesome to extend.
>  For example, if we override _incrementWord_ with logic that invokes 
> _incrementToken_ on another token filter, _incrementToken_ raises an 
> _IOException_ but _incrementWord_ is not declared to throw it. 
> I think having _setNextSentence_ and _incrementWord_ declare IOException 
> would make *SegmentingTokenizerBase* easier to use.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

