kotman12 commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1253843509
> Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part of the codebase but I think I see what's happening and how you managed to fix it. Looks good to me. It'd be good to run this pipeline over a larger text base to make sure there are no surprises and regressions. Hi @dweiss .. no worries, thanks for taking a look. I have made the changes you suggested. I agree that it would be good to test this with a large and more varied text base. I did incorporate these changes separately as a patch in my current project which is how I found that my initial proposal actually had a regression (it threw an NPE in case of empty input). It's possible there other bugs that I haven't uncovered .. did you have any particular test dataset in mind? ------------------------------------------------------------------------------------------------------------------------------------------- Totally separately I noticed that in the case of stacking two or more sentence-aware token filters, sentence token iteration and attribute cloning are performed separately across the different filters (this is the case in the current implementation as well). This work could probably be done in a dedicated `SentenceExtractionFilter` which could then pass a reference to a _single_ `sentenceTokenAttributes` to all downstream filters that rely on sentence-level analysis (perhaps such a reference could be stashed in the proposed `SentenceAttribute`). Also, following this reasoning a bit further, it is possible to conceive future implementations that rely on passing an arbitrary subset of adjacent tokens together to some analysis function, in which case the use of the term `Sentence` would become a misnomer. However, addressing these issues would be more effort so I wanted to check if it is even worth it to explore. I am starting to suspect that this package is not really used a lot because otherwise one would expect these bugs to have been caught sooner given the configuration is in the documentation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org