[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

GitBox Wed, 21 Sep 2022 08:05:24 -0700


kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1253843509

> Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part
of the codebase but I think I see what's happening and how you managed to fix
it. Looks good to me. It'd be good to run this pipeline over a larger text base
to make sure there are no surprises and regressions.

Hi @dweiss .. no worries, thanks for taking a look.

I have made the changes you suggested. I agree that it would be good to test
this with a large and more varied text base. I did incorporate these changes
separately as a patch in my current project which is how I found that my
initial proposal actually had a regression (it threw an NPE in case of empty
input). It's possible there other bugs that I haven't uncovered .. did you have
any particular test dataset in mind?

-------------------------------------------------------------------------------------------------------------------------------------------
Totally separately I noticed that in the case of stacking two or more
sentence-aware token filters, sentence token iteration and attribute cloning
are performed separately across the different filters (this is the case in the
current implementation as well). This work could probably be done in a
dedicated `SentenceExtractionFilter` which could then pass a reference to a
_single_ `sentenceTokenAttributes` to all downstream filters that rely on
sentence-level analysis (perhaps such a reference could be stashed in the
proposed `SentenceAttribute`). Also, following this reasoning a bit further, it
is possible to conceive future implementations that rely on passing an
arbitrary subset of adjacent tokens together to some analysis function, in
which case the use of the term `Sentence` would become a misnomer.

However, addressing these issues would be more effort so I wanted to check
if it is even worth it to explore. I am starting to suspect that this package
is not really used a lot because otherwise one would expect these bugs to have
been caught sooner given the configuration is in the documentation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

Reply via email to