kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1253843509

   > Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part 
of the codebase but I think I see what's happening and how you managed to fix 
it. Looks good to me. It'd be good to run this pipeline over a larger text base 
to make sure there are no surprises and regressions.
   
   Hi @dweiss  .. no worries, thanks for taking a look.
   
   I have made the changes you suggested. I agree that it would be good to test 
this with a large and more varied text base. I did incorporate these changes 
separately as a patch in my current project which is how I found that my 
initial proposal actually had a regression (it threw an NPE in case of empty 
input). It's possible there other bugs that I haven't uncovered .. did you have 
any particular test dataset in mind?
   
   
-------------------------------------------------------------------------------------------------------------------------------------------
   Totally separately I noticed that in the case of stacking two or more 
sentence-aware token filters, sentence token iteration and attribute cloning 
are performed separately across the different filters (this is the case in the 
current implementation as well). This work could probably be done in a 
dedicated `SentenceExtractionFilter` which could then pass a reference to a 
_single_ `sentenceTokenAttributes` to all downstream filters that rely on 
sentence-level analysis (perhaps such a reference could be stashed in the 
proposed `SentenceAttribute`). Also, following this reasoning a bit further, it 
is possible to conceive future implementations that rely on passing an 
arbitrary subset of adjacent tokens together to some analysis function, in 
which case the use of the term `Sentence` would become a misnomer. 
   
   However, addressing these issues would be more effort so I wanted to check 
if it is even worth it to explore. I am starting to suspect that this package 
is not really used a lot because otherwise one would expect these bugs to have 
been caught sooner given the configuration is in the documentation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to