[
https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645376#comment-17645376
]
ASF GitHub Bot commented on OPENNLP-1357:
-----------------------------------------
mawiesne opened a new pull request, #453:
URL: https://github.com/apache/opennlp/pull/453
Change
-
- adjusts method signatures in `SentenceDetector` and `EndOfSentenceScanner`
to use `CharSequence` as proposed by reporter 'P. Austin'
- adapts existing implementation classes to work with this change, see comments
in OPENNLP-1357
- adjusts JavaDoc accordingly
- adds `@Override` annotations in some spots where they were missing
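
For illustration, a minimal sketch of what a `CharSequence`-based signature permits on the caller side. It is not code from this PR: the class `EosScanSketch` and its `EOS_CHARS` set are hypothetical names chosen for the example, and the loop simply mirrors the snippet given in OPENNLP-1357. It assumes Java 9 or later for `Set.of`.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration only; not an OpenNLP class.
public class EosScanSketch {

    // Assumed end-of-sentence characters; the real scanner is configurable.
    private static final Set<Character> EOS_CHARS = Set.of('.', '!', '?');

    // Returns the indices of all end-of-sentence characters in s.
    // Accepting CharSequence means no copy into a String or StringBuffer is needed.
    public static List<Integer> getPositions(CharSequence s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (EOS_CHARS.contains(s.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Hi there. How are you? Fine!";

        // The same method serves several CharSequence implementations.
        System.out.println(getPositions(text));
        System.out.println(getPositions(new StringBuilder(text)));
        System.out.println(getPositions(CharBuffer.wrap(text.toCharArray())));
    }
}
```

All three calls yield the same positions. A caller with a large input could pass its own `CharSequence` implementation (for example, one backed by a buffer it manages itself) instead of materializing a `String` first.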
Tasks
-
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
- [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [x] Is your initial contribution a single, squashed commit?
### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn
clean install at the root opennlp folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main
LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main
NOTICE file found in opennlp folder?
### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which
it is rendered?
### Note:
Please ensure that once the PR is submitted, you check GitHub Actions for
build issues and submit an update to your PR as soon as possible.
> Use CharSequence to allow for memory management
> -----------------------------------------------
>
> Key: OPENNLP-1357
> URL: https://issues.apache.org/jira/browse/OPENNLP-1357
> Project: OpenNLP
> Issue Type: New Feature
> Components: Sentence Detector
> Affects Versions: 1.9.4
> Reporter: Paul Austin
> Priority: Minor
>
> Most of the classes in OpenNLP require their inputs to be a String,
> StringBuffer, or char[]. This means that you have to load all the data into
> memory.
> Many of these cases (String and StringBuffer args) could be replaced with a
> single method that accepts CharSequence as a parameter.
> For example DefaultEndOfSentenceScanner
>
> {code:java}
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets
> and, in some cases, would require fewer temporary conversions to char buffers.
> Some code, such as the SDContextGenerator, already uses CharSequence. However,
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer.
> The StringBuffer (sb) is never modified, SDContextGenerator.getContext takes a
> CharSequence as its argument, and String is itself a CharSequence.
>
> {code:java}
> public Span[] sentPosDetect(String s) {
>   sentProbs.clear();
>   StringBuffer sb = new StringBuffer(s);
> {code}
>
> I can create a pull request(s) for the above if you think it is useful.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)