[
https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839131#comment-17839131
]
ASF GitHub Bot commented on OPENNLP-1554:
-----------------------------------------
kinow commented on code in PR #597:
URL: https://github.com/apache/opennlp/pull/597#discussion_r1573009751
##########
opennlp-tools/src/test/java/opennlp/tools/sentdetect/AbstractSentenceDetectorTest.java:
##########
@@ -30,13 +30,16 @@
public abstract class AbstractSentenceDetectorTest {
- protected static final Locale LOCALE_SPANISH = new Locale("es");
+ protected static final Locale LOCALE_DUTCH = new Locale("nl");
protected static final Locale LOCALE_POLISH = new Locale("pl");
protected static final Locale LOCALE_PORTUGUESE = new Locale("pt");
+ protected static final Locale LOCALE_SPANISH = new Locale("es");
Review Comment:
The small details that normally pass unnoticed! Thanks for sorting
alphabetically :clap:
##########
opennlp-tools/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEDutchTest.java:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect;
+
+import java.io.IOException;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.dictionary.Dictionary;
+
+/**
+ * Tests for the {@link SentenceDetectorME} class.
+ * <p>
+ * Demonstrates OPENNLP-1554.
+ * <p>
+ * In this context, well-known known Dutch (nl_NL) abbreviations must be
respected,
+ * so that words abbreviated with one or more '.' characters do not
+ * result in incorrect sentence boundaries.
+ * <p>
+ * See:
+ * <a
href="https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1554">OPENNLP-1554</a>
+ */
+public class SentenceDetectorMEDutchTest extends AbstractSentenceDetectorTest {
+
+ private static final char[] EOS_CHARS = {'.', '?', '!'};
+
+ private static SentenceModel sentdetectModel;
+
+ @BeforeAll
+ public static void prepareResources() throws IOException {
+ Dictionary abbreviationDict = loadAbbDictionary(LOCALE_DUTCH);
+ SentenceDetectorFactory factory = new SentenceDetectorFactory(
+ "dut", true, abbreviationDict, EOS_CHARS);
+ sentdetectModel = train(factory, LOCALE_DUTCH);
+ Assertions.assertNotNull(sentdetectModel);
+ Assertions.assertEquals("dut", sentdetectModel.getLanguage());
+ }
+
+ // Example taken from 'Sentences_NL.txt'
+ @Test
+ void testSentDetectWithInlineAbbreviationsEx1() {
+ final String sent1 = "Een droom, tot de vorming waarvan een bijzonder
sterke compressie " +
+ "heeft bijgedragen, zal het meest gunstige materiaal zijn voor dit
onderzoek.";
+ // Here we have one abbreviations "p." => pagina (page)
+ final String sent2 = "Ik kies voor de droom van de botanische monografie
die " +
+ "op p. 183 en volgende wordt beschreven.";
+
+ SentenceDetectorME sentDetect = new SentenceDetectorME(sentdetectModel);
+ String sampleSentences = sent1 + " " + sent2;
+ String[] sents = sentDetect.sentDetect(sampleSentences);
+ Assertions.assertEquals(2, sents.length);
+ Assertions.assertEquals(sent1, sents[0]);
+ Assertions.assertEquals(sent2, sents[1]);
+ double[] probs = sentDetect.getSentenceProbabilities();
+ Assertions.assertEquals(2, probs.length);
+ }
+
+ // Reduced example taken from 'Sentences_NL.txt'
+ @Test
+ void testSentDetectWithInlineAbbreviationsEx2() {
+ // Here we have one abbreviations: "d.w.z." = dat wil zeggen (eng.: that
is to say)
+ final String sent1 = "Met het oog op de overvloed aan ideeën die de
analyse op elk " +
+ "afzonderlijk element van de droominhoud brengt, zullen sommige
lezers twijfels " +
+ "hebben over het principe of alles wat later tijdens de analyse in
je opkomt, " +
+ "tot de droomgedachten gerekend mag worden, d.w.z. of aangenomen
mag worden " +
+ "dat al deze gedachten al tijdens de slaaptoestand actief waren en
bijdroegen " +
+ "aan de vorming van de droom?";
Review Comment:
I was reviewing a pull request in Jena today, and noticed the `"""` for
multiline, which I normally use for Python, but never for Java. Maybe we can
use that in the future to simplify instead of `"strings strings strings " +` (I
think `"""` is Java 15 > ?).
> Add Dutch abbreviation dictionary
> ---------------------------------
>
> Key: OPENNLP-1554
> URL: https://issues.apache.org/jira/browse/OPENNLP-1554
> Project: OpenNLP
> Issue Type: Improvement
> Components: Sentence Detector, Tokenizer
> Affects Versions: 2.3.2
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Minor
> Fix For: 2.3.3
>
>
> Similar to the addition in OPENNLP-1526, an abbreviation dictionary for Dutch
> sentence detection and tokenisation might be beneficial.
> Aims:
> Create and add a new file {{abb_NL.xml}} in {{opennlp-tools/lang/nl}}
> Add basic set of test cases
--
This message was sent by Atlassian Jira
(v8.20.10#820010)