[
https://issues.apache.org/jira/browse/OPENNLP-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martin Wiesner updated OPENNLP-1528:
------------------------------------
Issue Type: Task (was: Bug)
> Review Catalan regexp for the ela geminada
> ------------------------------------------
>
> Key: OPENNLP-1528
> URL: https://issues.apache.org/jira/browse/OPENNLP-1528
> Project: OpenNLP
> Issue Type: Task
> Reporter: Bruno P. Kinoshita
> Assignee: Bruno P. Kinoshita
> Priority: Minor
> Attachments: image-2023-12-11-15-20-31-518.png
>
>
> I posted on Twitter about the issue with the word "ós" found in our tokenizer
> tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that
> our regexp for Catalan did not seem right.
> I created this issue so we can test and fix it.
> > Regexp is not fully correct. Catalan written language uses middle dot /
> > interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar,
> > cancel·lar,...
> !image-2023-12-11-15-20-31-518.png|width=365,height=429!
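
To make the point in the quoted reply concrete, here is a minimal sketch of a word pattern that treats the middle dot (U+00B7) as an inner-word character only when it sits between two l's, as in cel·la. This is not the regexp currently used in OpenNLP; the class name ElaGeminadaSketch and the method words() are hypothetical and exist only for illustration. It also assumes precomposed (NFC) input, so that "ós" is matched as two letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not OpenNLP's tokenizer code.
final class ElaGeminadaSketch {

    // A word is a run of letters, optionally extended by "l(middle dot)l" joins.
    // The middle dot (U+00B7) is accepted only when the previous character
    // is an l and the next character is an l; anywhere else it ends the word.
    private static final Pattern WORD = Pattern.compile(
        "\\p{L}+(?:(?<=[lL])\\u00B7(?=[lL])\\p{L}+)*");

    // Returns every word-like match in the text, in order of appearance.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

With this pattern, "cel·la" and "instal·lar" each come back as a single word, while a middle dot that is not flanked by l's (for example in "a·b") still splits the text. Whether the real fix should restrict the dot to the l·l context or allow it more broadly is a design choice left to the review in this issue.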
--
This message was sent by Atlassian Jira
(v8.20.10#820010)