[
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649517#comment-16649517
]
J. Fiala commented on OPENNLP-1223:
-----------------------------------
Dear Daniel,
I updated https://issues.apache.org/jira/browse/OPENNLP-1223 and attached a
new model, now trained on all of the Tiger data (50,472 sentences).
Evaluation is done only on the sentences that contain names (6,271 sentences).
For known limitations, see "Further Improvements" in the issue.
Best regards,
Johannes
> Add NameFinder model based on Tiger
> -----------------------------------
>
> Key: OPENNLP-1223
> URL: https://issues.apache.org/jira/browse/OPENNLP-1223
> Project: OpenNLP
> Issue Type: New Feature
> Components: language model
> Reporter: J. Fiala
> Priority: Major
> Attachments: tiger_2.2_namefinder.bin.7z,
> tiger_2.2_namefinder.testdata.txt,
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart -
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>
> 1.) add model based on tiger (/)
> >>> generated from 6,271 sentences with tagged names (always given name +
> >>> surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
>
> h3. Input data
> * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
> * yagoLabels.tsv.7z (Max Planck Institute)
> [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
> h3. Basic workflow
> 1.) Extract sentences from the Tiger corpus that contain possible names (two
> consecutive words tagged as NE)
> 2.) Check whether each possible name starts with a given name, using the YAGO
> labels database (the first word is assumed to be the given name)
> 3.) If the first word is listed in the YAGO labels as givenName, tag the two
> words as a person name
> 4.) Train on the full data set (50,472 sentences, including sentences without
> names)
> 5.) Evaluate on the person data set (6,271 sentences)
> >>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
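
Steps 1 to 5 above could be sketched roughly as follows. This is only an illustration, not the script used for the attached model: it assumes the corpus has already been read into (word, tag) pairs and that the YAGO givenName labels have been loaded into a plain set. The function names and the two sample sentences are made up for the example.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word form, tag); the tag "NE" marks a proper noun


def tag_person_names(sentence: List[Token], given_names: Set[str]) -> str:
    """Render one sentence in name-finder markup (steps 1 to 3).

    Two consecutive NE tokens are tagged as a person only if the first
    one is a known given name (assumption: given_names comes from YAGO).
    """
    out: List[str] = []
    i = 0
    while i < len(sentence):
        word, tag = sentence[i]
        next_is_ne = i + 1 < len(sentence) and sentence[i + 1][1] == "NE"
        if tag == "NE" and next_is_ne and word in given_names:
            surname = sentence[i + 1][0]
            out.append(f"<START:person> {word} {surname} <END>")
            i += 2  # skip the surname, it is already part of the span
        else:
            out.append(word)
            i += 1
    return " ".join(out)


def build_datasets(sentences: List[List[Token]], given_names: Set[str]):
    """Return (all lines for training, lines with names for evaluation)."""
    all_lines = [tag_person_names(s, given_names) for s in sentences]
    person_lines = [line for line in all_lines if "<START:person>" in line]
    return all_lines, person_lines


# Hypothetical sample data, not taken from the Tiger corpus.
sample = [
    [("Gestern", "ADV"), ("sprach", "VVFIN"), ("Helmut", "NE"),
     ("Kohl", "NE"), ("in", "APPR"), ("Bonn", "NE"), (".", "$.")],
    [("Es", "PPER"), ("regnet", "VVFIN"), (".", "$.")],
]
train, evaluation = build_datasets(sample, given_names={"Helmut"})
print(train[0])  # Gestern sprach <START:person> Helmut Kohl <END> in Bonn .
```

Here `train` keeps both sentences (step 4), while `evaluation` keeps only the one with a tagged name (step 5). A single NE such as "Bonn" stays untagged because it is not followed by a second NE.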
> h3. Open questions
> I first extracted the 6,271 sentences that mention names and trained on that
> (filtered) data. Would it be better to train on the complete data (including
> the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct:
> 7644.
> TOTAL: precision: 99,77%; recall: 99,80%; F1: 99,78%.
> person: precision: 99,77%; recall: 99,80%; F1: 99,78%. [target:
> 7659; tp: 7644; fp: 18]
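
For reference, the percentages above follow directly from the reported counts (target: 7659; tp: 7644; fp: 18). A minimal check, with a helper name chosen only for this example:

```python
def precision_recall_f1(tp: int, fp: int, target: int):
    """Compute precision, recall and F1 from entity counts."""
    found = tp + fp               # entities the model proposed (7662 here)
    precision = tp / found        # share of proposed entities that are correct
    recall = tp / target          # share of reference entities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=7644, fp=18, target=7659)
print(f"precision: {p:.2%}; recall: {r:.2%}; F1: {f1:.2%}")
# precision: 99.77%; recall: 99.80%; F1: 99.78%
```

Since the evaluation set was labelled by the same heuristic that produced the training data, these numbers measure how well the model reproduces that heuristic.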
>
> h3. Further Improvements:
> 1.) Some tagged names actually refer to locations and need to be refined
> (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint
> <START:person> Salvador Lopez <END> Gonzalez , das Oberhaupt von
> <START:person> San Juan <END> <START:person> Juan Chamula <END> , einem
> pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez
> Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")
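
To review cases like the ones listed above, it may help to pull the tagged spans out of the model output and inspect them. The following is a small sketch that assumes the `<START:type> ... <END>` markup shown in the example sentence; the function name is made up for illustration.

```python
import re
from typing import List, Tuple

# Matches one annotated span and captures its type and its text.
SPAN = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>")


def extract_entities(line: str) -> List[Tuple[str, str]]:
    """Return (type, text) for every tagged span in one output line."""
    return [(m.group(1), m.group(2)) for m in SPAN.finditer(line)]


line = ("meint <START:person> Salvador Lopez <END>Gonzalez , das Oberhaupt von "
        "<START:person> San Juan <END> <START:person> Juan Chamula <END> , einem")
for entity_type, text in extract_entities(line):
    print(entity_type, "|", text)
```

Listing the spans this way makes it easier to spot location names tagged as persons (item 1) and names that were cut off after two words (item 2).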