[
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649516#comment-16649516
]
J. Fiala commented on OPENNLP-1223:
-----------------------------------
>>> JF 14.10.: Used the full data set for training and only the person
>>> sentences for evaluation.
See the updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
> Add NameFinder model based on Tiger
> -----------------------------------
>
> Key: OPENNLP-1223
> URL: https://issues.apache.org/jira/browse/OPENNLP-1223
> Project: OpenNLP
> Issue Type: New Feature
> Components: language model
> Reporter: J. Fiala
> Priority: Major
> Attachments: tiger_2.2_namefinder.bin.7z,
> tiger_2.2_namefinder.testdata.txt,
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart -
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>
> 1.) add model based on tiger (/)
> >>> Generated from 6,271 sentences with tagged names (always given name +
> >>> surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
>
> h3. Input data
> * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
> * yagoLabels.tsv.7z (Max Planck Institute)
>
> [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
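A minimal sketch of how the given names might be collected from the YAGO labels file. The row layout is an assumption, not taken from the issue: tab-separated rows, a predicate column equal to `<hasGivenName>`, and a quoted literal with an optional language tag in the following column. The function name `load_given_names` and the `predicate` parameter are hypothetical; adjust them to the actual file.

```python
import re

# Assumed literal form: "Salvador"@eng (quoted string, optional language tag).
LITERAL = re.compile(r'^"(.*)"(?:@\w+)?$')


def load_given_names(lines, predicate="<hasGivenName>"):
    """Collect the literals of all rows whose predicate column matches.

    `lines` is any iterable of text lines, e.g. an open file handle.
    The predicate name is an assumption about the YAGO dump.
    """
    names = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if predicate not in fields:
            continue
        idx = fields.index(predicate)
        if idx + 1 >= len(fields):
            continue  # predicate without an object column
        match = LITERAL.match(fields[idx + 1].strip())
        if match:
            names.add(match.group(1))
    return names
```

With a real file this would be called as `load_given_names(open(path, encoding="utf-8"))`, where `path` points to the unpacked yagoLabels.tsv.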
> h3. Basic workflow
> 1.) Extract sentences from the Tiger corpus that contain candidate names (two
> consecutive tokens tagged as NE)
> 2.) Check whether the first token of each candidate is a given name according
> to the YAGO labels database (the given name is assumed to come first)
> 3.) If the YAGO labels list that token as givenName, tag the two tokens as a
> person name
> 4.) Train with the full data set (50,472 sentences, including those without
> names)
> 5.) Evaluate with the person data set (6,271 sentences)
> >>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
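Steps 1 to 3 above can be sketched roughly as follows. This is an illustration under assumptions, not the script used for the attached model: each sentence is given as a list of (form, POS tag) pairs already read from the CoNLL-09 file, `given_names` is a set built from the YAGO labels, and the function name `tag_person_names` is hypothetical. The output uses the `<START:person> ... <END>` markup of the OpenNLP name finder training format.

```python
def tag_person_names(tokens, given_names):
    """Tag two consecutive NE tokens as a person if the first is a given name.

    tokens: list of (form, pos) pairs for one sentence.
    given_names: set of known given names (assumed to come from YAGO).
    Returns the sentence as one whitespace-separated training line.
    """
    out = []
    i = 0
    while i < len(tokens):
        form, pos = tokens[i]
        is_pair = (
            pos == "NE"
            and i + 1 < len(tokens)
            and tokens[i + 1][1] == "NE"
        )
        if is_pair and form in given_names:
            out.append(f"<START:person> {form} {tokens[i + 1][0]} <END>")
            i += 2  # skip the surname, it is already part of the span
        else:
            out.append(form)
            i += 1
    return " ".join(out)
```

Because only pairs are tagged, a third NE token such as "Gonzalez" in "Salvador Lopez Gonzalez" stays outside the span, which matches the limitation listed under further improvements.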
> h3. Open questions
> I first extracted the 6,271 sentences that mention names and trained on that
> (filtered) data only. Would it be better to train on the complete data set
> (including the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct:
> 7644.
> TOTAL: precision: 99,77%; recall: 99,80%; F1: 99,78%.
> person: precision: 99,77%; recall: 99,80%; F1: 99,78%. [target:
> 7659; tp: 7644; fp: 18]
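The percentages follow directly from the counts reported above (target 7659, found 7662, correct 7644). A short check, with the helper name `prf` chosen here only for illustration:

```python
def prf(target, found, correct):
    """Return (precision, recall, F1) from entity counts."""
    precision = correct / found    # share of found entities that are correct
    recall = correct / target      # share of target entities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = prf(target=7659, found=7662, correct=7644)
print(f"precision: {p:.2%}; recall: {r:.2%}; F1: {f:.2%}")
```

The 18 false positives are the difference between found and correct entities (7662 - 7644), and 15 target entities (7659 - 7644) were missed.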
>
> h3. Further Improvements:
> 1.) Some tagged names actually refer to locations and need to be refined
> (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint
> <START:person> Salvador Lopez <END> Gonzalez , das Oberhaupt von
> <START:person> San Juan <END> <START:person> Juan Chamula <END> , einem
> pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez
> Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")
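One possible direction for improvements 2 and 3, as a sketch under assumptions: tag the whole run of consecutive NE tokens instead of only the first two, and break runs at words from an exclusion list. Sentences are assumed to be lists of (form, POS tag) pairs; the function name `tag_name_runs` and the `stopwords` parameter are hypothetical. A plain exclusion list is not context-sensitive, so it only approximates item 3.

```python
def tag_name_runs(tokens, given_names, stopwords=frozenset()):
    """Tag maximal runs of NE tokens that start with a known given name.

    tokens: list of (form, pos) pairs for one sentence.
    stopwords: forms that never belong to a person name (e.g. "General").
    """
    out = []
    i = 0
    while i < len(tokens):
        # Extend j over consecutive NE tokens that are not excluded.
        j = i
        while (j < len(tokens) and tokens[j][1] == "NE"
               and tokens[j][0] not in stopwords):
            j += 1
        run = [form for form, _ in tokens[i:j]]
        if len(run) >= 2 and run[0] in given_names:
            out.append("<START:person> " + " ".join(run) + " <END>")
            i = j
        else:
            out.append(tokens[i][0])
            i += 1
    return " ".join(out)
```

With this variant "Salvador Lopez Gonzalez" becomes a single span. Location names such as San Juan would still be tagged and need a separate check (item 1).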
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)