[ 
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649516#comment-16649516
 ] 

J. Fiala commented on OPENNLP-1223:
-----------------------------------

>>> JF 14.10.: Used the full data set for training and only the person 
>>> sentences for evaluation.
See the updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z

> Add NameFinder model based on Tiger
> -----------------------------------
>
>                 Key: OPENNLP-1223
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1223
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: language model
>            Reporter: J. Fiala
>            Priority: Major
>         Attachments: tiger_2.2_namefinder.bin.7z, 
> tiger_2.2_namefinder.testdata.txt, 
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart - 
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>  
> 1.) add model based on tiger (/)
> >>> generated from 6,271 sentences with tagged names (always given name + 
> >>> surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
>  
> h3. Input data
>  * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
>  www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
>  * yagoLabels.tsv.7z (Max Planck Institute)
>  [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
> h3. Basic workflow
> 1.) Extract sentences from the Tiger corpus that contain possible names (two 
> consecutive words tagged as NE)
> 2.) Check whether the first word of each candidate is a given name according 
> to the YAGO labels database (the first word is assumed to be the given name)
> 3.) If the first word is listed in the YAGO labels as a givenName, tag the 
> two words as a person name
> 4.) Train on the full data set (50,472 sentences, including sentences 
> without names)
> 5.) Evaluate on the person data set (6,271 sentences)
> >>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
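
Steps 1 to 3 above could be sketched roughly as follows. This is a minimal illustration, not the script that produced the attached model: the function name tag_person_names is hypothetical, the sentence is assumed to be already parsed into (word, POS tag) pairs, and the YAGO givenName labels are assumed to be loaded into a plain set. Only the <START:person> ... <END> markup is taken from the issue itself.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word form, POS tag); "NE" marks a proper noun


def tag_person_names(sentence: List[Token], given_names: Set[str]) -> str:
    """Wrap 'given name + surname' pairs in <START:person> ... <END> markup.

    Hypothetical helper: `given_names` stands in for the YAGO givenName labels.
    """
    out: List[str] = []
    i = 0
    while i < len(sentence):
        word, pos = sentence[i]
        # Step 1: a candidate is two consecutive tokens tagged NE.
        is_pair = (
            pos == "NE"
            and i + 1 < len(sentence)
            and sentence[i + 1][1] == "NE"
        )
        # Steps 2 + 3: tag only if the first word is a known given name.
        if is_pair and word in given_names:
            surname = sentence[i + 1][0]
            out.append(f"<START:person> {word} {surname} <END>")
            i += 2
        else:
            out.append(word)
            i += 1
    return " ".join(out)


example = [("meint", "VVFIN"), ("Salvador", "NE"), ("Lopez", "NE"),
           ("Gonzalez", "NE"), (",", "$,")]
print(tag_person_names(example, {"Salvador"}))
# prints: meint <START:person> Salvador Lopez <END> Gonzalez ,
```

Because only pairs are considered, a third name token such as "Gonzalez" stays outside the tagged span.
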
> h3. Open questions
> I first extracted the 6,271 sentences that mention names and trained only on 
> that (filtered) data. Would it be better to train on the complete data 
> (including the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct: 7644.
>         TOTAL: precision:   99,77%;  recall:   99,80%; F1:   99,78%.
>        person: precision:   99,77%;  recall:   99,80%; F1:   99,78%. [target: 7659; tp: 7644; fp:  18]
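
As a cross-check, the reported percentages follow from the counts in the last line (target, tp, fp) using the standard precision/recall/F1 definitions. The helper name prf is only for illustration; the output above uses a decimal comma, presumably because of the locale it was run in.

```python
from typing import Tuple


def prf(target: int, tp: int, fp: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from entity counts."""
    found = tp + fp                # entities the model produced (7662 here)
    precision = tp / found         # share of found entities that are correct
    recall = tp / target           # share of reference entities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = prf(target=7659, tp=7644, fp=18)
print(f"precision {precision:.2%}, recall {recall:.2%}, F1 {f1:.2%}")
# prints: precision 99.77%, recall 99.80%, F1 99.78%
```
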
>  
> h3. Further Improvements:
> 1.) Some matches are actually location names, so the tagging needs to be 
> refined (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint 
> <START:person> Salvador Lopez <END> Gonzalez , das Oberhaupt von 
> <START:person> San Juan <END> <START:person> Juan Chamula <END> , einem 
> pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez 
> Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")
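
For item 2, one possible direction is to tag the whole run of consecutive NE tokens instead of just a pair. The sketch below is only an assumption about how that could look (tag_name_runs is a hypothetical name, and the input format is the same (word, POS tag) pairs assumed for illustration); it does not address items 1 and 3.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word form, POS tag); "NE" marks a proper noun


def tag_name_runs(sentence: List[Token], given_names: Set[str]) -> str:
    """Tag maximal runs of two or more NE tokens that start with a given name."""
    out: List[str] = []
    i = 0
    while i < len(sentence):
        # Find the end of the run of NE tokens starting at position i.
        j = i
        while j < len(sentence) and sentence[j][1] == "NE":
            j += 1
        run = [word for word, _ in sentence[i:j]]
        if len(run) >= 2 and run[0] in given_names:
            out.append("<START:person> " + " ".join(run) + " <END>")
            i = j
        else:
            # Not a name start: emit one token and retry from the next one.
            out.append(sentence[i][0])
            i += 1
    return " ".join(out)


example = [("meint", "VVFIN"), ("Salvador", "NE"), ("Lopez", "NE"),
           ("Gonzalez", "NE"), (",", "$,")]
print(tag_name_runs(example, {"Salvador"}))
# prints: meint <START:person> Salvador Lopez Gonzalez <END> ,
```

A run such as "San Juan Chamula" would still yield a person tag if "Juan" is a known given name, so a location check would be needed on top of this.
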



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
