[Replies inline] On Fri, 16 Feb 2018 10:50:30 +0100 Tino Didriksen <[email protected]> wrote:
Thanks for the message. You are right, this requires some sorting out.
Here are a few short notes from my / old HFST perspective:

> We have these 3 tasks about adding features to lttoolbox:
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Robust_tokenisation_in_lttoolbox

This is currently implemented in HFST as hfst-tokenise, but it's a
relatively new tool so I have no idea about its workings. I do know it
depends on using the XFST debugging language as the scripting language
for tokenisers.

> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Extend_lttoolbox_to_have_the_power_of_HFST

This encompasses some features of Finite State Morphology (Beesley &
Karttunen, 2003) and its secret appendix, TWOL. E.g. morphographemics
(twol), discontinuous morphs (flag diacritics), maybe reduplications...
Also proper cyclic morphotactics.

> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Add_weights_to_lttoolbox

And this is all the experiments the HFST team and I have published from
the early 2000s until now, plus anything else people have ever learnt
about statistical NLP and WFSTs.

> Why don't we instead add .dix format support to HFST, add weights to
> that format, drop lttoolbox, and just use HFST wholesale? It can do
> all the things we want - just need dix support.

Extending the .dix format(s) to contain what is needed for
morphographemics, weights (or probabilities), etc. is one of the end
goals for me. Getting rid of the legacy lexc, xfst, and twolc formats
would be desirable. The underlying libraries are less relevant. Of
course, having lttoolbox support everything would not be a bad thing
either.

Either way, I think these tasks can be used to submit a GSoC application
that includes 3 months of work (or combined into one).

-- 
Doktor Tommi A Pirinen, Computational Linguist, <https://flammie.github.io/purplemonkeydishwasher/>, Universität Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D Entwickler.
President of ACL SIGUR SIG for Uralic languages <http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
