[Replies inline]
On Fri, 16 Feb 2018 10:50:30 +0100
Tino Didriksen <[email protected]>
wrote:

Thanks for the message. You are right, this requires some sorting out;
here are a few short notes from my / old HFST perspective:

> We have these 3 tasks about adding features to lttoolbox:
> 
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Robust_tokenisation_in_lttoolbox

This is currently implemented in HFST as hfst-tokenise, but it's a
relatively new tool, so I have no idea about its workings. I do know it
depends on using the XFST debugging language as the scripting language
for tokenisers.

> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Extend_lttoolbox_to_have_the_power_of_HFST

This encompasses some features of Finite State Morphology
(Karttunen & Beesley, 2004) and its secret appendix, TWOL. E.g.
morphographemics (twol), discontinuous morphs (flag diacritics), maybe
reduplication... Also proper cyclic morphotactics.

> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Add_weights_to_lttoolbox

And this covers all the experiments the HFST team and I have published
from the early 2000s until now, plus anything else people have ever
learnt about statistical NLP and WFSTs.

> Why don't we instead add .dix format support to HFST, add weights to
> that format, drop lttoolbox, and just use HFST wholesale? It can do
> all the things we want - just need dix support.

Extending the .dix format(s) to contain what is necessary for
morphographemics, weights (or probabilities), etc. is one of the end
goals for me. Getting rid of the lexc, xfst, and twolc legacy formats
would be desirable. The underlying libraries are less relevant. Of
course, having lttoolbox support everything would not be a bad thing
either. Either way, I think these tasks can be used to submit a GSoC
application that includes 3 months of work (or be combined into one).


-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
Entwickler.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.



_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
