Bahodir Mansurov <[email protected]> wrote:

> Hello,
>
> I've been trying to develop HFST and TWOL files for the Uzbek language
> by looking at how other similar languages (Tatar, Kazakh, etc.) have
> done it. Those language rules are very complex, at least for someone
> who doesn't know where to start reading. I usually look for a word and
> then go backwards, deciphering the rule chain to make sense of it. The
> chain gets so long that I start forgetting the start of the rules. So
> copying and pasting existing solutions and modifying them didn't appeal
> to me. That's why I started adding simple rules first and then
> expanding them for each use case. You can see my progress at [1] and
> [2]. (My previous work using the DIX format got so out of hand that I
> gave up developing it.)
>
> As I keep adding or changing more and more rules to fit new use cases, I
> realize that I may be breaking old use cases. That's why I'd like to
> create test cases first and then change the rules and not be worried
> that I broke any previous work. Are there any such tools that you use?

My favourite method is running a corpus through and diffing:

    <corpus.txt apertium -f html-noent fie-bar-dgen > output.1
    edit *fie-bar.dix # hack hack hack
    make -j
    <corpus.txt apertium -f html-noent fie-bar-dgen > output.2
    diff -u output.1 output.2 | dwdiff -c --diff-input

This gives a "big picture" view of what actually improves/degrades for
that language pair, and avoids the noise of changes that only affect
rare words/analyses.

You can use the same method for monolingual data, preferably on the
disambiguated output since those are the only analyses that end up
mattering anyway.
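
If you want the same check to be repeatable, one option is to wrap it in a small shell function that keeps the first output as a baseline and diffs later runs against it. The following is only a minimal sketch under some assumptions: the function name run_regression and the baseline file name are hypothetical, and the pipeline to test is passed in as an ordinary command reading the corpus on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of Apertium):
#   run_regression CORPUS BASELINE CMD [ARGS...]
# Feeds CORPUS to CMD on stdin. On the first run the output is saved as
# BASELINE; on later runs the new output is diffed against BASELINE.
# Returns 0 if nothing changed, 1 if the output differs, 2 on setup failure.
run_regression() {
    corpus=$1
    baseline=$2
    shift 2

    current=$(mktemp) || return 2
    "$@" <"$corpus" >"$current"

    # No baseline yet: record this run and stop.
    if [ ! -f "$baseline" ]; then
        mv "$current" "$baseline"
        echo "baseline created: $baseline"
        return 0
    fi

    # Print a unified diff of anything that changed since the baseline.
    if diff -u "$baseline" "$current"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$current"
    return $rc
}
```

With the placeholder pair from the commands above, a call might look like `run_regression corpus.txt output.baseline apertium -f html-noent fie-bar-dgen`. The diff it prints can still be piped through `dwdiff -c --diff-input` for a word-level view, and once a change is judged an improvement, deleting the baseline file makes the next run record a new one.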
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
