On Tue, 17 Jan 2017 15:13:21 +0300, mansur <[email protected]> wrote:
> Hello, everybody!
>
> I am trying to generate all forms of all words in the Apertium's Tatar
> dictionary. Ilnar Salimzyanov recommended me this way:
>
> hfst-fst2strings -c 1 tat.automorf.hfst > file.txt
>
> It works great, but after launching it consumes more and more RAM
> every second. So in about half an hour or maybe hour according to
> 'top' command RAM usage by 'hfst-fst2strings' process is about 5,5
> Gb. How can I fix it?

Memory usage probably shouldn't grow without bound like that, so the
way to fix it would be to debug the source code for potential leaks
or so :-)

I have had success in generating all forms of all words of Finnish one
by one, using a script that restricts itself to generating one lemma per
call to hfst-fst2strings. E.g.

    for lemma in cat dog mouse ; do
        # spell the lemma as space-separated symbols and append the tag
        echo "$lemma" | sed -e 's/./\0 /g' | sed -e 's/$/ %<n%>/' |
            hfst-regexp2fst -o temp.hfst
        # restrict the analyser to that lemma
        hfst-compose -v -F tat.automorf.hfst temp.hfst -o gen.hfst
        # print the surface forms, one per line, without duplicates
        hfst-fst2strings gen.hfst | cut -d : -f 1 | sort | uniq
    done

should work, but I haven't tested it.

With Finnish morphology this can yield anywhere from ~200 gigabytes to
multiple terabytes of word-forms. I don't know if that is also the case
for Tatar, but it is something you should generally keep in mind when
processing languages like this that are not your average Indo-European
ones.

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>,
Universität Hamburg, Hamburger Zentrum für Sprachkorpora
<http://hzsk.de>. CLARIN-D developer.
President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
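
The first step of the per-lemma loop in the reply, turning a lemma into a space-separated expression with an escaped tag, can be sketched as a small helper. This is only an illustration under a few assumptions: the function name `lemma_to_regex` is hypothetical, the default tag `n` is taken from the example in the reply, and a UTF-8 locale is assumed so that `sed` treats each letter as one character. It uses `&`, the portable equivalent of GNU sed's `\0`.

```shell
#!/bin/sh
# Hypothetical helper (not part of HFST or Apertium): build the expression
# text that the loop in the reply pipes into hfst-regexp2fst.
#   $1 = lemma, $2 = tag name (assumed to default to "n")
lemma_to_regex() {
    lemma=$1
    tag=${2:-n}
    # Put a space after every character so each one becomes its own symbol.
    spaced=$(printf '%s\n' "$lemma" | sed -e 's/./& /g')
    # Append the tag with escaped angle brackets, e.g. %<n%>.
    printf '%s%%<%s%%>\n' "$spaced" "$tag"
}
```

Inside the loop it would replace the two `sed` calls, for example `lemma_to_regex "$lemma" | hfst-regexp2fst -o temp.hfst`. Characters that have a special meaning in the regular expression syntax would need additional escaping with `%`, which this sketch does not attempt.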
