On Tue, 17 Jan 2017 15:13:21 +0300,
mansur <[email protected]> wrote:

> Hello, everybody!
> 
> I am trying to generate all forms of all words in the Apertium's Tatar
> dictionary. Ilnar Salimzyanov recommended me this way:
> 
> hfst-fst2strings -c 1 tat.automorf.hfst > file.txt
> 
> It works great, but after launching it consumes more and more RAM
> every second. So in about half an hour or maybe hour according to
> 'top' command RAM usage by 'hfst-fst2strings' process is about 5,5
> Gb. How can I fix it?

The memory usage probably shouldn't keep growing without bound like
that, so the real fix would be to debug the source code for potential
memory leaks :-)
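
Until the cause is found, one stopgap is to cap the memory of the
process so that it fails early instead of swamping the machine. This is
only a sketch: run_capped is a made-up helper name (not part of HFST),
it relies on the shell's `ulimit -v` (available in bash on Linux), and
it does nothing about the underlying growth.

```shell
# run_capped LIMIT_KB COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in a subshell whose virtual memory is
# capped at LIMIT_KB kilobytes, so the limit does not affect the caller.
run_capped() {
  limit_kb=$1
  shift
  (
    # Exit with 125 if the limit cannot be set at all.
    ulimit -v "$limit_kb" || exit 125
    "$@"
  )
}

# Example (not run here): allow roughly 4 GB of virtual memory.
# run_capped 4194304 hfst-fst2strings -c 1 tat.automorf.hfst > file.txt
```

With the cap in place the process should abort with an allocation
failure once it hits the limit, which at least leaves the partial
output in file.txt.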

I have had success generating all forms of all Finnish words one lemma
at a time, using a script that restricts each call to hfst-fst2strings
to a single lemma. E.g.

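for lemma in cat dog mouse ; do
  # spell the lemma as space-separated symbols and append the noun tag <n>
  echo "$lemma" | sed -e 's/./& /g' | sed -e 's/$/ %<n%>/' |
    hfst-regexp2fst -o temp.hfst
  # restrict the analyser to this lemma (temp.hfst must be complete first)
  hfst-compose -v -F tat.automorf.hfst temp.hfst -o gen.hfst
  hfst-fst2strings gen.hfst | cut -d : -f 1 | sort | uniq
done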

should work but I haven't tested it. 
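
For a full dictionary the same idea could be wrapped so that lemmas come
from a file and each lemma's forms go to its own compressed file. This
is a sketch under the same "untested" caveat as the loop above:
lemma_to_regex, generate_forms, the lemma file and the output directory
are all made-up names, and it assumes a UTF-8 locale (so that sed's "."
matches whole characters) and lemmas free of "/" and of characters that
are special in the regular expression syntax.

```shell
# lemma_to_regex LEMMA
# Hypothetical helper: spells the lemma as space-separated symbols and
# appends the escaped noun tag, e.g. "cat" becomes "c a t %<n%>".
lemma_to_regex() {
  printf '%s\n' "$1" | sed -e 's/./& /g' -e 's/$/%<n%>/'
}

# generate_forms LEMMA_FILE OUT_DIR
# Hypothetical helper: processes one lemma per call to hfst-fst2strings
# and writes the result to OUT_DIR/LEMMA.txt.gz.
generate_forms() {
  lemma_file=$1
  out_dir=$2
  mkdir -p "$out_dir"
  # Read lemmas on file descriptor 3 so the tools cannot consume the list.
  while IFS= read -r lemma <&3; do
    [ -n "$lemma" ] || continue
    lemma_to_regex "$lemma" | hfst-regexp2fst -o temp.hfst
    hfst-compose -F tat.automorf.hfst temp.hfst -o gen.hfst
    hfst-fst2strings gen.hfst | cut -d : -f 1 | sort -u |
      gzip > "$out_dir/$lemma.txt.gz"
  done 3< "$lemma_file"
}

# Example (not run here):
# generate_forms lemmas.txt forms
```

Writing one file per lemma also means an interrupted run can be resumed
by skipping lemmas whose output file already exists.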

With Finnish morphology this can yield anywhere from ~200 gigabytes to
multiple terabytes of word-forms. I don't know if that is also the case
for Tatar, but it is something you should generally keep in mind when
processing languages like this that are not your average Indo-European
ones.
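
A rough way to check this before committing disk space is to measure
the output size for a small sample of lemmas and extrapolate. The
helper below is only a sketch; estimate_total_bytes is a made-up name,
and the estimate is only as good as the sample is representative.

```shell
# estimate_total_bytes TOTAL_LEMMAS
# Hypothetical helper: reads one byte count per line on stdin (the output
# sizes measured for a sample of lemmas) and prints the mean sample size
# multiplied by TOTAL_LEMMAS, i.e. a rough total in bytes.
estimate_total_bytes() {
  awk -v total="$1" '
    { sum += $1; n++ }
    END { if (n > 0) printf "%.0f\n", sum / n * total }
  '
}

# Example (not run here): feed it the "wc -c" of the forms generated for
# each sampled lemma, plus the total number of lemmas in the dictionary.
# printf '%s\n' 120000 95000 210000 | estimate_total_bytes 40000
```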

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
developer.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.



_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
