Hello, Tommi!
1) Unfortunately I couldn't get this script to work, because I am not
very familiar with Apertium and HFST commands and their syntax :)
2) Terabytes of word-forms? Wow, that is quite a lot :)
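For what it's worth, the first stage of the quoted loop (turning a lemma into a regular expression for hfst-regexp2fst) can be tried on its own, without any HFST tools installed. This is only a minimal sketch: the function name lemma_to_regex is made up for illustration, and it assumes a UTF-8 locale so that sed treats each Cyrillic letter as one character rather than as separate bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of HFST or Apertium): build the
# space-separated regular expression that the quoted loop pipes into
# hfst-regexp2fst. Each character of the lemma becomes its own symbol,
# and the noun tag is appended with its angle brackets escaped by '%'.
lemma_to_regex() {
    printf '%s\n' "$1" | sed -e 's/./& /g' -e 's/$/ %<n%>/'
}

# Print the regular expression for a few sample lemmas, one per line.
for lemma in cat dog mouse; do
    lemma_to_regex "$lemma"
done
```

The portable `&` is used here in place of the GNU-specific `\0`; both refer to the whole match. Checking this output first makes it easier to tell whether a later failure comes from the regular expression or from the HFST commands themselves.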
Mansur
On Tue, Jan 17, 2017 at 3:57 PM, Tommi A Pirinen <
[email protected]> wrote:
> On Tue, 17 Jan 2017 15:13:21 +0300,
> mansur <[email protected]> wrote:
>
> > Hello, everybody!
> >
> > I am trying to generate all forms of all words in Apertium's Tatar
> > dictionary. Ilnar Salimzyanov recommended this approach:
> >
> > hfst-fst2strings -c 1 tat.automorf.hfst > file.txt
> >
> > It works great, but after launching it consumes more and more RAM
> > every second. After about half an hour to an hour, the 'top' command
> > shows the 'hfst-fst2strings' process using about 5.5 GB of RAM.
> > How can I fix it?
>
> The memory usage probably shouldn't increase endlessly like that, so
> the way to fix it would be to debug the source code for potential
> leaks or so :-)
>
> I have had success in generating all forms of all words of Finnish one
> by one, using a script that restricts itself to generating one lemma
> per call to hfst-fst2strings. E.g.
>
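> for lemma in cat dog mouse ; do
>     # build a one-lemma automaton such as: c a t  %<n%>
>     echo "$lemma" | sed -e 's/./\0 /g' | sed -e 's/$/ %<n%>/' |
>         hfst-regexp2fst -o temp.hfst
>     # restrict the analyser to that lemma
>     hfst-compose -v -F tat.automorf.hfst temp.hfst -o gen.hfst
>     # print the surface forms, without duplicates
>     hfst-fst2strings gen.hfst | cut -d : -f 1 | sort | uniq
> done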
>
> should work but I haven't tested it.
>
> Using Finnish morphology this can yield anywhere from ~200 gigabytes
> to multiple terabytes of word-forms. I don't know if that is also the
> case for Tatar, but it is something you should generally keep in mind
> when processing languages like this that are not your average
> Indo-European ones.
>
> --
> Doktor Tommi A Pirinen, Computational Linguist,
> <https://flammie.github.io/purplemonkeydishwasher/>, Universität
> Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
> Entwickler. President of ACL SIGUR SIG for Uralic languages
> <http://gtweb.uit.no/sigur/>.
> I tend to follow inline-posting style in desktop e-mail messages.
>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>