Re: [Apertium-stuff] fastmorph v5 - new version of fast corpus search engine

mansur Wed, 01 Mar 2017 04:13:01 -0800

Hello, Mikel

Unfortunately yes: as long as we don't have permission from all the people
and all the resources we took texts from, we cannot publish them anywhere. On
the other hand, the corpus lets you see individual sentences that match your
query. True, you cannot restore the whole text, but most researchers don't
need that.
Anyway, this is a problem for most corpora in the world, and I don't
know what I can do about it.


I am trying to make the corpus useful wherever possible:
Currently I use a frequency list to add new words to apertium-tat (you can
see my commits there). I also published many other frequency lists here:
http://corpus.tatar/stat_en.htm including a frequency list of words absent from
apertium-tat (about a couple of years ago, maybe), but nobody has included them
since then... I understand that Ilnar doesn't have enough time, but
I wonder where the other Tatar people are.
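
For anyone who wants to build a similar list for another language, here is a
minimal sketch of the idea: count the word forms in a text and keep the ones
the analyser does not know. The function name, the is_known callback and the
toy data are assumptions made for illustration; this is not the code used for
the lists on corpus.tatar.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def unknown_word_frequencies(
    tokens: Iterable[str],
    is_known: Callable[[str], bool],
) -> List[Tuple[str, int]]:
    """Count word forms and keep those that is_known() does not recognise."""
    # Lowercase and skip non-alphabetic tokens (punctuation, numbers).
    counts = Counter(tok.lower() for tok in tokens if tok.isalpha())
    unknown = [(word, n) for word, n in counts.items() if not is_known(word)]
    # Most frequent first; ties broken alphabetically for stable output.
    unknown.sort(key=lambda item: (-item[1], item[0]))
    return unknown


# Toy data: a stand-in lexicon and a handful of tokens, not real corpus counts.
toy_lexicon = {"китап", "бир"}
toy_tokens = ["китап", "бәйрәм", "Бәйрәм", "ярдәм", "китап", "бәйрәм"]
result = unknown_word_frequencies(toy_tokens, lambda w: w in toy_lexicon)
print(result)
```

In practice is_known would ask the morphological analyser rather than a
Python set; the set only keeps the sketch self-contained.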

About your quite offensive sentence: "Anyway, this is a typical example of
a publicly funded effort that does not benefit society at large."
1) Our team of enthusiasts is very small; unfortunately, there is not much
help from other people here.
2) We don't get any financial support, at all. At all! During these six
years of working on the corpus I haven't received any money for it. I just do
it as a hobby in my spare time.
3) We add different features all the time to help Tatar linguists (and
others) in their research and work: different types of search, n-grams,
a speech synthesizer, a spellchecker... So I think, given the circumstances,
it benefits society quite well...

I think the main problem here is not the availability of text (or other)
resources, but the lack of Tatar-speaking people willing to participate
in free, open projects.

If you have any ideas or suggestions, please let me know; we can always
discuss them.

All the best,
Mansur

On Wed, Mar 1, 2017 at 1:17 PM, Mikel L. Forcada <[email protected]> wrote:

> Hi Mansur,
>
> it is cool that you have released your corpus search engine under a free
> license, but it is very unfortunate that the corpus itself has a
> restrictive, non-free license. We've talked in the past about this: I
> assume there is no way to reverse this, due to the nature of the content.
>
> Anyway, this is a typical example of a publicly funded effort that does
> not benefit society at large.
>
> Regards
> Mikel
>
>
> On 27/02/17 at 10:19, mansur wrote:
>
> Dear colleagues,
>
> We are happy to announce that the 5th version of our "*fastmorph*" corpus
> search engine has been released. It is distributed under the GNU General Public
> License v3.0 and published on GitHub [2].
>
> We have been working on it since 2014 as the search engine for our Corpus
> of Written Tatar [1].
>
> *Features:*
> - Advanced search options based on any combination of different search
> parameters:
>    * word form,
>    * lemma,
>    * set of morphological tags,
>    * pattern matching (currently "*" and "?" masks are supported),
>    * case matching,
>    * distance to the next word.
> - It receives search queries in JSON format over a UNIX domain socket file.
> - It stores all data in RAM; no MySQL queries are made after initialization.
> *Fastmorph* consumes about *800 MB of RAM* for a corpus
> of *116 mln words* (140 mln tokens).
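
To illustrate the transport only, here is a minimal sketch of a client that
sends one JSON query over a UNIX domain socket and reads one JSON reply. The
query fields, the "end of stream ends the request" framing, the helper names
and the echo server standing in for the search engine are all assumptions
made for illustration; the actual fastmorph protocol may differ, so check the
repository [2] before relying on any of it.

```python
import json
import os
import socket
import tempfile
import threading


def read_all(conn: socket.socket) -> bytes:
    """Read from the socket until the peer closes its sending side."""
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def send_query(socket_path: str, query: dict) -> dict:
    """Send one JSON query over a UNIX domain socket and return the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(socket_path)
        client.sendall(json.dumps(query, ensure_ascii=False).encode("utf-8"))
        # Assumption: closing the write side marks the end of the request.
        client.shutdown(socket.SHUT_WR)
        return json.loads(read_all(client).decode("utf-8"))


def serve_once(server: socket.socket) -> None:
    """Stand-in for the search engine: echo the parsed query back once."""
    conn, _ = server.accept()
    with conn:
        request = json.loads(read_all(conn).decode("utf-8"))
        reply = {"echo": request}
        conn.sendall(json.dumps(reply, ensure_ascii=False).encode("utf-8"))


def demo() -> dict:
    # Hypothetical query layout; the real fastmorph schema may differ.
    query = {
        "words": [
            {"form": "Китап", "case_sensitive": True, "max_distance": 3},
            {"lemma": "бир"},
        ]
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)
            server.listen(1)
            worker = threading.Thread(target=serve_once, args=(server,))
            worker.start()
            reply = send_query(path, query)
            worker.join()
    return reply


if __name__ == "__main__":
    print(demo())
```

A socket file keeps the engine reachable only from the same machine, which
suits a web front end that forwards user queries to a long-running process
holding the corpus in RAM.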
>
> You can try it here:
> http://search.corpus.tatar/search/index_en.html
>
> *Some speed tests:*
> Tests were performed on a machine with the following characteristics:
>     CPU: AMD FX-4100 Quad-Core Processor
>     RAM: 16 GB
>     OS: CentOS release 6.8 (Final)
>     fastmorph: compiled with 4 threads support, x64
>
>     Corpus size: 116 mln word occurrences (140 mln tokens)
>     Return full sentences with sources: 100
>
> Test results for different types of queries:
>
> Query:
>    Word 1: "китап"
> Number of occurrences: 32209
> Query processing time: *0.4 sec.*
>
> Query:
>    Word 1 (case sensitive, distance to the next word *up to 3* words):
> "Китап"
>    Word 2 (if in brackets, then it is a *lemma*): "(бир)"
> Number of occurrences: 15
> Query processing time: *0.4 sec.*
>
> Quite heavy query:
>    Word 1 (word begins with the letter "б", distance to the next word is
> from 1 to 10): "б*"
>    Word 2 (pronoun, word ends with "ң", distance to the next word is
> from 1 to 10): "<prn>*ң"
>    Word 3 (lemma "кил", word ends with "р"): "(кил)*р"
> Number of occurrences: 135210
> Query processing time: *0.8 sec.*
>
> Very heavy query:
>    Word 1 (word ends with "ы", distance to the next word is *from 1
> to 100*): "*ы"
>    Word 2 (word ends with "а", distance to the next word is *from 1
> to 100*): "*а"
>    Word 3 (word ends with "м", distance to the next word is *from 1
> to 100*): "*м"
>    Word 4 (word ends with "с", distance to the next word is *from 1
> to 100*): "*с"
>    Word 5 (word ends with "ь", distance to the next word is *from 1
> to 100*): "*ь"
>    Word 6 (word ends with "е"): "*е"
> Number of occurrences: 135210
> Query processing time: *1.4 sec.*
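
The queries in these tests combine per-word conditions with a distance range
to the next word. A minimal sketch of how such a query can be matched against
one tokenised sentence follows. Token, QueryWord and find_matches are
illustrative names, the toy tokens and their analyses are made up, and a
distance of 1 is assumed to mean the immediately following token; this is not
the fastmorph implementation, which works on in-memory indexes rather than a
plain scan.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    tags: FrozenSet[str]


@dataclass(frozen=True)
class QueryWord:
    matches: Callable[[Token], bool]
    # Allowed distance to the NEXT query word, in tokens (unused for the last word).
    min_dist: int = 1
    max_dist: int = 1


def find_matches(
    sentence: Sequence[Token], query: Sequence[QueryWord]
) -> List[Tuple[int, ...]]:
    """Return one tuple of token positions per way the query fits the sentence."""
    results: List[Tuple[int, ...]] = []

    def extend(q_idx: int, pos: int, chosen: Tuple[int, ...]) -> None:
        if not query[q_idx].matches(sentence[pos]):
            return
        chosen = chosen + (pos,)
        if q_idx == len(query) - 1:
            results.append(chosen)
            return
        word = query[q_idx]
        for dist in range(word.min_dist, word.max_dist + 1):
            nxt = pos + dist
            if nxt >= len(sentence):
                break
            extend(q_idx + 1, nxt, chosen)

    for start in range(len(sentence)):
        extend(0, start, ())
    return results


# Toy sentence; forms, lemmas and tags are illustrative only.
sentence = [
    Token("Китап", "китап", frozenset({"n"})),
    Token("кичә", "кичә", frozenset({"adv"})),
    Token("бирде", "бир", frozenset({"v"})),
]
# Word 1: case-sensitive form "Китап", next word within 1 to 3 tokens.
# Word 2: any form of the lemma "бир".
query = [
    QueryWord(lambda t: t.form == "Китап", min_dist=1, max_dist=3),
    QueryWord(lambda t: t.lemma == "бир"),
]
print(find_matches(sentence, query))
```

Each condition is just a predicate on a token, so word form, lemma, tag set
and mask checks can be combined freely for every word of the query.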
>
>
> *Changelog:*
>
> 27.02.2017 - The 5th version of the fastmorph corpus search engine has been
> released. It now consumes about 2.5 times less memory (RAM).
>
> 18.11.2016 - The 4th version of the fastmorph corpus search engine was
> released. List of changes:
>     - a case-sensitive search option was added;
>     - the memory (RAM) usage of the search system was halved;
>     - thanks to substantial changes in the application architecture, search
> queries now run 3-5 times faster.
> Technical info: version 4 uses about 2 GB of RAM for the 116 mln word corpus.
>
> 19.07.2016 - Some improvements to the complex morphological search engine
> "fastmorph":
>     - in addition to the existing mask "*", which matches any number of any
> characters, the mask "?", which matches any single character, was added.
> You can find more information about it in the updated Guides;
>     - on the technical side, memory usage of the search system was reduced
> by up to 25%.
> Technical info: version 3 uses about 4 GB of RAM for the 116 mln word
> corpus.
>
> 13.06.2016 - Searching by the middle part of a word was added
> to the fastmorph module. For example, if you type *әме*, words like
> ярдәмендә, бәйрәмен, үткәрәмен, өйдәме will be found...
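
The behaviour of the "*" and "?" masks can be sketched by translating a mask
into a regular expression. This is a minimal illustration in Python, with
mask_to_regex and mask_matches as made-up helper names; it is not how
fastmorph itself implements masks.

```python
import re


def mask_to_regex(mask: str, case_sensitive: bool = True) -> re.Pattern:
    """Translate a mask into a regex: "*" is any run of characters, "?" is one."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Every other character must match literally.
            parts.append(re.escape(ch))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile("".join(parts), flags)


def mask_matches(mask: str, word: str, case_sensitive: bool = True) -> bool:
    """True if the whole word fits the mask."""
    return mask_to_regex(mask, case_sensitive).fullmatch(word) is not None


words = ["ярдәмендә", "бәйрәмен", "үткәрәмен", "өйдәме", "китап"]
print([w for w in words if mask_matches("*әме*", w)])
print(mask_matches("кита?", "китап"))
```

Matching against the whole word means that a mask without a leading or
trailing "*" is anchored at that end, so "б*" only finds words that begin
with "б".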
>
> 21.04.2016 - Thanks to processor optimizations and multithreading support
> implemented in the "fastmorph" module, complex
> morphological search now runs up to five times faster.
>
> 03.04.2016 - The features of the complex morphological search system were
> significantly extended. You can find more information about them in the Guides,
> version 3.0 and higher.
>
> 22.02.2016 - A complex morphological search function was added to the Corpus
> of Written Tatar; you can use different combinations of such
> parameters as word form, lemma, grammatical tags, beginning and end of
> words, and the distances between them.
> Technical info: version 1 uses about 6 GB of RAM for the corpus, which consists
> of 116 mln word occurrences. Its speed is quite high.
>
> [1] http://corpus.tatar/en
> [2] https://github.com/mansayk/fastmorph
>
> With best wishes,
> Mansur Saykhunov
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>
>
>
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
>
> --
> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
> Departament de Llenguatges i Sistemes Informàtics
> Universitat d'Alacant
> E-03690 Sant Vicent del Raspeig
> Spain
> Office: +34 96 590 9776
