Re: [Apertium-stuff] fastmorph v5 - new version of fast corpus search engine

Mikel L. Forcada Wed, 01 Mar 2017 02:18:41 -0800

Hi Mansur,

it is cool that you have released your corpus search engine under a free license, but it is very unfortunate that the corpus itself has a restrictive, non-free license. We've talked about this in the past: I assume there is no way to reverse this, due to the nature of the content.

Anyway, this is a typical example of a publicly funded effort that does not benefit society at large.


Regards

Mikel


On 27/02/17 at 10:19, mansur wrote:

Dear colleagues,
We are happy to announce that the 5th version of our "*fastmorph*" corpus search engine has been released. It is distributed under the GNU General Public License v3.0 and published on GitHub [2].
We have been working on it since 2014 as the search engine for our Corpus of Written Tatar [1].
*Features:*
- Advanced search options based on any combination of different search parameters:
   * word form,
   * lemma,
   * set of morphological tags,
   * pattern matching (currently "*" and "?" masks are supported),
   * case matching,
   * distance to the next word.
- It receives search queries in JSON format over a UNIX domain socket file.
- It stores all data in RAM, with no MySQL queries after initialization. *Fastmorph* consumes about *800 MB of RAM* for the corpus consisting of *116 mln words* (140 mln tokens).
You can try it here:
http://search.corpus.tatar/search/index_en.html
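
The "*" and "?" masks listed among the features can be illustrated with a short Python sketch. This is not fastmorph's own code: the function names `mask_to_regex` and `matches` are hypothetical, and translating a mask into a regular expression is just one simple way to get the described behaviour ("*" for any number of any symbols, "?" for exactly one character, optional case matching).

```python
import re


def mask_to_regex(mask: str, case_sensitive: bool = False) -> re.Pattern:
    """Translate a mask with "*" and "?" into a compiled regular expression.

    Hypothetical helper for illustration; every other character is taken literally.
    """
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")   # any number of any symbols (including none)
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.compile("".join(parts), flags)


def matches(mask: str, word: str, case_sensitive: bool = False) -> bool:
    """Return True if the whole word fits the mask."""
    return mask_to_regex(mask, case_sensitive).fullmatch(word) is not None


if __name__ == "__main__":
    for word in ["ярдәмендә", "бәйрәмен", "китап"]:
        print(word, matches("*әме*", word))
```

With this sketch, a mask such as `б*` accepts any word beginning with "б", and `*әме*` accepts any word containing "әме".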

*Some speed tests:*
Tests were performed on a machine with the following characteristics:
    CPU: AMD FX-4100 Quad-Core Processor
    RAM: 16 GB
    OS: CentOS release 6.8 (Final)
    fastmorph: compiled with 4 threads support, x64
    *Corpus size: 116 mln word occurrences (140 mln tokens)*
    *Return full sentences with sources: 100*

Test results for different types of queries:

Query:
   Word 1: *китап*
Number of occurrences: 32209
Query processing time: *0,4 sec.*

Query:
   Word 1 (case sensitive, distance to the next word *up to 3* words): *Китап*
   Word 2 (if in brackets, then it is a *lemma*): *(бир)*
Number of occurrences: 15
Query processing time: *0,4 sec.*

Quite heavy query:
   Word 1 (word begins with the letter "б", distance range to the next word is from 1 to 10): *б**
   Word 2 (pronoun, word ends with "ң", distance range to the next word is from 1 to 10): *<prn>*ң*
   Word 3 (lemma "кил", word ends with "р"): *(кил)*р*
Number of occurrences: 135210
Query processing time: *0,8 sec.*

Very heavy query:
   Word 1 (word ends with "ы", distance range to the next word is *from 1 to 100*): **ы*
   Word 2 (word ends with "а", distance range to the next word is *from 1 to 100*): **а*
   Word 3 (word ends with "м", distance range to the next word is *from 1 to 100*): **м*
   Word 4 (word ends with "с", distance range to the next word is *from 1 to 100*): **с*
   Word 5 (word ends with "ь", distance range to the next word is *from 1 to 100*): **ь*
   Word 6 (word ends with "е"): **е*
Number of occurrences: 135210
Query processing time: *1,4 sec.*
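
The multi-word queries above combine a condition on each word with a distance range to the next word. A rough Python sketch of that idea follows. It is an illustration under assumptions, not fastmorph's algorithm: the name `find_sequences` is hypothetical, a distance of 1 is assumed to mean "the immediately following token", and sentence boundaries, lemmas and tags are not modelled.

```python
from typing import Callable, List, Sequence, Tuple

Predicate = Callable[[str], bool]


def find_sequences(
    tokens: Sequence[str],
    predicates: Sequence[Predicate],
    gaps: Sequence[Tuple[int, int]],
) -> List[Tuple[int, ...]]:
    """Find all position tuples where predicates[k] holds for the k-th word.

    gaps[k] = (lo, hi) is the allowed distance, in tokens, from the k-th
    matched word to the (k+1)-th one, so len(gaps) == len(predicates) - 1.
    """
    results: List[Tuple[int, ...]] = []

    def extend(prefix: List[int]) -> None:
        k = len(prefix)
        if k == len(predicates):
            results.append(tuple(prefix))
            return
        if k == 0:
            candidates = range(len(tokens))
        else:
            lo, hi = gaps[k - 1]
            last = prefix[-1]
            # Clamp the window to the end of the token list.
            candidates = range(last + lo, min(last + hi, len(tokens) - 1) + 1)
        for pos in candidates:
            if predicates[k](tokens[pos]):
                extend(prefix + [pos])

    extend([])
    return results
```

For example, calling it with one predicate testing for the word form "Китап", a second testing for "бир", and `gaps=[(1, 3)]` mirrors the second test query, except that the real query matches "бир" as a lemma rather than as an exact string.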


*Changelog:*
27.02.2017 - The 5th version of the fastmorph corpus search engine is released. It now consumes about 2,5 times less memory (RAM).
18.11.2016 - The 4th version of the fastmorph corpus search engine is released. List of changes:
    - a case-sensitive search option was added;
    - memory (RAM) usage by the search system was halved;
    - because of essential changes in the application architecture, search queries now run 3 - 5 times faster.
    Technical info: version 4 uses about 2 GB of RAM for the 116 mln words corpus.
19.07.2016 - Some improvements in the complex morphological search engine "fastmorph":
    - in addition to the existing mask "*", which matches any number of any symbols, the mask "?", which represents any single character, was added. You can find more information about it in the updated Guides;
    - on the technical side, memory usage by the search system was reduced by up to 25%.
    Technical info: version 3 uses about 4 GB of RAM for the 116 mln words corpus.
13.06.2016 - Search by the middle part of a word was added to the fastmorph module. For example, if you type *әме*, words like ярдәмендә, бәйрәмен, үткәрәмен, өйдәме will be found...
21.04.2016 - Thanks to processor optimizations and multithreading support implemented in the "fastmorph" module, complex morphological search now performs up to five times faster.
03.04.2016 - The features of the complex morphological search system were significantly extended. You can find more information about them in the Guides, updated to version 3.0 and higher.
22.02.2016 - A complex morphological search function appeared in the Corpus of Written Tatar, where you can use different combinations of such parameters as word form, lemma, grammatical tags, beginning and end of words, and distances between them.
    Technical info: version 1 uses about 6 GB of RAM for the corpus, consisting of 116 mln word occurrences. Its speed is quite high.
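
The memory figures quoted in the announcement and changelog can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the message does not specify, and uses only the numbers stated above.

```python
MB = 10**6  # assumption: decimal megabytes
GB = 10**9  # assumption: decimal gigabytes

TOKENS = 140_000_000  # 140 mln tokens
WORDS = 116_000_000   # 116 mln word occurrences

# RAM usage per version, as stated in the changelog and the announcement.
ram_by_version = {1: 6 * GB, 3: 4 * GB, 4: 2 * GB, 5: 800 * MB}

bytes_per_token = ram_by_version[5] / TOKENS
bytes_per_word = ram_by_version[5] / WORDS
v4_to_v5 = ram_by_version[4] / ram_by_version[5]
v1_to_v5 = ram_by_version[1] / ram_by_version[5]

print(f"v5: {bytes_per_token:.1f} bytes per token, {bytes_per_word:.1f} bytes per word")
print(f"v4 -> v5 reduction factor: {v4_to_v5:.1f}")
print(f"v1 -> v5 reduction factor: {v1_to_v5:.1f}")
```

Under these assumptions version 5 needs a little under 6 bytes of RAM per token, and the step from 2 GB to 800 MB matches the "about 2,5 times less memory" stated for version 5.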
[1] http://corpus.tatar/en
[2] https://github.com/mansayk/fastmorph
With best wishes,
Mansur Saykhunov


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


--
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776
