[Apertium-stuff] fastmorph v5 - new version of fast corpus search engine

mansur Mon, 27 Feb 2017 02:20:36 -0800

Dear colleagues,

We are happy to announce that the 5th version of our "*fastmorph*" corpus
search engine has been released. It is distributed under the GNU General
Public License v3.0 and published on GitHub [2].


We have been working on it since 2014 as the search engine for our Corpus
of Written Tatar [1].

*Features:*
- Advanced search options based on any combination of different search
parameters:
   * word form,
   * lemma,
   * set of morphological tags,
   * pattern matching (currently "*" and "?" masks are supported),
   * case matching,
   * distance to the next word.
- It receives search queries in JSON format over a UNIX domain socket.
- It stores all data in RAM and makes no MySQL queries after initialization.
*Fastmorph* consumes about *800 MB of RAM* for a corpus
of *116 mln words* (140 mln tokens).
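
To illustrate the "JSON over a UNIX domain socket" interface mentioned
above, here is a minimal client sketch in Python. The socket path, the JSON
field names and the framing (the client half-closes the connection, the
server replies and then closes) are all assumptions made for illustration;
the actual protocol is defined by the fastmorph sources [2].

```python
import json
import socket


def build_query(words):
    """Serialize a list of per-word conditions to UTF-8 JSON.

    The top-level "words" key is hypothetical, not the real fastmorph schema.
    """
    return json.dumps({"words": words}, ensure_ascii=False).encode("utf-8")


def send_query(socket_path, payload, bufsize=65536):
    """Send one JSON payload over a UNIX domain socket and parse the reply.

    Assumed framing: we half-close after sending, then read until EOF.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(payload)
        sock.shutdown(socket.SHUT_WR)  # signal "request complete"
        chunks = []
        while True:
            data = sock.recv(bufsize)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


# Hypothetical request for the word form "китап"; field names are invented.
example_payload = build_query([{"wordform": "китап", "case_sensitive": False}])
```

A caller would pass the path of the socket file the daemon listens on, for
example `send_query("/tmp/fastmorph.sock", example_payload)`, where that
path is again only a placeholder.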

You can try it here:
http://search.corpus.tatar/search/index_en.html

*Some speed tests:*
Tests were performed on a machine with the following characteristics:
    CPU: AMD FX-4100 Quad-Core Processor
    RAM: 16 GB
    OS: CentOS release 6.8 (Final)
    fastmorph: compiled with support for 4 threads, x64

    Corpus size: 116 mln word occurrences (140 mln tokens)
    Return full sentences with sources: 100

Test results for different types of queries:

Query:
   Word 1: *китап*
Number of occurrences: 32209
Query processing time: *0,4 sec.*

Query:
   Word 1 (case sensitive, distance to the next word *up to 3* words):
*Китап*
   Word 2 (if in brackets, then it is a *lemma*): *(бир)*
Number of occurrences: 15
Query processing time: *0,4 sec.*

Quite heavy query:
   Word 1 (word begins with "б" letter, distance range to the next word is
from 1 to 10): *б**
   Word 2 (pronoun, word ends with "ң", distance range to the next word is
from 1 to 10): *<prn>*ң*
   Word 3 (lemma "кил", word ends with "р"): *(кил)*р*
Number of occurrences: 135210
Query processing time: *0,8 sec.*

Very heavy query:
   Word 1 (word ends with "ы", distance range to the next word is *from 1
to 100*): **ы*
   Word 2 (word ends with "а", distance range to the next word is *from 1
to 100*): **а*
   Word 3 (word ends with "м", distance range to the next word is *from 1
to 100*): **м*
   Word 4 (word ends with "с", distance range to the next word is *from 1
to 100*): **с*
   Word 5 (word ends with "ь", distance range to the next word is *from 1
to 100*): **ь*
   Word 6 (word ends with "е"): **е*
Number of occurrences: 135210
Query processing time: *1,4 sec.*
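
The multi-word queries above pair a condition on each word with a distance
range to the next word. The following Python sketch shows one
straightforward way such a query could be evaluated over a single tokenized
sentence. It is not the fastmorph implementation; it assumes that a distance
of 1 means "the immediately following token", and the names `find_matches`
and `Predicate` are invented for this example.

```python
from typing import Callable, List, Sequence, Tuple

Predicate = Callable[[str], bool]
# One query word: (condition on the token, (min_dist, max_dist) to the NEXT
# query word). The range of the last query word is ignored.
QueryWord = Tuple[Predicate, Tuple[int, int]]


def find_matches(tokens: Sequence[str],
                 query: Sequence[QueryWord]) -> List[Tuple[int, ...]]:
    """Return every tuple of token positions that satisfies the query."""
    results: List[Tuple[int, ...]] = []

    def extend(pos: int, qi: int, path: List[int]) -> None:
        pred, (lo, hi) = query[qi]
        if not pred(tokens[pos]):
            return
        path = path + [pos]
        if qi == len(query) - 1:  # last query word matched
            results.append(tuple(path))
            return
        for dist in range(lo, hi + 1):  # try every allowed distance
            nxt = pos + dist
            if nxt >= len(tokens):
                break
            extend(nxt, qi + 1, path)

    for start in range(len(tokens)):
        extend(start, 0, [])
    return results


tokens = ["бу", "китап", "бик", "яхшы", "китап"]
query = [
    (lambda w: w.startswith("б"), (1, 3)),  # begins with "б", distance 1 to 3
    (lambda w: w == "китап", (0, 0)),       # last word: range unused
]
print(find_matches(tokens, query))  # [(0, 1), (2, 4)]
```

A brute-force scan like this grows quickly with wide distance ranges, which
is consistent with the 1 to 100 ranges being labelled "very heavy" above.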


*Changelog:*

27.02.2017 - The 5th version of the fastmorph corpus search engine is
released. It now consumes about 2,5 times less memory (RAM).
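
The memory figures quoted in this announcement can be cross-checked with a
few lines of arithmetic. The sketch below uses only numbers stated in this
message (about 800 MB for version 5, about 2 GB for version 4, 116 mln
words, 140 mln tokens) and assumes decimal units, so the results are rough
estimates only.

```python
# All inputs are the approximate figures from this announcement;
# decimal megabytes/gigabytes are an assumption.
TOKENS = 140_000_000
WORDS = 116_000_000
V5_RAM_BYTES = 800 * 10**6
V4_RAM_BYTES = 2 * 10**9

bytes_per_token = V5_RAM_BYTES / TOKENS
bytes_per_word = V5_RAM_BYTES / WORDS
v4_to_v5_ratio = V4_RAM_BYTES / V5_RAM_BYTES

print(f"{bytes_per_token:.1f} bytes per token")  # 5.7 bytes per token
print(f"{bytes_per_word:.1f} bytes per word")    # 6.9 bytes per word
print(f"v4/v5 memory ratio: {v4_to_v5_ratio}")   # v4/v5 memory ratio: 2.5
```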

18.11.2016 - The 4th version of the fastmorph corpus search engine is
released.
List of changes:
    - a case-sensitive search option was added;
    - the memory (RAM) usage of the search system was halved;
    - because of essential changes in the application architecture, search
queries now run 3 - 5 times faster.
Technical info: version 4 uses about 2 GB of RAM for the 116 mln word corpus.

19.07.2016 - Some improvements in the complex morphological search engine
"fastmorph":
    - in addition to the existing mask "*", which matches any number of any
characters, the mask "?", which matches any single character, was added.
You can find more information about it in the updated Guides;
    - on the technical side, the memory usage of the search system was
reduced by up to 25%.
Technical info: version 3 uses about 4 GB of RAM for the 116 mln word corpus.

13.06.2016 - Search by the middle part of a word was added to
the fastmorph module. For example, if you type *әме*, words like ярдәмендә,
бәйрәмен, үткәрәмен, өйдәме will be found...
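
The behaviour of the "*" and "?" masks described in this changelog can be
reproduced with a short Python sketch that translates a mask into a regular
expression. This only illustrates the matching semantics; it says nothing
about how fastmorph implements masks internally, and the helper names
`mask_to_regex` and `matches` are invented for this example.

```python
import re


def mask_to_regex(mask: str) -> "re.Pattern[str]":
    """Translate a mask into a regex: "*" is any run of characters
    (possibly empty), "?" is exactly one character, anything else is
    matched literally."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(mask: str, word: str) -> bool:
    """True if the whole word matches the mask."""
    return mask_to_regex(mask).fullmatch(word) is not None


# Words from the changelog example for the mask *әме*.
for word in ["ярдәмендә", "бәйрәмен", "үткәрәмен", "өйдәме"]:
    print(word, matches("*әме*", word))  # True for all four words
```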

21.04.2016 - Thanks to some processor optimizations and multithreading
support implemented in the "fastmorph" module, complex morphological search
now runs up to five times faster.

03.04.2016 - The features of the complex morphological search system were
significantly extended. You can find more information about them in the
Guides, version 3.0 and higher.

22.02.2016 - A complex morphological search function appeared in the Corpus
of Written Tatar, where you can use different combinations of such
parameters as word form, lemma, grammatical tags, beginning and end of
words, and distances between them.
Technical info: version 1 uses about 6 GB of RAM for the corpus, consisting
of 116 mln word occurrences. Its speed is quite high.

[1] http://corpus.tatar/en
[2] https://github.com/mansayk/fastmorph

With best wishes,
Mansur Saykhunov


