Hi Mansur,
It is cool that you have released your corpus search engine under a free
license, but it is very unfortunate that the corpus itself has a
restrictive, non-free license. We've talked in the past about this: I
assume there is no way to reverse this, due to the nature of the content.
Anyway, this is a typical example of a publicly funded effort that does
not benefit society at large.
Regards
Mikel
On 27/02/17 at 10:19, mansur wrote:
Dear colleagues,
We are happy to announce that the 5th version of our "*fastmorph*"
corpus search engine has been released. It is distributed under the GNU
General Public License v3.0 and published on GitHub [2].
We have been working on it since 2014 as the search engine for our
Corpus of Written Tatar [1].
*Features:*
- Advanced search options based on any combination of different search
parameters:
* word form,
* lemma,
* set of morphological tags,
* pattern matching (currently "*" and "?" masks are supported),
* case matching,
* distance to the next word.
- It receives search queries in JSON format over a UNIX domain socket.
- It stores all data in RAM, no MySQL queries after initialization.
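As a rough illustration of the socket interface listed above, here is a minimal Python client sketch. The socket path, the JSON field names, and the end-of-request convention are all assumptions made for the example; the actual request schema is not described in this message, so check the repository [2] before relying on any of it.

```python
import json
import socket

# Hypothetical socket path: the real location is not given in this message.
SOCKET_PATH = "/tmp/fastmorph.sock"


def build_query(words):
    """Serialize per-word search parameters to UTF-8 JSON bytes.

    The "words" key and the per-word field names are illustrative only.
    """
    return json.dumps({"words": words}, ensure_ascii=False).encode("utf-8")


def send_query(payload, path=SOCKET_PATH, bufsize=65536):
    """Send one JSON request over a UNIX domain socket and decode the reply.

    Assumes the server reads until the client closes its write side and
    answers with a single JSON document.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(payload)
        sock.shutdown(socket.SHUT_WR)  # signal end of request (assumption)
        chunks = []
        while True:
            chunk = sock.recv(bufsize)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks).decode("utf-8"))


# Two-word query: a case-sensitive word form followed by a lemma.
query = build_query([
    {"wordform": "Китап", "case_sensitive": True, "max_distance": 3},
    {"lemma": "бир"},
])
```

Calling send_query(query) only makes sense with a running server listening on the chosen socket path.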
*Fastmorph* consumes about *800 MB of RAM* for a corpus
consisting of *116 mln words* (140 mln tokens).
You can try it here:
http://search.corpus.tatar/search/index_en.html
*Some speed tests:*
Tests were performed on a machine with the following characteristics:
CPU: AMD FX-4100 Quad-Core Processor
RAM: 16 GB
OS: CentOS release 6.8 (Final)
fastmorph: compiled with 4 threads support, x64
Corpus size: 116 mln word occurrences (140 mln tokens)
Full sentences returned with sources: 100
Test results for different types of queries:
Query:
Word 1: *китап*
Number of occurrences: 32209
Query processing time: *0.4 sec.*
Query:
Word 1 (case sensitive, distance to the next word *up to 3* words):
*Китап*
Word 2 (if in brackets, then it is a *lemma*): *(бир)*
Number of occurrences: 15
Query processing time: *0.4 sec.*
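To make the "distance to the next word" parameter concrete, here is a small Python sketch of a two-word query like the one above. It is only an illustration under stated assumptions: a distance of 1 is taken to mean the immediately following token, the token list is a made-up toy sentence, and the naive scan says nothing about how fastmorph itself is implemented.

```python
def find_pairs(tokens, match_first, match_second, min_dist=1, max_dist=3):
    """Return (i, j) index pairs where tokens[i] satisfies match_first and
    tokens[j] satisfies match_second, with min_dist <= j - i <= max_dist."""
    hits = []
    for i, tok in enumerate(tokens):
        if not match_first(tok):
            continue
        for dist in range(min_dist, max_dist + 1):
            j = i + dist
            if j >= len(tokens):
                break  # ran past the end of the token list
            if match_second(tokens[j]):
                hits.append((i, j))
    return hits


# Made-up toy data: (word form, lemma) pairs.
tokens = [
    ("Китап", "китап"),
    ("миңа", "мин"),
    ("бирде", "бир"),
    ("ул", "ул"),
    ("китап", "китап"),
    ("бирә", "бир"),
]

# Word 1: case-sensitive word form "Китап"; word 2: lemma "бир", up to 3 words away.
hits = find_pairs(
    tokens,
    lambda t: t[0] == "Китап",
    lambda t: t[1] == "бир",
    min_dist=1,
    max_dist=3,
)
```

Here hits contains only the pair (0, 2): the lowercase "китап" at index 4 is skipped because the first word is matched case-sensitively.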
Quite heavy query:
Word 1 (word begins with "б" letter, distance range to the next
word is from 1 to 10): *б**
Word 2 (pronoun, word ends with "ң", distance range to the next
word is from 1 to 10): *<prn>*ң*
Word 3 (lemma "кил", word ends with "р"): *(кил)*р*
Number of occurrences: 135210
Query processing time: *0.8 sec.*
Very heavy query:
Word 1 (word ends with "ы", distance range to the next word is *from
1 to 100*): **ы*
Word 2 (word ends with "а", distance range to the next word is *from
1 to 100*): **а*
Word 3 (word ends with "м", distance range to the next word is
*from 1 to 100*): **м*
Word 4 (word ends with "с", distance range to the next word is *from
1 to 100*): **с*
Word 5 (word ends with "ь", distance range to the next word is
*from 1 to 100*): **ь*
Word 6 (word ends with "е"): **е*
Number of occurrences: 135210
Query processing time: *1.4 sec.*
*Changelog:*
27.02.2017 - The 5th version of the fastmorph corpus search engine was
released. It now consumes about 2.5 times less memory (RAM).
18.11.2016 - The 4th version of the fastmorph corpus search engine was
released. List of changes:
- a case-sensitive search option was added;
- the memory (RAM) usage of the search system was halved;
- because of substantial changes in the application architecture,
search queries now run 3 to 5 times faster.
Technical info: version 4 uses about 2 GB of RAM for the 116 mln word
corpus.
19.07.2016 - Some improvements in the complex morphological search
engine "fastmorph":
- in addition to the existing mask "*", which matches any number of
any symbols, the mask "?", which matches any single character, was
added. You can find more information about it in the updated Guides;
- on the technical side, memory usage by the search system was
reduced by up to 25%.
Technical info: version 3 uses about 4 GB of RAM for the 116 mln word
corpus.
13.06.2016 - Search by the middle part of a word was added to the
fastmorph module. For example, if you type *әме*, words like
ярдәмендә, бәйрәмен, үткәрәмен, өйдәме will be found...
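The mask semantics described in the changelog ("*" for any number of any symbols, "?" for exactly one character) can be sketched in Python by translating a mask into a regular expression. This is an illustrative reimplementation, not fastmorph's own matching code, and the helper names are made up for the example.

```python
import re


def mask_to_regex(mask):
    """Translate a mask with "*" and "?" into a compiled regular expression.

    "*" matches any run of characters (including none), "?" matches exactly
    one character, and everything else is matched literally.
    """
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matches(mask, word):
    """Return True if the whole word matches the mask (case-sensitive)."""
    return mask_to_regex(mask).fullmatch(word) is not None


words = ["ярдәмендә", "бәйрәмен", "үткәрәмен", "өйдәме", "китап"]

# Search by the middle part of a word: "*әме*"
found = [w for w in words if matches("*әме*", w)]
```

With this word list, found holds the four words from the changelog example, and "китап" is left out because it does not contain "әме".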
21.04.2016 - Thanks to processor optimizations and multithreading
support implemented in the "fastmorph" module, complex morphological
search now runs up to five times faster.
03.04.2016 - The features of the complex morphological search system
were significantly extended. More information is available in the
Guides, updated to version 3.0 and higher.
22.02.2016 - Complex morphological search was introduced in the
Corpus of Written Tatar. It lets you combine parameters such as
word form, lemma, grammatical tags, beginnings and ends of words,
and distances between words.
Technical info: version 1 uses about 6 GB of RAM for the corpus,
consisting of 116 mln word occurrences. Its speed is quite high.
[1] http://corpus.tatar/en
[2] https://github.com/mansayk/fastmorph
With best wishes,
Mansur Saykhunov
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
--
Mikel L. Forcada http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776