I am testing Solr as one of the potential tools for building a linguistic corpus.

I would like to have your opinion on the extent to which it would be a good choice.

The specific requirements that, in my view, deviate from the typical use of Solr are:

1. The basic unit is a text (of a book, of a newspaper article, etc.) with a bibliographic header.
Looking at the examples in the Solr tutorial and at the central concept of the "field", I am a bit
confused about how to map these onto one another, i.e. would the whole text be one field "text"
and the bibliographic-header items individual fields?
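
For illustration, this is roughly the mapping I have in mind (all field and type names below are
just my placeholders, not something taken from the tutorial): in schema.xml, one big field for the
running text and one small field per bibliographic item:

   <field name="id"     type="string"       indexed="true" stored="true"/>
   <field name="title"  type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="year"   type="int"          indexed="true" stored="true"/>
   <field name="text"   type="text_corpus"  indexed="true" stored="true"/>

and a document posted as:

   <add>
     <doc>
       <field name="id">doc0001</field>
       <field name="title">some newspaper article</field>
       <field name="author">some author</field>
       <field name="year">1991</field>
       <field name="text">the full running text of the article ...</field>
     </doc>
   </add>

Is that the intended way to model it, or would you structure it differently?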

2. Real full text: no stop words, really every word is indexed, possibly even all the punctuation.
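
If I understand the analyzer chains correctly, this would just mean a field type without any stop
filter, something like the following (my guess, untested; it assumes the corpus text is already
tokenized so that punctuation arrives as separate whitespace-delimited tokens):

   <fieldType name="text_corpus" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <!-- whitespace tokenization only: no StopFilterFactory, so every
            token (including punctuation tokens) gets indexed -->
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     </analyzer>
   </fieldType>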

3. Lemma and PoS tag (part-of-speech tag), or generally additional keys on the word level,
i.e. for every word we also have its lemma value and its PoS.
In dedicated systems this is implemented either as a vertical format (each word on one line):
word   lemma   pos
...
or, in newer systems, with XML attributes:
<w pos="Noun" lemma="tree">trees</w>

It is important that it be possible to mix these various layers in one query, e.g.:
"word(some) lemma(nice) pos(Noun)"

This seems to me to be the biggest challenge for Solr.
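
Just to make the problem concrete, the naive mapping I could come up with is three parallel fields
filled from the same token stream (again, all names are only my placeholders), e.g. for
"some nice trees grow":

   <field name="word">some nice trees grow</field>
   <field name="lemma">some nice tree grow</field>
   <field name="pos">Det Adj Noun Verb</field>

and then a boolean query such as:

   word:some AND lemma:nice AND pos:Noun

but as far as I can see this only states that the three tokens occur somewhere in the same
document, not that they sit on three consecutive word positions, which is what the corpus query
above actually means. Is there a way in Solr/Lucene to constrain such a query by token position
across fields?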

4. Indexing/searching ratio
The corpus is very static: the selection of texts changes perhaps once a year (in a production environment), so it does not really matter how long the indexing takes. Regarding speed, the emphasis is on the searches, which have to be "fast" and exact, and the results have to be further processable (KWIC view, thinning the result set, sorting, export, etc.). "Fast" is important also for more complex queries (ranges, boolean operators and prefixes mixed); I would say 10-15 seconds is the upper limit, which should be the exception rather than the rule of roughly 1 second.
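
A typical "more complex" query would mix, say, a range on a header field with prefixes and boolean
operators on the word-level fields, e.g. (again with my placeholder field names):

   year:[1990 TO 2000] AND (lemma:tree* OR lemma:forest) AND pos:Noun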

5. Also regarding the size: we are talking about multiples of 100 million tokens. The canonical example, the British National Corpus, has 100 million; there are corpora with 2 billion tokens.



Thank you in advance.

Regards,
Matej
