I am testing Solr as one of the potential tools for building a
linguistic corpus.
I'd like to hear your opinion on how good a choice it would be.
The specifics which I think deviate from typical Solr usage are:
1. The basic unit is a text (of a book, a newspaper article, etc.) with
a bibliographic header.
Looking at the examples in the Solr tutorial and the central concept of
the "field",
I am a bit confused about how to map the two onto one another, i.e. would
the whole text be one field "text"
and the bib-header items individual fields?
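To make the question concrete, here is a minimal schema sketch of what I
have in mind (the field names and types are only placeholders, and I am
not sure this is the intended way to model it):

  <!-- one Solr document per text; the bib-header items as separate fields -->
  <field name="id"     type="string"       indexed="true" stored="true" required="true"/>
  <field name="author" type="string"       indexed="true" stored="true"/>
  <field name="title"  type="text_general" indexed="true" stored="true"/>
  <field name="year"   type="string"       indexed="true" stored="true"/>
  <!-- the whole running text in one big field (type sketched under point 2) -->
  <field name="text"   type="text_full"    indexed="true" stored="true"/>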
2. Real full text (no stop words: really every word is indexed, possibly
even all the punctuation).
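What I imagine here (again just a sketch, I may well be misusing the
analyzer chain) is a field type with no stop filter and no stemming at all:

  <fieldType name="text_full" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- whitespace tokenization only; deliberately no StopFilterFactory,
           no stemming, no lowercasing -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

I suspect that indexing punctuation as separate tokens would still need a
custom tokenizer, since the whitespace tokenizer leaves punctuation attached
to the neighbouring word.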
3. Lemma and PoS tag (part-of-speech tag), or generally additional keys
on the word level,
i.e. for every word we also have its lemma value and its PoS.
In dedicated systems this is implemented either as a vertical format (each
word on one line):
word lemma pos
...
or in newer systems with XML attributes:
<w pos="Noun" lemma="tree">trees</w>
What is important is that it has to be possible to mix these various
layers in one query, e.g.:
"word(some) lemma(nice) pos(Noun)"
This seems to me to be the biggest challenge for Solr.
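The only workaround I can think of so far is to index the layers as
parallel fields filled from the vertical file, for example (the field
names are made up):

  <doc>
    <field name="id">sample_001</field>
    <field name="word">the trees grow</field>
    <field name="lemma">the tree grow</field>
    <field name="pos">Det Noun Verb</field>
  </doc>

But as far as I understand, this only lets me combine the layers per
document, not per token position, so a query like the one above would not
work out of the box. Is there a recommended way to do this (payloads,
indexing combined tokens like trees|tree|Noun, ...)?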
4. Indexing/searching ratio
The corpus is very static: the selection of texts changes perhaps once a
year (in a production environment),
so it doesn't really matter how long the indexing takes. Regarding speed,
the emphasis is on the searches,
which have to be fast and exact, and the results have to be further
processable (KWIC view, thinning the result set, sorting, export, etc.).
"Fast" is important also for more complex queries (ranges, boolean
operators and prefixes mixed),
and I would say 10-15 seconds is the upper limit, which should be rather
the exception to the rule of ~1 second.
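As an example of the kind of complex query I mean (using the sketched
field names from above):

  q=word:pre* AND pos:Noun AND year:[1990 TO 2000]

i.e. a prefix query on the word layer combined with a constraint on the
PoS layer and a range on a bib-header field.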
5. Also to consider is the size: we are talking about multiples of 100
million tokens.
The canonical example, the British National Corpus, has 100 million tokens;
there are corpora with 2 billion tokens.
Thank you in advance,
regards,
Matej