I am testing Solr as one of the potential tools for building a
linguistic corpus.
I'd like to hear your opinion on how good a choice it would be.
The specifics which I think deviate from typical Solr usage are:
1. The basic unit is a text (of a book, a newspaper article, etc.) with
a bibliographic header.
Looking at the examples in the Solr tutorial and the central concept of
the "field",
I am a bit confused about how to map the two onto one another, i.e. would
the whole text be one field "text"
and the bib-header items individual fields?
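To make the question concrete, here is a minimal schema sketch of what I
have in mind (the field names and types are only placeholders, and I am
not sure this is the intended way to model it):

  <!-- one Solr document per text; the bib-header items as separate fields -->
  <field name="id"     type="string"       indexed="true" stored="true" required="true"/>
  <field name="author" type="string"       indexed="true" stored="true"/>
  <field name="title"  type="text_general" indexed="true" stored="true"/>
  <field name="year"   type="string"       indexed="true" stored="true"/>
  <!-- the whole running text in one big field (type sketched under point 2) -->
  <field name="text"   type="text_full"    indexed="true" stored="true"/>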
2. Real full text (no stop words: really every word is indexed, possibly
even all the punctuation).
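What I imagine here (again just a sketch, I may well be misusing the
analyzer chain) is a field type with no stop filter and no stemming at all:

  <fieldType name="text_full" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- whitespace tokenization only; deliberately no StopFilterFactory,
           no stemming, no lowercasing -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

I suspect that indexing punctuation as separate tokens would still need a
custom tokenizer, since the whitespace tokenizer leaves punctuation attached
to the neighbouring word.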
3. Lemma and PoS tag (part-of-speech tag), or generally additional keys
on the word level,
i.e. for every word we also have its lemma value and its PoS.
In dedicated systems this is implemented either as a vertical format (each
word on one line):
word lemma pos
...
or in newer systems with XML attributes:
<w pos="Noun" lemma="tree">trees</w>
What is important is that it has to be possible to mix these various
layers in one query, e.g.:
"word(some) lemma(nice) pos(Noun)"
This seems to me to be the biggest challenge for Solr.
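The only workaround I can think of so far is to index the layers as
parallel fields filled from the vertical file, for example (the field
names are made up):

  <doc>
    <field name="id">sample_001</field>
    <field name="word">the trees grow</field>
    <field name="lemma">the tree grow</field>
    <field name="pos">Det Noun Verb</field>
  </doc>

But as far as I understand, this only lets me combine the layers per
document, not per token position, so a query like the one above would not
work out of the box. Is there a recommended way to do this (payloads,
indexing combined tokens like trees|tree|Noun, ...)?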
4. Indexing/searching ratio
The corpus is very static: the selection of texts changes perhaps once a
year (in a production environment),
so it doesn't really matter how long the indexing takes. Regarding speed,
the emphasis is on the searches,
which have to be fast and exact, and the results have to be further
processable (KWIC view, thinning the result set, sorting, export, etc.).
"Fast" is important also for more complex queries (ranges, boolean
operators and prefixes mixed),
and I would say 10-15 seconds is the upper limit, which should be rather
the exception to the rule of ~1 second.
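As an example of the kind of complex query I mean (using the sketched
field names from above):

  q=word:pre* AND pos:Noun AND year:[1990 TO 2000]

i.e. a prefix query on the word layer combined with a constraint on the
PoS layer and a range on a bib-header field.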
5. Also to consider is the size: we are talking about multiples of 100
million tokens.
The canonical example, the British National Corpus, has 100 million tokens;
there are corpora with 2 billion tokens.
Thank you in advance,
regards,
Matej