Hi Matej,

Since I didn't see anyone answer your question yet, I'll have a go at it. I'm not one of the Solr developers; I've just used Solr so far and am very happy with it. I use it for searching literary texts, storing information from an SQL database in the Solr documents as metadata for the texts.


[EMAIL PROTECTED] wrote:
> I am testing Solr as one of the potential tools for building a linguistic corpus.

> I'd like to have your opinion on the extent to which it would be a good choice.

> The specifics, which I find deviate from the typical use of Solr, are:

> 1. The basic unit is a text (of a book, a newspaper article, etc.) with a bibliographic header. Looking at the examples in the Solr tutorial and the central concept of the "field", I am a bit confused about how to map these onto one another, i.e. would the whole text be one field "text"
> and the bib-header items individual fields?
Yes, you could do that. What I did was: add the text as a whole in one field, add each chapter in its own field, and add metadata fields from an SQL database for each title (e.g. year=1966, author.name=Some one, author.placeofbirth=Somewhere). Basically, everything you want to explicitly search for/in goes in a separate field.
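For what it's worth, here is a rough Python sketch of what adding such a document over Solr's XML update interface could look like. The core URL and the field names are just assumptions from my own setup; they would have to match whatever you declare in your schema.xml.

import urllib.request

# A sketch only: the URL and the field names (id, title, year, author_name,
# author_placeofbirth, text, chapter_1) are assumptions and have to match
# the fields declared in schema.xml.
doc_xml = """<add>
  <doc>
    <field name="id">book-001</field>
    <field name="title">Some Title</field>
    <field name="year">1966</field>
    <field name="author_name">Some one</field>
    <field name="author_placeofbirth">Somewhere</field>
    <field name="text">... the full text of the book ...</field>
    <field name="chapter_1">... the text of chapter 1 ...</field>
  </doc>
</add>"""

for payload in (doc_xml.encode("utf-8"), b"<commit/>"):
    req = urllib.request.Request(
        "http://localhost:8983/solr/update",
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    urllib.request.urlopen(req)  # post the document, then commit it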

> 2. Real full text (no stop words; really every word is indexed, possibly even all the punctuation).
Shouldn't be a problem, I think: the analyzers are configured per field type in schema.xml, so you can simply leave out the stop word filter; how punctuation is handled depends on the tokenizer you choose.

> 3. Lemma and PoS tag (Part-of-Speech tag), or generally additional keys at the word level,
> i.e. for every word we also have its lemma value and its PoS.
> In dedicated systems, this is implemented either as a vertical format (each word on one line):
> word   lemma   pos
> ...
> or in newer systems with XML attributes:
> <w pos="Noun" lemma="tree">trees</w>

> It is important that it be possible to mix these various layers in one query, e.g.:
> "word(some) lemma(nice) pos(Noun)"

> This seems to me to be the biggest challenge for Solr.
I'm not 100% sure what you are trying to do here, sorry.
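But if the idea is simply to combine constraints from different annotation layers in one query, then indexing each layer as its own field (say "word", "lemma" and "pos"; made-up names) gets you part of the way, because Solr's fielded query syntax lets you mix them freely. Note, though, that such a query matches at the document level, not at the level of a single token position, so it may not be exactly what a dedicated corpus tool gives you. A rough sketch:

import urllib.request
import urllib.parse

# Hypothetical parallel fields "word", "lemma" and "pos"; this finds texts
# that contain all three somewhere, not necessarily on the same token.
params = urllib.parse.urlencode({
    "q": "word:some AND lemma:nice AND pos:Noun",
    "fl": "id,title",
})
url = "http://localhost:8983/solr/select?" + params
print(urllib.request.urlopen(url).read().decode("utf-8"))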

> 4. Indexing/searching ratio
> The corpus is very static: the selection of texts changes perhaps once a year (in a production environment), so it doesn't really matter how long the indexing takes. Regarding speed, the emphasis is on the searches, which have to be "fast" and exact, and the results have to be further processable (kwic-view,

Possible (though it cuts off searching the text for keywords after 50Kb. Actually, Lucene does that and it is configurable, but it can be annoying, so you might have to tweak it if you find that Solr doesn't return a kwic index for a hit; see the snippet at the end of this point. But maybe I'm not using Solr the right way ;-).)

> thinning the solution,

Possible.

> sorting,

Possible.

> export,

Not sure what you mean here, but Solr just returns an XML document that you can process any way you like.

> etc.). "Fast" is important also for more complex queries (ranges, boolean operators and prefixes mixed), and I'd say 10-15 seconds is the upper limit, which should rather be the exception to the rule of ~1 second.

> 5. Also to be considered: the size. We are talking about multiples of 100 million tokens; the canonical example, the British National Corpus, is 100 million, and there are corpora with 2 billion tokens.
That's a lot of text. I find Solr performs very well, but I can't guarantee that Solr will work in your case; other, more knowledgeable people might be able to, though.

Good luck with your decision making!

Kind regards,

Huib Verweij.
