Hi matej,
since I didn't see anyone answering your question yet, I'll have a go at
it. I'm not one of the Solr developers; I've just used it so far and
am very happy with it. I use it for searching literary texts, storing
information from a SQL database in the Solr documents as metadata for
the texts.
[EMAIL PROTECTED] wrote:
I am testing Solr as one of the potential tools for building a
linguistic corpus.
I'd like to have your opinion on the extent to which it would be a good choice.
The specifics, which I find deviate from the typical use of Solr, are:
1. The basic unit is a text (of a book, of a newspaper article, etc.),
with a bibliographic header.
Looking at the examples in the Solr tutorial and the central concept
of the "field",
I am a bit confused about how to map these onto one another, i.e. would
the whole text be one field "text"
and the bib-header items individual fields?
Yes, you could do that. What I did was: add the text as a whole in one
field, add each chapter in its own field, and add metadata fields from a
SQL database for each title (e.g. year=1966, author.name=Some one,
author.placeofbirth=Somewhere). Basically, everything you want to
explicitly search for/in, you put in a separate field.
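For illustration, a document posted to Solr's update handler could look
roughly like this (the field names other than the metadata ones above are
made up for the example; you define them all in schema.xml):

  <add>
    <doc>
      <!-- unique key for the document -->
      <field name="id">book-0001</field>
      <!-- the whole text in one field -->
      <field name="text">It was a dark and stormy night ...</field>
      <!-- one field per chapter, if you also want to search within chapters -->
      <field name="chapter_1">It was a dark and stormy night ...</field>
      <!-- bibliographic header items as separate metadata fields -->
      <field name="year">1966</field>
      <field name="author.name">Some one</field>
      <field name="author.placeofbirth">Somewhere</field>
    </doc>
  </add>

You post that to the update handler (e.g. http://localhost:8983/solr/update)
and follow it with a <commit/>.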
2. real full-text (no stop-words, really every word is indexed,
possibly even all the punctuation)
Shouldn't be a problem, I think.
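How much gets indexed is really determined by the analyzer you configure
for the field in schema.xml. As a rough sketch (the type name is made up),
a field type like this applies no stop-word filter and no stemming, so
every whitespace-separated token is kept; note that the whitespace
tokenizer leaves punctuation attached to the words, so if you want
punctuation as tokens of its own you would probably need a custom tokenizer:

  <fieldType name="text_verbatim" class="solr.TextField">
    <analyzer>
      <!-- split on whitespace only: no stop words, no stemming -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- optional; leave this out if case matters for your corpus -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="text" type="text_verbatim" indexed="true" stored="true"/>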
3. Lemma and PoS tag (part-of-speech tag), or generally additional keys
on the word level,
i.e. for every word we also have its lemma value and its PoS.
In dedicated systems, this is implemented either as a vertical format (each
word on one line):
word lemma pos
...
or in newer systems with XML attributes:
<w pos="Noun" lemma="tree">trees</w>
It is important that it is possible to mix these various layers in
one query, e.g.:
"word(some) lemma(nice) pos(Noun)"
This seems to me to be the biggest challenge for Solr.
I'm not 100% sure what you are trying to do here, sorry.
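If I understand it correctly, though, one possible mapping (just a guess on
my part, I haven't tried this myself) would be to index the word forms,
lemmas and PoS tags as parallel fields of the same document:

  <doc>
    <field name="text">trees grow</field>
    <field name="lemma">tree grow</field>
    <field name="pos">Noun Verb</field>
  </doc>

A query mixing the layers could then be something like
q=text:some AND lemma:nice AND pos:Noun, but that only says the document
contains those terms somewhere; it does not constrain them to the same (or
consecutive) word positions. For genuinely position-aware queries across the
layers I think you would need a custom analyzer that indexes each word, its
lemma and its PoS at the same token position in one field, which is not
something Solr gives you out of the box as far as I know.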
4. indexing/searching ratio
The corpus is very static: the selection of texts changes perhaps once a
year (in a production environment),
so it doesn't really matter how long the indexing takes. Regarding
speed, the emphasis is on the searches,
which have to be "fast" and exact, and the results have to be further
processable (kwic-view
possible (though it cuts off searching the text for keywords after 50Kb.
Actually, Lucene does that and it is configurable (see the example request
at the end of this point), but it can be annoying, so you might have to
hack around that if you find that Solr doesn't return a kwic-index for a
hit. But maybe I'm not using Solr the right way ;-). )
, thinning the solution,
possible
sorting,
possible
export,
Not sure what you mean here, but Solr just returns an XML document that
you can process any way you like (there's a sketch of a response below).
etc.). "Fast" is important also for more complex queries (ranges,
boolean operators and prefixes mixed)
and i say 10-15 seconds is the upper limit, which should be rather an
exception to the rule of ~ 1 second.
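To make the remarks about highlighting, sorting and the XML output a bit
more concrete, here is a rough sketch of a request and response (the field
names and values are invented, and the URL is wrapped for readability).
hl.maxAnalyzedChars is the parameter behind the ~50Kb highlighting cut-off
I mentioned; raise it if you need snippets from deeper in the text:

  http://localhost:8983/solr/select
      ?q=text:tree*+AND+year:[1960+TO+1970]+AND+NOT+text:forest
      &sort=year+asc
      &fl=id,year,author.name
      &hl=true&hl.fl=text&hl.maxAnalyzedChars=1000000

  <response>
    <lst name="responseHeader">...</lst>
    <result name="response" numFound="42" start="0">
      <doc>
        <str name="id">book-0001</str>
        <int name="year">1966</int>
        <str name="author.name">Some one</str>
      </doc>
      ...
    </result>
    <lst name="highlighting">
      <lst name="book-0001">
        <arr name="text">
          <str>... the old &lt;em&gt;trees&lt;/em&gt; by the river ...</str>
        </arr>
      </lst>
    </lst>
  </response>

The q parameter mixes a prefix (tree*), a range (year:[1960 TO 1970]) and
boolean operators, which is the kind of query you mention; whether that
stays around a second on hundreds of millions of tokens is something you
would have to benchmark.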
5. Also regarding the size: we are talking about multiples of 100
million tokens.
The canonical example, the British National Corpus, is 100 million; there
are corpora with 2 billion tokens.
That's a lot of text. I find Solr performs very well, but I can't
guarantee you that Solr will work in your case; other, more knowledgeable
people might be able to, though.
Good luck with your decision making!
Kind regards,
Huib Verweij.