Hi matej,
since I didn't see anyone answering your question yet, I'll have a go at
it. I'm not one of the Solr developers; I've just used it so far and
am very happy with it. I use it for searching literary texts, storing
information from a SQL database in the Solr documents as metadata for
the texts.
[EMAIL PROTECTED] wrote:
I am testing Solr as one of the potential tools for building a
linguistic corpus.
I'd like to have your opinion on the extent to which it would be a good choice.
The specifics, which I find deviate from the typical use of Solr, are:
1. The basic unit is a text (of a book, of a newspaper article, etc.),
with a bibliographic header.
Looking at the examples in the Solr tutorial and the central concept
of the "field",
I am a bit confused about how to map these onto one another, i.e. would
the whole text be one field "text"
and the bib-header items individual fields?
Yes, you could do that. What I did was: add the text as a whole in one
field, add each chapter in its own field, and add metadata fields from a
SQL database for each title (e.g. year=1966, author.name=Some one,
author.placeofbirth=Somewhere). Basically, everything you want to
explicitly search for/in, you put in a separate field.
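For illustration, a document posted to Solr's update handler could look
roughly like this (the field names other than the metadata ones above are
made up for the example; you define them all in schema.xml):

  <add>
    <doc>
      <!-- unique key for the document -->
      <field name="id">book-0001</field>
      <!-- the whole text in one field -->
      <field name="text">It was a dark and stormy night ...</field>
      <!-- one field per chapter, if you also want to search within chapters -->
      <field name="chapter_1">It was a dark and stormy night ...</field>
      <!-- bibliographic header items as separate metadata fields -->
      <field name="year">1966</field>
      <field name="author.name">Some one</field>
      <field name="author.placeofbirth">Somewhere</field>
    </doc>
  </add>

You post that to the update handler (e.g. http://localhost:8983/solr/update)
and follow it with a <commit/>.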
2. real full-text (no stop-words, really every word is indexed,
possibly even all the punctuation)
Shouldn't be a problem, I think.
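How much gets indexed is really determined by the analyzer you configure
for the field in schema.xml. As a rough sketch (the type name is made up),
a field type like this applies no stop-word filter and no stemming, so
every whitespace-separated token is kept; note that the whitespace
tokenizer leaves punctuation attached to the words, so if you want
punctuation as tokens of its own you would probably need a custom tokenizer:

  <fieldType name="text_verbatim" class="solr.TextField">
    <analyzer>
      <!-- split on whitespace only: no stop words, no stemming -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- optional; leave this out if case matters for your corpus -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="text" type="text_verbatim" indexed="true" stored="true"/>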
3. Lemma and PoS tag (part-of-speech tag), or generally additional keys
on the word level,
i.e. for every word we also have its lemma value and its PoS.
In dedicated systems, this is implemented either as a vertical format (each
word on one line):
word lemma pos
...
or in newer systems with XML attributes:
<w pos="Noun" lemma="tree">trees</w>
It is important that it is possible to mix these various layers in
one query, e.g.:
"word(some) lemma(nice) pos(Noun)"
This seems to me to be the biggest challenge for Solr.
I'm not 100% sure what you are trying to do here, sorry.
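If I understand it correctly, though, one possible mapping (just a guess on
my part, I haven't tried this myself) would be to index the word forms,
lemmas and PoS tags as parallel fields of the same document:

  <doc>
    <field name="text">trees grow</field>
    <field name="lemma">tree grow</field>
    <field name="pos">Noun Verb</field>
  </doc>

A query mixing the layers could then be something like
q=text:some AND lemma:nice AND pos:Noun, but that only says the document
contains those terms somewhere; it does not constrain them to the same (or
consecutive) word positions. For genuinely position-aware queries across the
layers I think you would need a custom analyzer that indexes each word, its
lemma and its PoS at the same token position in one field, which is not
something Solr gives you out of the box as far as I know.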
4. indexing/searching ratio
The corpus is very static: the selection of texts changes perhaps once a
year (in a production environment),
so it doesn't really matter how long the indexing takes. Regarding
speed, the emphasis is on the searches,
which have to be "fast" and exact, and the results have to be further
processable (kwic-view
possible (though it cuts off searching the text for keywords after 50Kb.
Actually, Lucene does that and it is configurable (see the example request
at the end of this point), but it can be annoying, so you might have to
hack around that if you find that Solr doesn't return a kwic-index for a
hit. But maybe I'm not using Solr the right way ;-). )
, thinning the solution,
possible
sorting,
possible
export,
Not sure what you mean here, but Solr just returns an XML document that
you can process any way you like (there's a sketch of a response below).
etc.). "Fast" is important also for more complex queries (ranges,
boolean operators and prefixes mixed)
and i say 10-15 seconds is the upper limit, which should be rather an
exception to the rule of ~ 1 second.
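To make the remarks about highlighting, sorting and the XML output a bit
more concrete, here is a rough sketch of a request and response (the field
names and values are invented, and the URL is wrapped for readability).
hl.maxAnalyzedChars is the parameter behind the ~50Kb highlighting cut-off
I mentioned; raise it if you need snippets from deeper in the text:

  http://localhost:8983/solr/select
      ?q=text:tree*+AND+year:[1960+TO+1970]+AND+NOT+text:forest
      &sort=year+asc
      &fl=id,year,author.name
      &hl=true&hl.fl=text&hl.maxAnalyzedChars=1000000

  <response>
    <lst name="responseHeader">...</lst>
    <result name="response" numFound="42" start="0">
      <doc>
        <str name="id">book-0001</str>
        <int name="year">1966</int>
        <str name="author.name">Some one</str>
      </doc>
      ...
    </result>
    <lst name="highlighting">
      <lst name="book-0001">
        <arr name="text">
          <str>... the old &lt;em&gt;trees&lt;/em&gt; by the river ...</str>
        </arr>
      </lst>
    </lst>
  </response>

The q parameter mixes a prefix (tree*), a range (year:[1960 TO 1970]) and
boolean operators, which is the kind of query you mention; whether that
stays around a second on hundreds of millions of tokens is something you
would have to benchmark.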
5. Also regarding the size: we are talking about multiples of 100
million tokens.
The canonical example, the British National Corpus, is 100 million; there
are corpora with 2 billion tokens.
That's a lot of text. I find Solr performs very well, but I can't
guarantee you that Solr will work in your case; other, more knowledgeable
people might be able to, though.
Good luck with your decision making!
Kind regards,
Huib Verweij.