Regarding the Lemma and PoS-tag requirement: you might handle this by
inserting each word as its own document, with "lemma", "pos", and "word"
fields, thereby allowing you lots of search flexibility. You could also
include ID fields for the item and (if necessary) part (chapter etc.)
and use these as facets, allowing you to group results by the items that
contain them. Your application would have to know how to use the item ID
value to retrieve the full item-level record.
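
As a rough sketch of what such word-level records might look like, here is some Python that only builds the records and does not talk to Solr. The "item_id", "part_id" and "position" field names, the id scheme, and the helper name word_docs are assumptions made up for illustration; only "word", "lemma" and "pos" come from the suggestion above.

```python
import json


def word_docs(item_id, part_id, tokens):
    """Turn (word, lemma, pos) triples into one flat record per token."""
    docs = []
    for position, (word, lemma, pos) in enumerate(tokens):
        docs.append({
            # Hypothetical unique key: item, part and token offset.
            "id": f"{item_id}-{part_id}-{position}",
            "item_id": item_id,    # assumed field, usable as a facet
            "part_id": part_id,    # assumed field (chapter etc.)
            "position": position,  # assumed field, token offset within the part
            "word": word,
            "lemma": lemma,
            "pos": pos,
        })
    return docs


tokens = [("some", "some", "Det"), ("nice", "nice", "Adj"), ("trees", "tree", "Noun")]
docs = word_docs("book42", "ch1", tokens)
print(json.dumps(docs[2], indent=2))

# A fielded query over such records, in ordinary field:value syntax.
query = "lemma:tree AND pos:Noun"
```

The application would then take the "item_id" of each hit and look up the full item-level record itself.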

These word-level records could live in a separate index or in the main
index (since there are no required fields in Solr, you can have entirely
different record structures in a single index; you just have to
structure your queries accordingly). The problem will be that because
your word-level entries are separate from your item-level entries,
you'll have to include in the word-level entries any item-level fields
that you want to be able to use in word-level queries (e.g. if you
wanted to be able to limit a lemma search by date).  
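
To illustrate that duplication, here is a minimal Python sketch that copies selected item-level fields into a word-level record before indexing. The helper name denormalize, the "date" and "title" fields and the sample values are hypothetical; which fields to copy depends on what you want to filter on.

```python
def denormalize(word_doc, item_record, fields):
    """Return a copy of word_doc with the chosen item-level fields added."""
    merged = dict(word_doc)
    for name in fields:
        if name in item_record:
            merged[name] = item_record[name]
    return merged


# Hypothetical item-level record and word-level record.
item = {"id": "book42", "title": "An Example Book", "date": "1998-01-01T00:00:00Z"}
word = {"id": "book42-ch1-2", "item_id": "book42",
        "word": "trees", "lemma": "tree", "pos": "Noun"}

indexed = denormalize(word, item, ["date"])

# With the date copied in, a lemma search can be limited by a date range.
query = "lemma:tree AND date:[1990-01-01T00:00:00Z TO 1999-12-31T23:59:59Z]"
```

The cost is a larger index, since every copied field is repeated once per token.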

The alternative would be to insert the lemma/pos/word entries in a
multivalued string field and come up with more complex wildcard query
structures to get at them. Apparently you can now get queries with
leading and trailing wildcards to work, so you should be able to do
everything you need, but I don't know how the performance will be.
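
A sketch of how that encoding could work, in Python. The "|" separator and the helper names encode and pattern are my own assumptions, and the matching here is done locally with fnmatch purely to show the idea. It is not a Solr call, and you would need to check whether the separator needs escaping in real query syntax.

```python
from fnmatch import fnmatchcase

SEP = "|"  # assumed separator; must not occur inside words, lemmas or tags


def encode(word, lemma, pos):
    """Pack one token into a single string value for a multivalued field."""
    return SEP.join((word, lemma, pos))


def pattern(word="*", lemma="*", pos="*"):
    """Build a wildcard pattern; any layer left unspecified becomes '*'."""
    return SEP.join((word, lemma, pos))


tokens = [("some", "some", "Det"), ("nice", "nice", "Adj"), ("trees", "tree", "Noun")]
values = [encode(*t) for t in tokens]

# Mix layers in one pattern: any word form, lemma "tree", tagged as a noun.
wanted = pattern(lemma="tree", pos="Noun")
hits = [v for v in values if fnmatchcase(v, wanted)]
print(wanted, hits)
```

Note that a pattern with an unspecified word starts with a wildcard, which is exactly the leading-wildcard case whose performance I can't vouch for.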

All the best,

Peter

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 12, 2007 11:28 AM
To: solr-user@lucene.apache.org
Subject: solr for corpus?

i am testing solr as one of the potential tools for building a
linguistic corpus.

i'd like to have your opinion on to what extent it would be a good
choice.

the specifics, which i find deviate from the typical use of solr, are:

1. basic unit is a text (of a book, of a newspaper-article, etc.), with
a bibliographic header. looking at the examples of the solr-tutorial and
the central concept of the "field", i am a bit confused how to map these
onto one another, i.e. would the whole text be one field "text"
and the bibheader-items individual fields?

2. real full-text (no stop-words, really every word is indexed, possibly
even all the punctuation)

3. Lemma and PoS-tag (Part-of-Speech tag) or generally additional keys
on the word-level, i.e. for every word we also have its lemma-value and
its PoS.
In dedicated systems, this is implemented either in a vertical format
(one word per line):
word   lemma   pos
...
or in newer systems with xml-attributes:
<w pos="Noun" lemma="tree">trees</w>

It is important that these various layers can be mixed in
one query, e.g.:
"word(some) lemma(nice) pos(Noun)"

This seems to me to be the biggest challenge for solr.

4. indexing/searching-ratio
corpus is very static: the selection of texts changes perhaps once a
year (in production environment), so it doesn't really matter how long
the indexing takes. Regarding the speed the emphasis is on the searches,
which have to be "fast", exact and the results have to be further
processable (kwic-view, thinning the solution, sorting, export, etc.). 
"Fast" is important also for more complex queries (ranges, boolean
operators and prefixes mixed) and i say 10-15 seconds is the upper
limit, which should be rather an exception to the rule of ~ 1 second.

5. also regarding the size: we are talking about multiples of 100 million
tokens.
the canonical example, the British National Corpus, is 100 million; there
are corpora with 2 billion tokens



thank you in advance

regards
matej
