Regarding the Lemma and PoS-tag requirement: you might handle this by inserting each word as its own document, with "lemma", "pos", and "word" fields, which gives you a lot of search flexibility. You could also include ID fields for the item and (if necessary) the part (chapter, etc.) and use these as facets, allowing you to group results by the items that contain them. Your application would then have to know how to use the item ID value to retrieve the full item-level record.
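Concretely, a word-level document posted to Solr's XML update handler might look like the sketch below. The field names ("word", "lemma", "pos", "item_id", etc.) are just illustrative and would have to be declared in your schema.xml, and the "position" field is an assumption you would only need for ordered/KWIC-style retrieval:

  <add>
    <doc>
      <!-- one document per token occurrence -->
      <field name="id">bk42_w1307</field>      <!-- unique key: item id + token offset -->
      <field name="word">trees</field>         <!-- surface form -->
      <field name="lemma">tree</field>
      <field name="pos">Noun</field>
      <field name="item_id">bk42</field>       <!-- facet on this to group hits by text -->
      <field name="part_id">bk42_ch3</field>   <!-- optional: chapter/section -->
      <field name="position">1307</field>      <!-- token offset within the item -->
    </doc>
  </add>

A query like lemma:tree AND pos:Noun then returns the matching word documents, and faceting on item_id groups the hits by the texts that contain them.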
These word-level records could live in a separate index or in the main index (Solr doesn't force every document to populate the same fields, so you can keep entirely different record structures in a single index; you just have to structure your queries accordingly). The catch is that because your word-level entries are separate from your item-level entries, you'll have to copy into the word-level entries any item-level fields that you want to be able to use in word-level queries (e.g. if you wanted to be able to limit a lemma search by date). The alternative would be to pack the lemma/pos/word entries into a multivalued string field and come up with more complex wildcard query structures to get at them (see the sketch at the very end of this message, below the quoted mail). Apparently you can now get queries with both leading and trailing wildcards to work, so you should be able to do everything you need, but I don't know what the performance will be like.

All the best,
Peter

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Saturday, May 12, 2007 11:28 AM
To: solr-user@lucene.apache.org
Subject: solr for corpus?

I am testing Solr as one of several potential tools for building a linguistic corpus, and I'd like your opinion on how good a fit it would be. The specifics that I find deviate from the typical use of Solr are:

1. The basic unit is a text (a book, a newspaper article, etc.) with a bibliographic header. Looking at the examples in the Solr tutorial and the central concept of the "field", I am a bit confused about how to map these onto one another: would the whole text be a single "text" field, with the bibliographic-header items as individual fields?

2. Real full text: no stop words, really every word is indexed, possibly even all the punctuation.

3. Lemma and PoS tag (part-of-speech tag), or more generally additional keys at the word level, i.e. for every word we also have its lemma value and its PoS. Dedicated systems implement this either as a vertical format (one word per line):

   word lemma pos ...

or, in newer systems, with XML attributes:

   <w pos="Noun" lemma="tree">trees</w>

It is important that these various layers can be mixed in one query, e.g.:

   word(some) lemma(nice) pos(Noun)

This seems to me to be the biggest challenge for Solr.

4. Indexing/searching ratio: the corpus is very static. The selection of texts changes perhaps once a year (in a production environment), so it doesn't really matter how long the indexing takes. Regarding speed, the emphasis is on the searches, which have to be fast and exact, and the results have to be further processable (KWIC view, thinning the result set, sorting, export, etc.). "Fast" is important also for more complex queries (ranges, boolean operators, and prefixes mixed); I'd say 10-15 seconds is the upper limit, and that should be the exception to a rule of roughly 1 second.

5. Also regarding size: we are talking about multiples of 100 million tokens. The canonical example, the British National Corpus, is 100 million tokens, and there are corpora with 2 billion tokens.

Thank you in advance.

Regards,
Matej
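[Sketch referenced above.] To make the multivalued-field alternative concrete (the field name "tokens" and the "|" delimiter are made up, and you may need to escape the delimiter depending on the query parser): each word could be indexed as a single combined string value in a multivalued field on the item-level document,

  tokens: trees|tree|Noun
  tokens: are|be|Verb
  tokens: nice|nice|Adj

and a lemma search then becomes a wildcard query such as

  tokens:*|tree|*

which matches any surface form whose lemma is "tree". As noted above, leading wildcards may need to be enabled explicitly in the query parser, and I don't know how they would perform at corpus scale.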