Another thing that might be handy here is Token's often-forgotten type attribute. The current default is:

  String type = "word";            // lexical type

But you can set it via the constructor, which is something you would do from the custom Analyzer/Tokenizer that Hoss is describing:

  /** Constructs a Token with the given text, start and end offsets, & type. */
  public Token(String text, int start, int end, String typ) {
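For example, a filter along these lines (just a rough sketch against the Lucene 2.x Token API; the class name and type strings are invented for illustration) could stamp each token with a type based on the underscore convention discussed below:

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  import java.io.IOException;

  /**
   * Sketch: re-types tokens produced by the kind of analyzer Hoss describes,
   * based on their prefix ("__lemma" vs. "_POS" vs. plain word).
   */
  public class TypeByPrefixFilter extends TokenFilter {
    public TypeByPrefixFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      String text = t.termText();
      String type = text.startsWith("__") ? "lemma"
                  : text.startsWith("_")  ? "pos"
                  : "word";                       // the default lexical type
      return new Token(text, t.startOffset(), t.endOffset(), type);
    }
  }

The type doesn't affect matching by itself, but later TokenFilters (or anything else reading the TokenStream) can key off it.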
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lucene Consulting -- http://lucene-consulting.com/

----- Original Message ----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, May 14, 2007 6:40:32 PM
Subject: Re: solr for corpus?

: 3. Lemma and PoS tag (part-of-speech tag), or generally additional keys
: on the word level,
: i.e. for every word we also have its lemma value and its PoS,
: or in newer systems with XML attributes:
: <w pos="Noun" lemma="tree">trees</w>
:
: It is important that it be possible to mix these various layers in
: one query, e.g.:
: "word(some) lemma(nice) pos(Noun)"

The best way to approach this would probably be to preprocess the data and use a custom analyzer ... send it to Solr with all of the info encoded in each word (i.e.: trees__tree_Noun) and then have a custom indexing analyzer create multiple tokens in each position, with an easy way to distinguish whether a token is a word, the lemma for a word, or the POS for a word (i.e.: the regular word plain, the lemma prefixed by two underscores, and the POS prefixed by a single underscore).

Then at query time, if you know you are looking for the phrase "some nice trees" you would search for "some nice trees", but if you are looking for the word "some" followed by a word whose lemma is "nice" followed by any Noun, you would search for "some __nice _Noun".

: This seems to me to be the biggest challenge for Solr.

Yeah ... neither Solr nor Lucene really attempts to tackle complex query forms like this ... but Lucene has recently added a Token Payload mechanism in an attempt to make queries like this easier (allowing annotation of the actual terms that can be queried, instead of needing to create artificial terms in identical positions).

: The corpus is very static: the selection of texts changes perhaps once a
: year (in a production environment),
: so it doesn't really matter how long the indexing takes. Regarding
: speed, the emphasis is on the searches,
: which have to be "fast" and exact, and the results have to be further
: processable (KWIC view, thinning the result set, sorting, export, etc.).
: "Fast" is important also for more complex queries (ranges, boolean
: operators, and prefixes mixed).

These things should all be decent, especially since your index will be fairly static, so you don't have to worry about 'warming' FieldCaches for sorting, etc.

Something you might want to consider, if you find query speeds unacceptable on your full corpus with stop words left in, would be to sacrifice disk for speed by creating another field where the stop words are removed, and using it as much as possible (i.e.: any time a query doesn't care about stop words) ... but I wouldn't worry about that unless you find it's actually a problem. I've yet to see a complaint from anyone that Solr isn't fast enough unless they are doing heavy faceting, or updating their index so frequently that the caches can't be used.

-Hoss
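To make the encoding scheme Hoss describes above a bit more concrete, here is a rough sketch (class name and details invented, written against the Lucene 2.x TokenStream API Solr used at the time) of an index-time filter that expands a token encoded as word__lemma_POS, e.g. trees__tree_Noun, into stacked tokens at a single position:

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  import java.io.IOException;
  import java.util.LinkedList;

  /**
   * Sketch: expands tokens encoded as word__lemma_POS (e.g. trees__tree_Noun)
   * into three tokens at the same position:
   *   trees   (the plain word)
   *   __tree  (lemma, kept with its two-underscore prefix)
   *   _Noun   (POS, kept with its one-underscore prefix)
   */
  public class LemmaPosExpandingFilter extends TokenFilter {
    private final LinkedList<Token> pending = new LinkedList<Token>();

    public LemmaPosExpandingFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      if (!pending.isEmpty()) {
        return pending.removeFirst();
      }
      Token t = input.next();
      if (t == null) return null;

      String text = t.termText();
      int sep = text.indexOf("__");
      if (sep < 0) return t;                      // not encoded; pass through

      String word = text.substring(0, sep);
      String rest = text.substring(sep + 2);      // "lemma_POS"
      int us = rest.lastIndexOf('_');
      String lemma = us < 0 ? rest : rest.substring(0, us);
      String pos = us < 0 ? null : rest.substring(us + 1);

      Token wordTok = new Token(word, t.startOffset(), t.endOffset(), "word");

      Token lemmaTok = new Token("__" + lemma, t.startOffset(), t.endOffset(), "lemma");
      lemmaTok.setPositionIncrement(0);           // same position as the word
      pending.add(lemmaTok);

      if (pos != null) {
        Token posTok = new Token("_" + pos, t.startOffset(), t.endOffset(), "pos");
        posTok.setPositionIncrement(0);
        pending.add(posTok);
      }
      return wordTok;
    }
  }

Because the lemma and POS tokens get a position increment of 0, a phrase query like "some __nice _Noun" lines up word, lemma, and POS tokens in consecutive positions, which is exactly the kind of query Hoss sketches; on the query side you would want an analyzer that leaves the underscore prefixes alone.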
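As for the Token Payload mechanism Hoss mentions, the idea is to hang the annotation off the word's own term instead of indexing artificial __lemma / _POS terms. A tiny sketch, using the Payload API roughly as it appeared around Lucene 2.2 (treat the details as illustrative):

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.index.Payload;

  public class PayloadSketch {
    /** Sketch: carry the POS tag on the word token itself as a payload. */
    static Token wordWithPosPayload(String word, int start, int end, String pos) {
      Token t = new Token(word, start, end, "word");
      // e.g. "Noun" as raw bytes; a real implementation would pick an explicit encoding
      t.setPayload(new Payload(pos.getBytes()));
      return t;
    }
  }

Query-side support for payloads was still very new at that point, which is why Hoss frames it as "an attempt to make queries like this easier" rather than a ready-made solution.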