Another thing that might be handy here is Token's often-forgotten type attribute. The current default is:

  String type = "word";            // lexical type

But you can set it via the constructor, which is something you would do from the custom Analyzer/Tokenizer that Hoss is describing:

  /** Constructs a Token with the given text, start and end offsets, & type. */
  public Token(String text, int start, int end, String typ) {
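For example, a filter along these lines (just a rough sketch against the Lucene 2.x Token API; the class name and type strings are invented for illustration) could stamp each token with a type based on the underscore convention discussed below:

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  import java.io.IOException;

  /**
   * Sketch: re-types tokens produced by the kind of analyzer Hoss describes,
   * based on their prefix ("__lemma" vs. "_POS" vs. plain word).
   */
  public class TypeByPrefixFilter extends TokenFilter {
    public TypeByPrefixFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      String text = t.termText();
      String type = text.startsWith("__") ? "lemma"
                  : text.startsWith("_")  ? "pos"
                  : "word";                       // the default lexical type
      return new Token(text, t.startOffset(), t.endOffset(), type);
    }
  }

The type doesn't affect matching by itself, but later TokenFilters (or anything else reading the TokenStream) can key off it.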
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lucene Consulting -- http://lucene-consulting.com/

----- Original Message ----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, May 14, 2007 6:40:32 PM
Subject: Re: solr for corpus?

: 3. Lemma and PoS tag (part-of-speech tag), or generally additional keys
: on the word level,
: i.e. for every word we also have its lemma value and its PoS,
: or in newer systems with XML attributes:
: <w pos="Noun" lemma="tree">trees</w>
:
: It is important that it be possible to mix these various layers in
: one query, e.g.:
: "word(some) lemma(nice) pos(Noun)"

The best way to approach this would probably be to preprocess the data and use a custom analyzer ... send it to Solr with all of the info encoded in each word (i.e.: trees__tree_Noun) and then have a custom indexing analyzer create multiple tokens in each position, with an easy way to distinguish whether a token is a word, the lemma for a word, or the POS for a word (i.e.: the regular word plain, the lemma prefixed by two underscores, and the POS prefixed by a single underscore).

Then at query time, if you know you are looking for the phrase "some nice trees" you would search for "some nice trees", but if you are looking for the word "some" followed by a word whose lemma is "nice" followed by any Noun, you would search for "some __nice _Noun".

: This seems to me to be the biggest challenge for Solr.

Yeah ... neither Solr nor Lucene really attempts to tackle complex query forms like this ... but Lucene has recently added a Token Payload mechanism in an attempt to make queries like this easier (allowing annotation of the actual terms that can be queried, instead of needing to create artificial terms in identical positions).

: The corpus is very static: the selection of texts changes perhaps once a
: year (in a production environment),
: so it doesn't really matter how long the indexing takes. Regarding
: speed, the emphasis is on the searches,
: which have to be "fast" and exact, and the results have to be further
: processable (KWIC view, thinning the result set, sorting, export, etc.).
: "Fast" is important also for more complex queries (ranges, boolean
: operators, and prefixes mixed).

These things should all be decent, especially since your index will be fairly static, so you don't have to worry about 'warming' FieldCaches for sorting, etc.

Something you might want to consider, if you find query speeds unacceptable on your full corpus with stop words left in, would be to sacrifice disk for speed by creating another field where the stop words are removed, and using it as much as possible (i.e.: any time a query doesn't care about stop words) ... but I wouldn't worry about that unless you find it's actually a problem. I've yet to see a complaint from anyone that Solr isn't fast enough unless they are doing heavy faceting, or updating their index so frequently that the caches can't be used.

-Hoss
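To make the encoding scheme Hoss describes above a bit more concrete, here is a rough sketch (class name and details invented, written against the Lucene 2.x TokenStream API Solr used at the time) of an index-time filter that expands a token encoded as word__lemma_POS, e.g. trees__tree_Noun, into stacked tokens at a single position:

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  import java.io.IOException;
  import java.util.LinkedList;

  /**
   * Sketch: expands tokens encoded as word__lemma_POS (e.g. trees__tree_Noun)
   * into three tokens at the same position:
   *   trees   (the plain word)
   *   __tree  (lemma, kept with its two-underscore prefix)
   *   _Noun   (POS, kept with its one-underscore prefix)
   */
  public class LemmaPosExpandingFilter extends TokenFilter {
    private final LinkedList<Token> pending = new LinkedList<Token>();

    public LemmaPosExpandingFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      if (!pending.isEmpty()) {
        return pending.removeFirst();
      }
      Token t = input.next();
      if (t == null) return null;

      String text = t.termText();
      int sep = text.indexOf("__");
      if (sep < 0) return t;                      // not encoded; pass through

      String word = text.substring(0, sep);
      String rest = text.substring(sep + 2);      // "lemma_POS"
      int us = rest.lastIndexOf('_');
      String lemma = us < 0 ? rest : rest.substring(0, us);
      String pos = us < 0 ? null : rest.substring(us + 1);

      Token wordTok = new Token(word, t.startOffset(), t.endOffset(), "word");

      Token lemmaTok = new Token("__" + lemma, t.startOffset(), t.endOffset(), "lemma");
      lemmaTok.setPositionIncrement(0);           // same position as the word
      pending.add(lemmaTok);

      if (pos != null) {
        Token posTok = new Token("_" + pos, t.startOffset(), t.endOffset(), "pos");
        posTok.setPositionIncrement(0);
        pending.add(posTok);
      }
      return wordTok;
    }
  }

Because the lemma and POS tokens get a position increment of 0, a phrase query like "some __nice _Noun" lines up word, lemma, and POS tokens in consecutive positions, which is exactly the kind of query Hoss sketches; on the query side you would want an analyzer that leaves the underscore prefixes alone.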
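As for the Token Payload mechanism Hoss mentions, the idea is to hang the annotation off the word's own term instead of indexing artificial __lemma / _POS terms. A tiny sketch, using the Payload API roughly as it appeared around Lucene 2.2 (treat the details as illustrative):

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.index.Payload;

  public class PayloadSketch {
    /** Sketch: carry the POS tag on the word token itself as a payload. */
    static Token wordWithPosPayload(String word, int start, int end, String pos) {
      Token t = new Token(word, start, end, "word");
      // e.g. "Noun" as raw bytes; a real implementation would pick an explicit encoding
      t.setPayload(new Payload(pos.getBytes()));
      return t;
    }
  }

Query-side support for payloads was still very new at that point, which is why Hoss frames it as "an attempt to make queries like this easier" rather than a ready-made solution.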