On 11/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
> I'm trying to do some sentence-level searching with Solr; basically, I
> want to find out whether two words are in the same sentence. As I read
> on the Lucene mailing list, there are many ways to do this, including
> but not limited to:
> - inserting special boundary terms to denote the start and end of a
> sentence. It is unclear to me what kind of query should be used to fetch
> results from within one sentence (something like: start_sentence_token
> word1 word2 end_sentence_token)?

Span queries... but there isn't really query parser support for them.

> - increasing the token position at a sentence boundary by a large amount
> (1000?) so that "x y"~500 (or more) won't match across sentence boundaries.

That's probably the easiest and simplest.

> Is there an existing filter class that I could use to do this, or should
> I first parse my text fields with PHP and some NLP tool, and index the
> result (for the first case)? For the second case (incrementing the token
> position), how should I do this within Solr?

Solr puts a configurable gap between values of the same field, so you
could index every sentence as a separate value of a multi-valued
field.
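
To make the effect of that gap concrete, here is a minimal Python sketch (a simulation, not Solr code). It assumes a gap of 1000 between values, which in a Solr schema would correspond to the positionIncrementGap attribute on the field type, and it models the sloppy phrase match only roughly: it ignores the extra cost Lucene charges for out-of-order terms. The names index_positions and within_slop are hypothetical helpers made up for illustration.

```python
import re

POSITION_GAP = 1000  # assumed gap between values of the multi-valued field


def index_positions(sentences, gap=POSITION_GAP):
    """Map each term to its token positions, one sentence per field value."""
    positions = {}
    pos = 0
    for i, sentence in enumerate(sentences):
        if i > 0:
            pos += gap  # jump ahead before the first token of the next value
        for term in re.findall(r"\w+", sentence.lower()):
            positions.setdefault(term, []).append(pos)
            pos += 1
    return positions


def within_slop(positions, a, b, slop):
    """Rough model of "a b"~slop: number of positions between the two terms."""
    return any(
        abs(pa - pb) - 1 <= slop
        for pa in positions.get(a, [])
        for pb in positions.get(b, [])
    )


doc = ["Solr builds on Lucene.", "Span queries are powerful."]
idx = index_positions(doc)
print(within_slop(idx, "solr", "lucene", 500))  # same sentence
print(within_slop(idx, "lucene", "span", 500))  # adjacent sentences
```

With a slop of 500 and a gap of 1000, the first check succeeds and the second fails, because the two terms in the second check sit on opposite sides of the gap.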
A better solution would be to have a tokenizer that could detect the
end of sentences and either insert a gap or a special token that
another filter could act on.
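
As a rough illustration of the "insert a gap" variant, the following Python sketch tokenizes a single text value and bumps the position whenever it has passed sentence-ending punctuation. It is not a Lucene or Solr tokenizer; the function name, the gap of 1000, and the naive rule that ".", "!" or "?" ends a sentence are all assumptions for the example (abbreviations such as "e.g." would be split incorrectly).

```python
import re

SENTENCE_GAP = 1000  # assumed position jump at each sentence boundary
SENTENCE_END = {".", "!", "?"}  # naive end-of-sentence markers


def tokenize_with_sentence_gaps(text, gap=SENTENCE_GAP):
    """Return (term, position) pairs, leaving a gap after each sentence."""
    tokens = []
    pos = 0
    pending_gap = False
    for match in re.finditer(r"\w+|[.!?]", text):
        piece = match.group()
        if piece in SENTENCE_END:
            # Remember the boundary; apply the gap only if more tokens follow.
            pending_gap = bool(tokens)
            continue
        if pending_gap:
            pos += gap
            pending_gap = False
        tokens.append((piece.lower(), pos))
        pos += 1
    return tokens


print(tokenize_with_sentence_gaps("Cats purr. Dogs bark!"))
# [('cats', 0), ('purr', 1), ('dogs', 1002), ('bark', 1003)]
```

A phrase query with a slop well below the gap, such as "x y"~500, would then be unable to pair a term from one sentence with a term from the next.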
-Yonik