You can gain a lot of insight into this kind of thing with the
admin/analysis page. Often the issue is that your tokenizing/
filtering isn't doing quite what you think. Try turning on the
debug checkboxes on that page and seeing what tokens are
generated at index time and at query time.

In particular, WordDelimiterFilterFactory is often a surprise in how
it splits and recombines tokens. Including synonyms is another
potential issue, as is the stemming done by EnglishPorterFilterFactory.
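
As a rough sketch only, here is a stripped-down field type you could
compare against on the analysis page. It assumes you want phrase matches
on whitespace-separated tokens, ignoring case. The name "text_exact" is
made up for illustration and is not in your schema; the tokenizer and
filter classes are the same ones you already use.

```xml
<!-- Hypothetical field type; "text_exact" is an illustrative name. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only; punctuation stays attached to its token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case-insensitive matching; remove this for case-sensitive matching. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With no word-delimiter, stemming, or synonym filters, the indexed tokens
should stay close to the original text, and runs of whitespace only act
as token boundaries. Query-parser special characters such as [ ] and \
would most likely still need escaping on the query side. If you need the
whole field value to match verbatim, a solr.StrField field may be the
simpler option.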

If that's not helpful, could you paste some examples that you expect
to match that don't?

Best
Erick

On Thu, Dec 30, 2010 at 8:04 PM, Scott Gonyea <sc...@aitrus.org> wrote:

> Hi,
>
> I am trying to make sure that when I search for text—regardless of
> what that text is—that I get an exact match.  I'm *still* getting some
> issues, and this last mile is becoming very painful.  The Solr field
> type that I'm setting this up on is pasted below my explanation.  I
> appreciate any help.
>
> Explanation:
>
> I'm crawling websites with Nutch.  I'm performing some
> mechanical-turk-like filtering and term matching.  The problem is,
> there's some very gnarly behavior in Solr due to any number of
> gotchas.
>
> If I want to find *all* Solr documents that match
> "[id]somejunk\hi[/id]" then life is instantly hell.
>
> Likewise, lots of whitespace in between words throws it off " john
> says hello,  how are you?"  I would love to be able to search for
> these exact phrases.  If that's just not practical (I'm more than
> willing to live with a bloated search index), what would some other
> strategies be?
>
> There's no MapReduce in Solr; I could attempt to do Hadoop-streaming,
> but that's not very ideal for a variety of reasons.
>
>
> Solr Schema.xml, fieldType "text" (no, this is not used everywhere;
> only on 2 fields):
>
>
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>        <filter class="solr.SynonymFilterFactory"
>                synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
>
> Thank you,
> Scott Gonyea
>
