Re: Strange behavior on text field with number-text content

Michał Matulka Tue, 28 May 2013 05:54:27 -0700

Thanks for your responses, I must admit that after hours of trying I made some mistakes.
So the most problematic phrase will now be:
"4nSolution Inc." which cannot be found using query:

name:4nSolution

or even

name:4nSolution Inc.

but can be using following queries:

name:nSolution
name:4
name:inc

Sorry for the mess, it turned out I didn't reindex fields after modyfying schema so I thought that the problem also applies to 300letters .

The cause of all of this is the WordDelimiter filter defined as following:

<fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        
        
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>

and I still don't know why it behaves like that - after all there is "preserveOriginal" attribute set to 1...

W dniu 28.05.2013 14:21, Erick Erickson pisze:

Hmmm, with 4.x I get much different behavior than you're
describing, what version of Solr are you using?


Besides Alex's comments, try adding &debug=query to the url and see what comes
out from the query parser.

A quick glance at the code shows that DefaultAnalyzer is used, which doesn't do
any analysis, here's the javadoc...
 /**
   * Default analyzer for types that only produces 1 verbatim token...
   * A maximum size of chars to be read must be specified
   */

so it's much like the "string" type. Which means I'm totally perplexed by your
statement that 300 and letters return a hit. Have you perhaps changed the
field definition and not re-indexed?

The behavior you're seeing really looks like somehow WordDelimiterFilterFactory
is getting into your analysis chain with settings that don't mash the parts back
together, i.e. you can set up WDDF to split on letter/number transitions, index
each and NOT index the original, but I have no explanation for how that
could happen with the field definition you indicated....

FWIW,
Erick

On Tue, May 28, 2013 at 7:47 AM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:

 What does analyzer screen say in the Web AdminUI when you try to do that?
Also, what are the tokens stored in the field (also in Web AdminUI).

I think it is very strange to have TextField without a tokenizer chain.
Maybe you get a standard one assigned by default, but I don't know what the
standard chain would be.

Regards,

  Alex.
On 28 May 2013 04:44, "Michał Matulka" <michal.matu...@gowork.pl> wrote:

Hello,

I've got following problem. I have a text type in my schema and a field
"name" of that type.
That field contains a data, there is, for example, record that has
"300letters" as name.

Now field type definition:
<fieldType name="text" class="solr.TextField"></**fieldType>

And, of course, field definition:
<fieldname="name"type="text"**indexed="true"stored="true"/>

yes, that's all - there are no tokenizers.

And now time for my question:

Why following queries:

name:300

and

name:letters

are returning that result, but:

name:300letters

is not (0 results)?

Best regards,
Michał Matulka

Pozdrawiam,
Michał Matulka

Programista

michal.matu...@gowork.pl

ul. Zielna 39

00-108 Warszawa

Re: Strange behavior on text field with number-text content

Reply via email to