Re: strange results with query and hyphened words

Sascha Szott Mon, 31 May 2010 04:07:41 -0700

Sorry Markus, I mixed up the index and query field in analysis.jsp. In fact, I meant that a search for profiauskunft matches profi-auskunft.

I'm not sure whether the case you are dealing with (a search for profi-auskunft should match profiauskunft) is appropriately addressed by the WordDelimiterFilter. What about using the PatternReplaceCharFilter at query time to eliminate all intra-word hyphens?
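
A minimal, untested sketch of that idea for the query analyzer. The attribute names (pattern, replacement) are the ones the Solr reference guide lists for PatternReplaceCharFilterFactory; they are assumed here rather than checked against your release, so please compare them with the javadoc of the version you run. The regular expression itself is also only an assumption about what "intra-word hyphen" should mean: a hyphen with a letter directly before and after it.

```xml
<analyzer type="query">
  <!-- assumption: drop a hyphen only when it sits between two letters;
       the lookbehind's "<" has to be written as &lt; inside an attribute -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?&lt;=\p{L})-(?=\p{L})" replacement=""/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- remaining query-time filters unchanged -->
</analyzer>
```

With such a char filter the query text profi-auskunft would reach the tokenizer as profiauskunft, which is the joined form that catenateWords="1" produces at index time.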


-Sascha

Sascha Szott wrote:

Hi Markus,

the default-config for index is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>

and for query:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="0" catenateNumbers="0"
catenateAll="0"/>

That's not true. The default configuration for query-time processing is:

<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1"/>

By using this setting, a search for "profi-auskunft" will match
"profiauskunft".

It's important to note that WordDelimiterFilterFactory's catenate*
parameters should only be used in the index-time analysis stack.
Otherwise the strange behaviour you mentioned (a search for
profi-auskunft is translated into "profi followed by (auskunft or
profiauskunft)") will occur.
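
Applied to the text_de field type quoted further down, that advice could look like the sketch below. It only changes catenateWords in the query analyzer from 1 to 0; every other attribute is copied from the posted configuration. Treat it as an untested illustration, not a verified fix.

```xml
<!-- index analyzer: catenateWords="1" also indexes the joined form profiauskunft -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>

<!-- query analyzer: same filter, but with catenateWords="0" -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
```

With this split, a query for profi-auskunft becomes the plain phrase "profi auskunft". That phrase matches an indexed profi-auskunft, but not a document that only contains the single word profiauskunft.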

Best,
Sascha

-----Original Message-----
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Sunday, 30 May 2010 19:01
To: solr-user@lucene.apache.org
Subject: Re: strange results with query and hyphened words

Hi Markus,

I was facing the same problem a few days ago and found an explanation in
the mail archive that clarifies my question regarding the usage of
Solr's WordDelimiterFilterFactory:

http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with a hyphen doesn't match.

my search term is "profi-auskunft". in WordDelimiterFilterFactory i have
catenateWords, so my understanding is that profi-auskunft would search
for profiauskunft. when i use the analysis panel in solr admin i see that
profi-auskunft matches a term "profiauskunft".

the analyse will show

Query Analyzer
WhitespaceTokenizerFactory
profi-auskunft
SynonymFilterFactory
profi-auskunft
StopFilterFactory
profi-auskunft

WordDelimiterFilterFactory

term position      1       2
term text          profi   auskunft
                           profiauskunft
term type          word    word
                           word
source start,end   0,5     6,14
                           0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why are auskunft and profiauskunft in one column? how do they get
searched?

when i search "profiauskunft" i have 230 hits, when i now search for
"profi-auskunft" i get fewer hits. when i call the search with
debugQuery=on i see

body:"profi (auskunft profiauskunft)"

what does this query mean? profi and "auskunft or profiauskunft"?




<fieldType name="text_de" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- sg324: words that are separated by - and additional whitespace
are joined together. -->
<filter class="solr.HyphenatedWordsFilterFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms_de.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="de/stopwords_de.txt"
enablePositionIncrements="true"
/>
<!-- sg324 -->
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="German" protected="de/protwords_de.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="de/synonyms_de.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="de/stopwords_de.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="German" protected="de/protwords_de.txt"/>
</analyzer>
</fieldType>
