Re: Problem with german hyphenated words not being found

Thomas Michael Engelke Thu, 11 Jun 2015 06:14:41 -0700

Thank you for your input. Here's how the query looks with
debugQuery=true:

"rawquerystring": "name:industrie-anhänger",
"querystring": "name:industrie-anhänger",
"parsedquery": "MultiPhraseQuery(name:"(industrie-anhang industri) (anhang industrieanhang)")",
"parsedquery_toString": "name:"(industrie-anhang industri) (anhang industrieanhang)"",

It looks like some grouping rules are applied, expressed by the
parentheses. What's the correct interpretation of that? The default
operator is OR, yet it looks like the terms inside the parentheses are
grouped using AND.
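
For readers of this thread: in Lucene's MultiPhraseQuery, each
parenthesized group lists alternative terms for one position (any one of
them may match), and the groups have to occur at consecutive positions,
like the words of a phrase. The phrase form presumably comes from
autoGeneratePhraseQueries="true" in the field type. The sketch below is
a simplified model of that matching, not Solr code. The function name
multi_phrase_matches and the token positions are assumptions made for
illustration; in particular, it assumes that the index-time compound
filter stores the subwords at the same position as the full word, which
the analysis screen would have to confirm.

```python
def multi_phrase_matches(doc_positions, phrase):
    """Return True if the document satisfies a zero-slop multi-phrase query.

    doc_positions: dict mapping each indexed term to the set of positions it occupies.
    phrase: list of sets; each set holds the alternative terms for one query position.
    """
    if not phrase:
        return False
    # Every position of any first-group term is a candidate phrase start.
    starts = set()
    for term in phrase[0]:
        starts |= doc_positions.get(term, set())
    for start in starts:
        # Group i must have at least one of its terms at position start + i.
        if all(
            any(start + offset in doc_positions.get(term, set()) for term in group)
            for offset, group in enumerate(phrase)
        ):
            return True
    return False


# Query from the debug output: two groups, hence two consecutive positions.
query = [{"industrie-anhang", "industri"}, {"anhang", "industrieanhang"}]

# ASSUMED index tokens for "Industrieanhänger": all three at one position.
indexed = {"industrieanhang": {0}, "industri": {0}, "anhang": {0}}

# Hypothetical document that had been indexed as two separate words.
two_words = {"industri": {0}, "anhang": {1}}

print(multi_phrase_matches(indexed, query))    # False
print(multi_phrase_matches(two_words, query))  # True
```

Under these assumptions the hyphenated query cannot match, because
nothing is indexed at the position following "industri". A query that
analyses to a single position, such as the unhyphenated word, would
still match.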

On 11.06.2015 at 12:40, Upayavira wrote:

> The next thing to do is add debugQuery=true to your URL (or enable it in
> the query pane of the admin UI). Then look for the parsed query info.
> 
> On the standard text_en field which includes an English stop word
> filter, I ran a query on "Jack and Jill's House" which showed
> this output:
> 
> "rawquerystring": "text_en:(Jack and Jill's House)",
> "querystring": "text_en:(Jack and Jill's House)",
> "parsedquery": "text_en:jack text_en:jill text_en:hous",
> "parsedquery_toString": "text_en:jack text_en:jill text_en:hous",
> 
> You can see that the parsed query is formed *after* analysis, so you can
> see exactly what is being queried for.
> 
> Also, as a corollary to this, you can use the schema browser (or
> faceting for that matter) to view what terms are being indexed, to see
> if they should match.
> 
> HTH
> 
> Upayavira
> 
> On 11.06.2015 at 12:00, Upayavira wrote:
>
>> Have you used the analysis tab in the admin UI? You can type in
>> sentences for both index and query time and see how they would be
>> analysed by various fields/field types.
>>
>> Once you have got index time and query time to result in the same tokens
>> at the end of the analysis chain, you should start seeing matches in
>> your queries.
>>
>> Upayavira

>> On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:
>>
>>> Hey,
>>>
>>> in German, you can string most nouns together by using hyphens, like
>>> this:
>>>
>>> Industrie = industry
>>> Anhänger = trailer
>>> Industrie-Anhänger = trailer for industrial use
>>>
>>> Here [1], you can see me querying "Industrieanhänger" from the "name"
>>> field (name:Industrieanhänger), to make sure the index actually
>>> contains the word. Our data is structured so that products are listed
>>> without the hyphen. Now, customers can come around and use the
>>> hyphenated version as a search term (i.e. "industrie-anhänger"), and
>>> of course we want them to find what they are looking for.
>>>
>>> I've set it up so that the WordDelimiterFilterFactory uses
>>> catenateWords="1", so that these words are catenated. An analysis of
>>> "Industrieanhänger" as index and "industrie-anhänger" as query can be
>>> seen here [2]. You can see that both word parts are found. However,
>>> querying for "industrie-anhänger" does not yield results, only when
>>> the hyphen is removed, as you can see here [3]. I'm not sure how to
>>> proceed from here, as the results of the analysis have so far always
>>> lined up with what I could see when querying.
>>>
>>> Here's the schema definition for "text", the field type for the
>>> "name" field:
>>>
>>> <fieldType name="text" class="solr.TextField"
>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             splitOnCaseChange="1" splitOnNumerics="1"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             catenateWords="1" catenateNumbers="0" catenateAll="0"
>>>             preserveOriginal="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>>>             dictionary="dictionary.txt" minWordSize="5"
>>>             minSubwordSize="3" maxSubwordSize="30"
>>>             onlyLongestMatch="false"/>
>>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>>             ignoreCase="true" enablePositionIncrements="true"
>>>             format="snowball"/>
>>>     <filter class="solr.GermanNormalizationFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>             language="German2" protected="protwords.txt"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             splitOnCaseChange="1" splitOnNumerics="1"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             catenateWords="1" catenateNumbers="0" catenateAll="0"
>>>             preserveOriginal="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>>>             dictionary="dictionary.txt" minWordSize="5"
>>>             minSubwordSize="3" maxSubwordSize="30"
>>>             onlyLongestMatch="false"/> -->
>>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>>             ignoreCase="true" enablePositionIncrements="true"
>>>             format="snowball"/>
>>>     <filter class="solr.GermanNormalizationFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>             language="German2" protected="protwords.txt"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> I've also thought it might be a problem with URL encoding not
>>> encoding the hyphen, but replacing it with %2D didn't change the
>>> outcome (and was probably wrong anyway).
>>>
>>> Any help is greatly appreciated.
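
On the URL-encoding question raised in the original message: the hyphen
is an unreserved character in URLs, so "-" and "%2D" should decode to
the same query text, which would be consistent with the unchanged
result. A small check with Python's standard urllib.parse (the variable
names are only for illustration):

```python
from urllib.parse import quote, unquote

term = "industrie-anhänger"

# quote() leaves the hyphen alone and percent-encodes only the umlaut (as UTF-8).
encoded = quote(term)
print(encoded)  # industrie-anh%C3%A4nger

# An explicitly encoded hyphen decodes back to the same string.
print(unquote("industrie%2Danh%C3%A4nger") == term)  # True
```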

Links:
------
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
