RE: Tokenization and wild card search

johnmunir Tue, 19 Jan 2010 06:51:23 -0800


I want the following searches to work:
 
  MyField:SDD_Expedition_PCB
 
This should match only the word "SDD_Expedition_PCB", and not individual 
words such as "SDD", "Expedition", or "PCB".


And the following search:
 
  MyField:SDD_Expedition*
 
should match any word that starts with "SDD_Expedition" and ends with anything 
else, such as "SDD_Expedition_PBC", "SDD_Expedition_One", "SDD_Expedition_Two", 
"SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc., but not individual 
words such as "SDD" or "Expedition".
 

The field definition for "MyField" is (the actual field name is "Keywords"):
 
    <field name="Keywords" type="text" indexed="true" stored="false" required="false" multiValued="true"></field>
 
And here is the analyzer I'm using:
 
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
 
Any help on how I can achieve the above is greatly appreciated.
 
Btw, if at all possible, I would like to be able to achieve this search without 
having to change how I'm indexing / tokenizing the data.  I'm looking for 
search syntax to make this work.
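
For comparison, if a schema change did turn out to be acceptable, a minimal sketch of an extra field that keeps underscore-joined words intact might look like the following. The names "text_exact" and "Keywords_exact" are hypothetical and not part of the schema above, and using the field would mean re-indexing.

```xml
<!-- Hypothetical names: "text_exact" and "Keywords_exact" are not in the existing schema. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "SDD_Expedition_PCB" stays a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase for case-insensitive matching; no word-delimiter or stemming filters -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="Keywords_exact" type="text_exact" indexed="true" stored="false" multiValued="true"/>

<!-- Populate the extra field from the existing Keywords field at index time -->
<copyField source="Keywords" dest="Keywords_exact"/>
```

With a field like this, Keywords_exact:SDD_Expedition_PCB would be expected to match only the whole word. Wildcard terms are not passed through the analyzer in Solr 1.4, so the prefix would probably have to be lowercased by hand, as in Keywords_exact:sdd_expedition*.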
 
-- JM
 
-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Tuesday, January 19, 2010 7:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization and wild card search
 
> I have an issue and I'm not sure how to address it, so I
> hope someone can help me.
>  
> I have the following text in one of my fields:
> "ABC_Expedition_ERROR".  When I search on it
> like: "MyField:SDD_Expedition_PCB" (without quotes) it will
> fail to find me only this word "ABC_Expedition_ERROR"
> which I think is due to tokenization because of the
> underscore.
 
Do you or do you not want your query MyField:SDD_Expedition_PCB to return 
documents containing ABC_Expedition_ERROR?
 
> My solution is: "MyField:"SDD_Expedition_PCB"" (without the
> outer quotes, but quotes around the word
> "ABC_Expedition_ERROR").  This works fine.
> But then, how do I search on "SDD_Expedition_PCB" with wild
> card?  For example: "MyField:SDD_Expedition*" will not
> work.
 
Can you paste the field type of MyField, and give some examples of which queries 
should return which documents?
