Re: Tokenization and wild card search

johnmunir Tue, 19 Jan 2010 08:12:37 -0800


You are correct, the way I'm using tokenization is my issue.  It's too late to 
re-index now, which is why I'm looking for a search syntax that will make the 
search work.
 
I have tried various search syntax with no luck.  Is there no search syntax to 
make this work without re-indexing?!
 
-- JM



-----Original Message-----
From: Erick Erickson <[email protected]>
To: [email protected]
Sent: Tue, Jan 19, 2010 10:30 am
Subject: Re: Tokenization and wild card search


I'm pretty sure you're going to be disappointed about
the re-indexing part.
I'm pretty sure that WordDelimiterFilterFactory is tokenizing
your input in ways you don't expect, making your use-case
hard to accomplish.
It's basically splitting your input on all non-alpha characters,
so you're indexing see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
I'd *strongly* suggest you examine the results of your indexing
in order to understand what's possible.
Get a copy of luke and examine your index or use the
SOLR admin Analysis page...
I suspect what you're really looking for is WhitespaceAnalyzer
or Keyword
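
For illustration only, here is a minimal sketch of a whitespace-only field type
of the kind suggested above. The type name "text_ws_exact" and the field name
"KeywordsExact" are hypothetical, and a field defined this way would have to be
populated by re-indexing. Because there is no WordDelimiterFilterFactory,
lowercasing, or stemming in the chain, a token such as SDD_Expedition_PCB
should be indexed as-is, which also makes matching case-sensitive:

    <fieldType name="text_ws_exact" class="solr.TextField"
               positionIncrementGap="100">
      <!-- single analyzer used at both index and query time:
           split on whitespace only, no further filtering -->
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <field name="KeywordsExact" type="text_ws_exact" indexed="true"
           stored="false" required="false" multiValued="true"/>

Under these assumptions, KeywordsExact:SDD_Expedition_PCB would be expected to
match only that exact token, and KeywordsExact:SDD_Expedition* any token that
starts with that prefix.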
On Tue, Jan 19, 2010 at 9:50 AM, <[email protected]> wrote:
>

 I want the following searches to work:

  MyField:SDD_Expedition_PCB

 This should match the word "SDD_Expedition_PCB" only, and not matching
 individual words such as "SDD" or "Expedition", or "PCB".

 And the following search:

  MyField:SDD_Expedition*

 Should match any word starting with "SDD_Expedition" and ending with
 anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
 "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
 but not matching individual words such as "SDD" or "Expedition".


 The field type for "MyField" is (the field name is keywords):

    <field name="Keywords" type="text" indexed="true" stored="false"
 required="false" multiValued="true"></field>

 And here is the analyzer I'm using:

    <fieldType name="text" class="solr.TextField"
 positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
 synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="0" generateNumberParts="1" catenateWords="1"
 catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
 protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- <filter class="solr.SynonymFilterFactory"
 synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="0" generateNumberParts="1" catenateWords="1"
 catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
 protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

 Any help on how I can achieve the above is greatly appreciated.

 Btw, if at all possible, I would like to be able to achieve this search
 without having to change how I'm indexing / tokenizing the data.  I'm
 looking for search syntax to make this work.

 -- JM

 -----Original Message-----
 From: Ahmet Arslan [mailto:[email protected]]
 Sent: Tuesday, January 19, 2010 7:57 AM
 To: [email protected]
 Subject: Re: Tokenization and wild card search

 > I have an issue and I'm not sure how to address it, so I
 > hope someone can help me.
 >
 > I have the following text in one of my fields:
 > "ABC_Expedition_ERROR".  When I search on it
 > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
 > fail to find me only this word "ABC_Expedition_ERROR"
 > which I think is due to tokenization because of the
 > underscore.

 Do you want or do not want your query MyField:SDD_Expedition_PCB to return
 documents containing ABC_Expedition_ERROR?

 > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
 > outer quotes, but quotes around the word
 > "ABC_Expedition_ERROR").  This works fine.
 > But then, how do I search on "SDD_Expedition_PCB" with wild
 > card?  For example: "MyField:SDD_Expedition*" will not
 > work.

 Can you paste your field type of MyField? And give some examples what
 queries should return what documents.
