Re: Exact search with special characters

Jack Krupansky Mon, 25 Aug 2014 05:21:55 -0700

To be honest, I'm not precisely sure what Google is really doing under the hood, since there is no detailed spec publicly available. We know that quotes do force a phrase search in Google, but do they disable stemming or preserve case and special characters? Unknown. My PERCEPTION of Google is that it does disable stemming but continues to be case insensitive and to ignore special characters in quoted phrases, but I don't see that behavior documented in Google's search help. IOW, trying to fall back on a precise definition from Google won't help us here: we don't have a clear view of "Exact search with special characters" for Google itself.

Bottom line: If you want to search both with and without special characters, that will have to be done with separate fields with separate analyzers.
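
A minimal sketch of that two-field setup, assuming a source field called name and a second field called name_exact (both field names, and the text_exact type, are hypothetical and not taken from the schema in this thread). The whitespace tokenizer keeps special characters inside tokens, so test_host and $host stay intact:

```xml
<!-- Sketch only: "name", "name_exact" and "text_exact" are assumed names. -->
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="name_exact" type="text_exact" indexed="true" stored="false"/>

<!-- Index the same source text a second time with the stricter analyzer. -->
<copyField source="name" dest="name_exact"/>

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on white space only; "_", "-" and "$" remain part of the token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Keeps matching case insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With something like this, a query such as name_exact:"test host" should match only the "Test host" document, while queries against the original field keep their looser behavior. Treat it as a starting point to verify on the Analysis screen of the admin UI, not as a tested configuration.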

You could use the combination of the keyword tokenizer and the ngram filter (at index time only) to support what YOU SEEM to be calling "exact match", but then you will need to specify that separate field name in addition to quoting the phrase. Or, just use a string field and then do wildcard or regex queries on that field for whatever degree of "exactness" you require.
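
The keyword tokenizer plus ngram filter combination might look roughly like the sketch below. The type name text_exact_ngram and the gram sizes are assumptions chosen for illustration; the gram sizes in particular need tuning for real data:

```xml
<!-- Sketch only: type name and gram sizes are assumed values. -->
<fieldType name="text_exact_ngram" class="solr.TextField">
  <analyzer type="index">
    <!-- The whole field value becomes one token, white space and symbols included. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every substring of 2 to 15 characters of that single token. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No ngram filter at query time: the quoted phrase stays one term. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The phrase has to be quoted because an unquoted query is normally split on white space by the query parser before the analyzer ever sees it. Note also that query terms shorter than minGramSize or longer than maxGramSize will find nothing in such a field, and that the index grows considerably.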


-- Jack Krupansky

-----Original Message-----
From: Shay Sofer
Sent: Monday, August 25, 2014 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Exact search with special characters

Hi,

Thanks for your reply.

I thought that Google search works the same way (quotes stand for exact match).

Example of my requirements:
Objects:
- test host
- test_host
- test $host
- test-host

When I search for test host, I should get all of the above results.

When I search for "test host", I should get only test host.

Also, when I search for a partial string like test or host, I should get all of the above results.

Thanks.

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, August 24, 2014 3:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact search with special characters

What precisely do you mean by the term "exact search"? I mean, Solr (and
Lucene) do not have that concept for tokenized text fields.

Or did you simply mean "quoted phrase"? In which case, you need to be aware that all the quotes do is ensure that the terms occur in that order, or in close proximity according to the default or specified "phrase slop" distance. But each term is still analyzed according to the analyzer for the field.

Technically, Lucene will in fact analyze the full quoted phrase as one stream, which for non-tokenized fields will be one term, but for any tokenized fields which split on white space, the phrase will be broken into separate tokens and special characters will tend to be removed as well. The keyword tokenizer will indeed treat the entire phrase as a single token, and the white space tokenizer will preserve special characters, but the standard tokenizer will not preserve either white space or special characters.

Nominally, the keyword tokenizer does generate a single term, at least at the tokenization stage, but the word delimiter filter then splits individual terms into multiple terms, thus guaranteeing that a phrase with white space will be multiple terms and that special characters are removed as well.

The other technicality is that quoting a phrase does prevent the phrase from being interpreted as query parser syntax, such as AND and OR operators or treating special characters as query parser operators.


But, the fact remains that a quoted phrase is not treated as an "exact"
string literal for any normal tokenized fields.

Out of curiosity, what references have led you to believe that a quoted phrase is an "exact match"?

Use a "string" (not "tokenized text") field if you wish to make an "exact match" on a literal string, but the concept of "exact match" is not supported for tokenized and filtered text fields.
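
As a rough illustration of the string-field route, assuming the source field is called name and the copy is called name_str (both names are hypothetical):

```xml
<!-- Sketch only: "name" and "name_str" are assumed field names. -->
<!-- StrField is not analyzed: the whole value is indexed as one verbatim term. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<field name="name_str" type="string" indexed="true" stored="false"/>
<copyField source="name" dest="name_str"/>
```

A query such as name_str:"Test host" then has to match the entire value character for character, including case, since no lowercasing is applied. Wildcard and regex queries (for example name_str:Test* or name_str:/Test.host/) can loosen that when needed.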

So, please describe, in plain English, plus examples, exactly what you expect your analyzer to do, both in terms of how it treats text to be indexed and how you expect to be able to query that text.

-- Jack Krupansky

-----Original Message-----
From: Shay Sofer
Sent: Sunday, August 24, 2014 5:58 AM
To: solr-user@lucene.apache.org
Subject: Exact search with special characters

Hi all,

I have docs that are indexed in a text field with the schema shown below.

I have these doc names:

- Test host
- Test_host
- Test-host
- Test $host

When I try to do an exact search like "test host", all of the docs above are returned as results.

How can I use exact match so that I get only one result?

I'd prefer to make my changes at search time, but if I need to change my schema, please suggest how.


Thanks,
Shay.


This is my schema:
       <fieldType name="text_general" class="solr.TextField"
                  positionIncrementGap="100">
           <analyzer type="index">
               <tokenizer class="solr.KeywordTokenizerFactory"/>
               <filter class="solr.WordDelimiterFilterFactory"
                       splitOnNumerics="0" splitOnCaseChange="0"
                       preserveOriginal="1"/>
               <filter class="solr.LowerCaseFilterFactory"/>
           </analyzer>
           <analyzer type="query">
               <tokenizer class="solr.KeywordTokenizerFactory"/>
               <filter class="solr.WordDelimiterFilterFactory"
                       splitOnNumerics="0" splitOnCaseChange="0"
                       preserveOriginal="1"/>
               <filter class="solr.LowerCaseFilterFactory"/>
           </analyzer>
       </fieldType>
