Re: Internals of Analysis and Token Matching

Alexandre Rafalovitch Mon, 17 Nov 2014 13:59:03 -0800

Are you trying to match phone numbers despite the
spaces/dashes/brackets? By prefix? Suffix?


If so, you may look at something more like:

<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
        replacement="" replace="all"/>
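
For context, the tokenizer and filter above would sit inside a field type roughly like this. It is only a sketch: the name "phone_digits" is made up for illustration, and I have not tested it against your schema.

```xml
<!-- Hypothetical field type; the name "phone_digits" is an assumption -->
<fieldType name="phone_digits" class="solr.TextField">
  <analyzer>
    <!-- Keep the whole input as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Strip every non-digit, so "(404) 886-0461" becomes "4048860461" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
            replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index and query time, differently punctuated forms of one number should all reduce to the same single token.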

And remember, if you are using ngrams, you probably want them in the
index-chain of the analyzer, but not in the query-chain. Otherwise,
the query gets split into ngrams as well, and you will be matching on
anything that shares as few as 3 consecutive characters with it.
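
A sketch of that split, reusing the digit-only normalization from above. The field type name "phone_digits_ngram" is made up, and the gram sizes are simply copied from your config:

```xml
<!-- Hypothetical field type; the name "phone_digits_ngram" is an assumption -->
<fieldType name="phone_digits_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
            replacement="" replace="all"/>
    <!-- Generate n-grams at index time only -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
            replacement="" replace="all"/>
    <!-- No n-gram filter here: the query stays one term -->
  </analyzer>
</fieldType>
```

With a setup like this, a query such as "886-0461" should be reduced to the single term "8860461", which matches only documents whose indexed grams contain that exact string.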

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 17 November 2014 16:43, Pritesh Patel <priteshpate...@gmail.com> wrote:
> Hi Community.
>
> Hoping someone can help explain this ...
>
> Once all the analysis is done on a field, all the tokens that identify that
> field are stored.  What else affects whether a document matches, beyond a
> simple token match and the frequency of the terms that match?
>
> All the searches I did produce the same tokens (verified by using the
> analysis screen in the admin, and looking at the terms indexed in solr
> through the schema browser for field).  But some match and some don't when
> I actually do the search.  I don't know why some of the searches don't
> match even though everything in the analysis tells me they have the same
> tokens.  What am I missing?
>
> *Descriptions*
>
> *Indexed in a field*: "4048860461"
>
> *Searches that Match*
> "4048860461"
> "(404)8860461"
>
> *Searches that don't match*
> "404-886-0461"
> "404)8860461"
> "404)886)0461"
>
> *Field analysis*
> Field analysis is pretty simple: I just used the "text_en_splitting_tight"
> field type and added an "ngram" filter to it.  See below.
>
> <fieldType name="text_en_splitting_tight_ngram" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="false"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_en.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
>     <!-- this filter can remove any duplicate tokens that appear at the same
>          position - sometimes possible with WordDelimiterFilter in
>          conjunction with stemming. -->
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
