Re: Exact matches

Erick Erickson Sun, 09 Feb 2014 10:28:38 -0800

Whoa! My first bit of advice is to spend some time getting familiar
with the admin>>analysis page, because I suspect you're not
doing what you expect.

1> KeywordTokenizer does NOT break up the input stream, so an
input of "sony xperia c price" gets tokenized as "sony xperia c price",
NOT the words "sony" "xperia" "c" and "price".

2> You use PatternReplace to remove the punctuation etc.

3> You use EdgeNGrams to create tokens like
s, so, son, sony. But then you do NOT use EdgeNGrams in your
query section. So your queries are probably not very robust. The
NGrams are why your matching is odd.

At the end of all this, you have a single string that gets n-grammed,
then an additional PatternReplace is done. I don't think, for instance,
that you will be able to search for "xperia" and get a hit. I rather
doubt that's what you want, but you know better than me.
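
To make that concrete, here is the index-time chain from the quoted schema
again, annotated with what I'd expect each stage to emit for the input
"sony xperia c price". The comments are my reading of the config, not output
from a real run, so check them against the admin analysis page.

```xml
<analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- One token for the whole value: "sony xperia c price" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Nothing to replace in this input, still one token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement=" " replace="all"/>
    <!-- Prefixes of the whole string: "s", "so", "son", "sony", "sony ", "sony x", ... -->
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
    <!-- Spaces are stripped from each prefix: "sony x" becomes "sonyx", and so on -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*繥ǘŠ])" replacement="" replace="all"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
</analyzer>
```

Every indexed term is a prefix of the whole string, so no term starts at
"xperia", and a query for that word alone has nothing to match.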

So it looks to me like you started out using KeywordTokenizer and then
added a bunch of filters to try to make your results what you expect. It's
possible that the decision to use KeywordTokenizer led you down an
overly-complex path.

I'd start with one of the other tokenizers that breaks things up on
input, e.g. StandardTokenizer, WhitespaceTokenizer, etc., and build up
the analysis chain (i.e. the Filters) again, although I notice you have some
CJK characters in your PatternReplace, so WhitespaceTokenizer may not be
suitable. If you are analyzing CJK text, there are tokenizers built for that.
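
A minimal sketch of that starting point might look like the following. The
type name text_tokenized is made up for illustration, and which tokenizer and
filters you actually need depends on your data. Add filters back one at a
time and check each step on the analysis page.

```xml
<!-- Hypothetical field type: the name "text_tokenized" is an assumption -->
<fieldType name="text_tokenized" class="solr.TextField" positionIncrementGap="100">
    <!-- A single analyzer with no type attribute is used for both index and query -->
    <analyzer>
        <!-- Splits the input into words: "sony", "xperia", "c", "price" -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```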

All that said, you know your problem space waaaay better than me, so this
may all be complete nonsense.....

Best,
Erick

On Sun, Feb 9, 2014 at 9:17 AM, kumar <pavan2...@gmail.com> wrote:
> Hi,
>
> Whenever user types the search query like
>
>
> "sony xperia c" it has to match the results like
>
> sony xperia c price
> sony xperia c reviews
> sony xperia c photos
>
> but my search query displays
>
> Sony xperia act mobiles
> sony xperia ace mobiles
> sony xperia abc mobiles
>
>
>
> Can anybody help me how to do it.
>
> My schema is like the following....
>
>
>
> <field name="my_title" type="text_full" indexed="true" stored="false"
> multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />
>
>
>
> <fieldType name="text_full" class="solr.TextField">
>     <analyzer type="index">
>         <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.PatternReplaceFilterFactory"
> pattern="([\.,;:-_])" replacement=" " replace="all"/>
>         <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30"
> minGramSize="1"/>
>         <filter class="solr.PatternReplaceFilterFactory"
> pattern="([^\w\d\*繥ǘŠ])" replacement="" replace="all"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>     </analyzer>
>     <analyzer type="query">
>         <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.PatternReplaceFilterFactory"
> pattern="([\.,;:-_])" replacement=" " replace="all"/>
>         <filter class="solr.PatternReplaceFilterFactory"
> pattern="([^\w\d\*繥ǘŠ])" replacement="" replace="all"/>
>         <filter class="solr.PatternReplaceFilterFactory"
> pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>         <filter class="solr.SynonymFilterFactory" ignoreCase="true"
> synonyms="synonyms_fsw.txt" expand="true" />
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
> </fieldType>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Exact-matches-tp4116340.html
> Sent from the Solr - User mailing list archive at Nabble.com.
