I've come across something like this as well. I can't remember where, but it was often related to synonym functionality.
The following link shows a 3rd party QueryParser that seems to deal with synonyms alongside edismax, and may be interesting to look at:
http://wiki.apache.org/solr/QueryParser

It is also mentioned as an issue when using the SynonymFilterFactory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

"The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, and will not know that they match a synonym".

Maybe the extended support for synonym handling is what will give us the solution one day. For now I have solved my problem and will leave it at that.

*Aleksander Akerø*
Systems Consultant
Mobile: 944 89 054
E-mail: [email protected]

*Gurusoft AS*
Phone: 92 44 09 99
Østre Kullerød
www.gurusoft.no

2014-01-30 Jack Krupansky <[email protected]>:

> I vaguely recall that there was a Jira floating around for multi-word
> synonyms that dealt with parsing of spaces as well. And Robert Muir has
> (repeatedly) referred to this query parser feature as a "bug". Somehow,
> eventually, I think it will be dealt with, but the "difficulty" remains for
> now.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Aleksander Akerø
> Sent: Thursday, January 30, 2014 9:31 AM
> To: [email protected]
> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>
> Yes, I actually noted that about the filter vs. tokenizer. It's easy to get
> confused if you don't have a good understanding of the differences between
> tokenizers and filters.
>
> As for the query parser problem, there's always a workaround, but it was
> nice to be made aware of. It sort of was a ghost-like problem before.
> Although it would be great to have the opportunity to "disable" the
> splitting on whitespace even for DisMax, I understand that it is probably
> not the most wanted feature for the next Solr release :)
>
> *Aleksander Akerø*
>
> 2014-01-30 Erick Erickson <[email protected]>:
>
>> Note, the comments about LowerCaseTokenizer were a red herring. You were
>> using LowerCaseFilterFactory. Note "Filter" rather than "Tokenizer". So it
>> would just do what you expected, lowercase the entire input. You would
>> have used LowerCaseTokenizerFactory in place of KeywordTokenizerFactory,
>> not as a Filter.
>>
>> As for the rest, I expect Jack is right, it's the query parsing above
>> the field input.
>>
>> Best
>> Erick
>>
>> On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø
>> <[email protected]> wrote:
>> > Hi Srinivasa
>> >
>> > Yes, I've come to understand that the analyzers will never "see" the
>> > whitespace, thus no need for pattern replacement, like Jack points out.
>> > So the solution would be to set which parser to use for the query. Also,
>> > Jack has pointed out that the "field" query parser should work in this
>> > particular setting -> http://wiki.apache.org/solr/QueryParser
>> >
>> > My problem, though, was that I needed this for only one of the fields in
>> > the schema; for all the other fields, e.g. name, description etc., I
>> > would very much like to make use of the eDisMax functionality. And it
>> > seems that only one query parser can be defined per query, in other
>> > words: for all fields. Jack, you may correct me if I'm wrong here :)
>> >
>> > This particular customer wanted a wildcard search at both ends of the
>> > phrase, and that sort of complicated the problem.
>> > And therefore I chose to replace all whitespace for this field in SQL at
>> > index time, using the DIH, and then to use EdgeNGramFilterFactory on both
>> > sides of the keyword, as in the config below. That seemed to work pretty
>> > nicely.
>> >
>> > <!-- #### WildCard search number #### -->
>> > <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer type="index">
>> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front"/>
>> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back"/>
>> >   </analyzer>
>> >   <analyzer type="query">
>> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> > I also added a bit of extra weighting for the "keyword" field so that
>> > exact matches received a higher score.
>> >
>> > What this solution doesn't do is exclude values like "EE 009" when
>> > searching for "FE 009", but they come far down the list, which is OK for
>> > the customer, because usually these results are somewhat related and
>> > within the same category.
>> >
>> > *Aleksander Akerø*
>> >
>> > 2014-01-30 Jack Krupansky <[email protected]>
>> >
>> >> The standard, keyword-oriented query parsers will all treat unquoted,
>> >> unescaped white space as term delimiters and ignore the white space.
>> >> There is no way to bypass that behavior. So, your regex will never even
>> >> see the white space - unless you enclose the text and white space in
>> >> quotes or use a backslash to escape each white space character.
>> >>
>> >> You can use the "field" and "term" query parsers to pass a query string
>> >> as if it were fully enclosed in quotes, but that only handles a single
>> >> term and does not allow for multiple terms or any query operators. For
>> >> example:
>> >>
>> >> {!field f=myfield}Foo Bar
>> >>
>> >> See:
>> >> http://wiki.apache.org/solr/QueryParser
>> >>
>> >> You can also pre-configure the field query parser with the defType=field
>> >> parameter.
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -----Original Message----- From: Srinivasa7
>> >> Sent: Thursday, January 30, 2014 6:37 AM
>> >> To: [email protected]
>> >> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>> >>
>> >> Hi,
>> >>
>> >> I have a similar kind of problem where I want to search for words with
>> >> spaces in them, and I wanted to search by stripping all the spaces.
>> >>
>> >> I have used the following schema for that:
>> >>
>> >> <fieldType name="nospaces" class="solr.TextField"
>> >>            autoGeneratePhraseQueries="true">
>> >>   <analyzer type="index">
>> >>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>     <filter class="solr.PatternReplaceFilterFactory"
>> >>             pattern="[^\w]+" replacement="" replace="all"/>
>> >>   </analyzer>
>> >>   <analyzer type="query">
>> >>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>     <filter class="solr.PatternReplaceFilterFactory"
>> >>             pattern="[^\w]+" replacement="" replace="all"/>
>> >>   </analyzer>
>> >> </fieldType>
>> >>
>> >> And
>> >>
>> >> <field name="text_nospaces" type="nospaces" indexed="true" stored="true"
>> >>        omitNorms="true" />
>> >> <copyField source="text" dest="text_nospaces" />
>> >>
>> >> But it is not finding the right terms. We are stripping the spaces and
>> >> indexing lowercase values when we do that.
>> >>
>> >> Like: East Enders
>> >>
>> >> When I search for the text 'east end ers', it returns no values, saying
>> >> no document found.
>> >>
>> >> I realised that Solr uses the QueryParser before passing the query
>> >> string to the query analyzer defined in the schema.
>> >>
>> >> And the query parser tokenizes the query string provided in the query,
>> >> so it sends each token separately to the query analyzer that is defined
>> >> in the schema.
>> >>
>> >> So is there any way that I can bypass this query parser, or use a
>> >> correct query processor that can consider the entire string as a single
>> >> phrase?
>> >>
>> >> At the moment I am using the dismax query processor.
>> >>
>> >> Any suggestion would be much appreciated.
>> >>
>> >> Thanks
>> >> Srinivasa
>> >>
>> >> --
>> >> View this message in context:
>> >> http://lucene.472066.n3.nabble.com/KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
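
For anyone reading this thread later, here is a rough Python sketch of the effect discussed above. It is only a simulation, not Solr code: analyze() imitates the "nospaces" chain from the schema (keyword tokenizer, lowercase, strip non-word characters), and the two parser functions are hypothetical stand-ins for a whitespace-splitting parser such as dismax and for the {!field} parser. The function names are made up for illustration.

```python
import re
from typing import List


def analyze(text: str) -> str:
    """Imitate the 'nospaces' analyzer: treat the input as one token,
    lowercase it, then remove every run of non-word characters."""
    # re.ASCII approximates Java's default (ASCII-only) meaning of \w.
    return re.sub(r"[^\w]+", "", text.lower(), flags=re.ASCII)


def whitespace_splitting_parser(query: str) -> List[str]:
    """Stand-in for a parser that splits on whitespace first, so the
    analyzer only ever sees one chunk at a time."""
    return [analyze(chunk) for chunk in query.split()]


def field_parser(query: str) -> List[str]:
    """Stand-in for {!field f=...}: the whole query string is handed to
    the analyzer in one piece."""
    return [analyze(query)]


indexed_term = analyze("East Enders")
split_terms = whitespace_splitting_parser("east end ers")
whole_terms = field_parser("east end ers")

print("indexed:     ", indexed_term)
print("split parser:", split_terms, "match:", indexed_term in split_terms)
print("field parser:", whole_terms, "match:", indexed_term in whole_terms)
```

With the splitting parser the analyzer never receives the spaces, so the pattern replacement has nothing to remove and none of the three query terms equals the single indexed term. Passing the whole string through in one piece, as the "field" parser does, produces the same term on both the index and the query side.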
