Hmm, I was suggesting to put TokenCountingFilter at the end of both
indexing and query chains for the same (e.g. name_count) field.

Then, the search would be something like (warning, major syntax errors):
.../select?
queryname=CENTURY BANCORP, INC&
q=*:*
fq={!eDisMax v=queryname mm=100%}name&
fq={!complexphrase inOrder=true df=name_count v=queryname}

So, the name_count would do the token match and it would allow for
synonyms of "INC" vs "INCORPORATED" as usual, if needed.

Regards,
   Alex.

On 21 September 2018 at 10:45, Erick Erickson <erickerick...@gmail.com> wrote:
> A variant on Alexandre's approach is:
> at index time, count the tokens that will be produced yourself (this
> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> in your analysis for instance).
> Put the number of tokens in a separate field
> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> +tokens_in_company_name_field:3
>
> You don't need phrase queries with this approach, order doesn't matter.
>
> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> BANCORP, INCORPORATED." match?
>
> Again, though, this means your indexing code has to do the same thing
> as your analysis chain. Which isn't very hard if the analysis chain is
> simple. I might use a char _filter_ factory to remove all
> non-alphanumeric characters, then a whitespace tokenizer and
> (probably) a lowercasefilter. That's pretty easy to replicate in order
> to count tokens.
>
> Best,
> Erick
> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
>>
>> I think you can match everything in the query to the field using either
>> 1) disMax/eDisMax with mm=100%
>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>> 2) Complex Phrase Query Parser with inOrder=false:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>
>> The number of tokens though is hard. You only know what your tokens
>> are at the end of the indexing pipeline. And during search, the tokens
>> are looked up from their indexes and only then the documents are
>> looked up.
>>
>> You may be able to do this with custom Postfilter that would run after
>> everything else to just reject records with extra tokens. That would
>> not be too expensive.
>>
>> Or (possibly simpler way) you could try to precalculate things, by
>> writing a custom TokenFilter that takes a stream and returns token
>> count to be used as a copyField target. Then you send your query to
>> the same field with any full-query preserving syntax, either as a
>> phrase or as a field query parser:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>
>> I would love to know if any/all of this works for you.
>>
>> Regards,
>>    Alex.
>>
>> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
>> > Hi,
>> >
>> > I have to search for company names where my first requirement is to find
>> > only exact matches on the company name.
>> >
>> > For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
>> > CENTURY BANCORP, INC."
>> > because the result company has the extra keyword "NEW".
>> >
>> > I can't use exact match because the sequence of tokens may differ.
>> > Basically
>> > I need to find results where the tokens are the same in any order and the
>> > number of tokens match.
>> >
>> > I have no idea if it's possible as include in the query the number of
>> > tokens
>> > and solr field has that info within to match it.
>> >
>> > Thanks for your help
>> > Sergio
>> >
>> >
>> >
>> > --
>> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
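
For illustration, Erick's variant from the thread (count the tokens in the indexing code, store the count in a separate field, then require both the terms and the count at query time) could be sketched roughly as below. This is only a sketch under assumptions: it assumes the Solr analysis chain for the field really is just "remove non-alphanumeric characters, split on whitespace, lowercase", and the helper names `analyze`, `build_document` and `build_query` are hypothetical. Only the field names `company_name` and `tokens_in_company_name_field` come from the thread.

```python
import re

# Assumption: the field's analysis chain is exactly
#   char filter (drop non-alphanumerics) -> whitespace tokenizer -> lowercase.
# Whitespace is kept so that the tokenizer still has something to split on.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z\s]")


def analyze(name):
    """Replicate the (assumed) analysis chain and return the token list."""
    cleaned = _NON_ALNUM.sub("", name)
    return [token.lower() for token in cleaned.split()]


def build_document(doc_id, name):
    """Build the document to index, with the token count precomputed."""
    return {
        "id": doc_id,
        "company_name": name,
        "tokens_in_company_name_field": len(analyze(name)),
    }


def build_query(name):
    """Build a q= value that requires every token and the exact token count."""
    tokens = analyze(name)
    required_terms = " ".join("+" + token for token in tokens)
    return "+company_name:({}) +tokens_in_company_name_field:{}".format(
        required_terms, len(tokens)
    )


if __name__ == "__main__":
    print(build_document("1", "NEW CENTURY BANCORP, INC."))
    print(build_query("CENTURY BANCORP, INC."))
```

With this, "NEW CENTURY BANCORP, INC." is indexed with a count of 4, so a query for "CENTURY BANCORP, INC." (count 3) would not match it, while any reordering of the same three tokens would. The sketch does nothing about the "INC" vs "INCORPORATED" question raised above; synonyms would have to be handled the same way on both sides.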