How about sorting the tokens in alphabetical order both for indexing and query, then using the sentinel trick?
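
A minimal Python sketch of that idea, under stated assumptions: the normalization (replace non-alphanumerics with a space, lowercase, split on whitespace) is an assumed simple analysis chain, and sentinel_key is a hypothetical helper name, not anything in Solr. The same function would be applied to the company name at index time and to the query string, and the result searched as a phrase, so a match requires every token between the two sentinels to line up.

```python
import re

# Marker token wrapped around the sorted tokens. Assumption: it never
# occurs as a real token in a company name.
SENTINEL = "SENTINEL"


def sentinel_key(name: str) -> str:
    """Normalize a name into 'SENTINEL <sorted tokens> SENTINEL'."""
    # Assumed analysis chain: turn punctuation into spaces, lowercase,
    # then split on whitespace.
    cleaned = re.sub(r"[^0-9A-Za-z]+", " ", name)
    tokens = sorted(cleaned.lower().split())
    return " ".join([SENTINEL, *tokens, SENTINEL])


if __name__ == "__main__":
    print(sentinel_key("CENTURY BANCORP, INC"))
    print(sentinel_key("NEW CENTURY BANCORP, INC."))
```

With this, "NEW CENTURY BANCORP, INC." becomes SENTINEL bancorp century inc new SENTINEL, which a phrase query for SENTINEL bancorp century inc SENTINEL does not match, while any reordering of the same three tokens does.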

Source text: CENTURY BANCORP, INC
Solr text: SENTINEL bancorp century inc SENTINEL

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 21, 2018, at 8:20 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> Hmm, I was suggesting to put TokenCountingFilter at the end of both
> indexing and query chains for the same (e.g. name_count) field. Then,
> the search would be something like (warning, major syntax errors):
> .../select?
> queryname=CENTURY BANCORP, INC&
> q=*:*
> fq={!eDisMax v=queryname mm=100%}name&
> fq={!complexphrase inOrder=true df=name_count v=queryname}
>
> So, the name_count would do the token match and it would allow for
> synonyms of "INC" vs "INCORPORATED" as usual, if needed.
>
> Regards,
>    Alex.
>
> On 21 September 2018 at 10:45, Erick Erickson <erickerick...@gmail.com> wrote:
>> A variant on Alexandre's approach is:
>> at index time, count the tokens that will be produced yourself (this
>> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
>> in your analysis for instance).
>> Put the number of tokens in a separate field
>> At query time, you'd search q=+company_name:(+century +bancorp +inc)
>> +tokens_in_company_name_field:3
>>
>> You don't need phrase queries with this approach, order doesn't matter.
>>
>> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
>> BANCORP, INCORPORATED." match?
>>
>> Again, though, this means your indexing code has to do the same thing
>> as your analysis chain. Which isn't very hard if the analysis chain is
>> simple. I might use a char _filter_ factory to remove all
>> non-alphanumeric characters, then a whitespace tokenizer and
>> (probably) a lowercasefilter. That's pretty easy to replicate in order
>> to count tokens.
>>
>> Best,
>> Erick
>> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
>> <arafa...@gmail.com> wrote:
>>>
>>> I think you can match everything in the query to the field using either
>>> 1) disMax/eDisMax with mm=100%
>>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>>> 2) Complex Phrase Query Parser with inOrder=false:
>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>>
>>> The number of tokens though is hard. You only know what your tokens
>>> are at the end of the indexing pipeline. And during search, the tokens
>>> are looked up from their indexes and only then the documents are
>>> looked up.
>>>
>>> You may be able to do this with custom Postfilter that would run after
>>> everything else to just reject records with extra tokens. That would
>>> not be too expensive.
>>>
>>> Or (possibly simpler way) you could try to precalculate things, by
>>> writing a custom TokenFilter that takes a stream and returns token
>>> count to be used as a copyField target. Then you send your query to
>>> the same field with any full-query preserving syntax, either as a
>>> phrase or as a field query parser:
>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>>
>>> I would love to know if any/all of this works for you.
>>>
>>> Regards,
>>>    Alex.
>>>
>>> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I have to search for company names where my first requirement is to find
>>>> only exact matches on the company name.
>>>>
>>>> For instance if I search for "CENTURY BANCORP, INC." I shouldn't find
>>>> "NEW CENTURY BANCORP, INC." because the result company has the extra
>>>> keyword "NEW".
>>>>
>>>> I can't use exact match because the sequence of tokens may differ.
>>>> Basically I need to find results where the tokens are the same in any
>>>> order and the number of tokens match.
>>>>
>>>> I have no idea whether it's possible to include the number of tokens
>>>> in the query and have a Solr field that holds that count to match
>>>> against.
>>>>
>>>> Thanks for your help
>>>> Sergio
>>>>
>>>> --
>>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html