How about sorting the tokens in alphabetical order at both index and query time,
then using the sentinel trick?

Source text: CENTURY BANCORP, INC

Solr text: SENTINEL bancorp century inc SENTINEL
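
A minimal sketch of that normalization, assuming it is done in client code
before both indexing and querying. The function name, the regex-based
tokenization, and the literal marker string are illustrative assumptions,
not part of any Solr API:

```python
import re

# Hypothetical marker token; it must never occur in a real company name.
SENTINEL = "SENTINEL"


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, sort the tokens, and wrap them in sentinels."""
    # Replace every run of non-alphanumeric characters with a space, then split.
    tokens = re.sub(r"[^0-9a-z]+", " ", name.lower()).split()
    tokens.sort()
    return " ".join([SENTINEL, *tokens, SENTINEL])


indexed = normalize_name("NEW CENTURY BANCORP, INC.")
query = normalize_name("CENTURY BANCORP, INC")

print(query)             # SENTINEL bancorp century inc SENTINEL
print(query in indexed)  # False: the extra token "new" breaks the phrase
```

Sent as a phrase query against a field holding the normalized text, the
leading and trailing sentinels mean the phrase can only match when the
token sets are identical, whatever order they were in originally.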

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 21, 2018, at 8:20 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> 
> Hmm, I was suggesting to put TokenCountingFilter at the end of both
> indexing and query chains for the same (e.g. name_count) field. Then,
> the search would be something like (warning, major syntax errors):
> .../select?
> queryname=CENTURY BANCORP, INC&
> q=*:*&
> fq={!edismax qf=name mm=100% v=$queryname}&
> fq={!complexphrase inOrder=true df=name_count v=$queryname}
> 
> So, the name_count would do the token match and it would allow for
> synonyms of "INC" vs "INCORPORATED" as usual, if needed.
> 
> Regards,
>   Alex.
> 
> On 21 September 2018 at 10:45, Erick Erickson <erickerick...@gmail.com> wrote:
>> A variant on Alexandre's approach is:
>> - At index time, count the tokens that will be produced yourself (this
>>   may be a little tricky; you shouldn't have WordDelimiterFilterFactory
>>   in your analysis, for instance).
>> - Put the number of tokens in a separate field.
>> - At query time, you'd search
>>   q=+company_name:(+century +bancorp +inc) +tokens_in_company_name_field:3
>> 
>> You don't need phrase queries with this approach, because order doesn't matter.
>> 
>> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
>> BANCORP, INCORPORATED." match?
>> 
>> Again, though, this means your indexing code has to do the same thing
>> as your analysis chain. Which isn't very hard if the analysis chain is
>> simple. I might use a char _filter_ factory to remove all
>> non-alphanumeric characters, then a whitespace tokenizer and
>> (probably) a lowercasefilter. That's pretty easy to replicate in order
>> to count tokens.
>> 
>> Best,
>> Erick
>> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
>> <arafa...@gmail.com> wrote:
>>> 
>>> I think you can match everything in the query to the field using either
>>> 1) disMax/eDisMax with mm=100%
>>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>>> 2) Complex Phrase Query Parser with inOrder=false:
>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>> 
>>> The number of tokens though is hard. You only know what your tokens
>>> are at the end of the indexing pipeline. And during search, the tokens
>>> are looked up from their indexes and only then the documents are
>>> looked up.
>>> 
>>> You may be able to do this with a custom PostFilter that would run after
>>> everything else, just to reject records with extra tokens. That would
>>> not be too expensive.
>>> 
>>> Or (possibly a simpler way) you could try to precalculate things by
>>> writing a custom TokenFilter that takes a stream and returns the token
>>> count, to be used as a copyField target. Then you send your query to
>>> the same field with any full-query preserving syntax, either as a
>>> phrase or as a field query parser:
>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>> 
>>> I would love to know if any/all of this works for you.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I have to search for company names where my first requirement is to find
>>>> only exact matches on the company name.
>>>> 
>>>> For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
>>>> CENTURY BANCORP, INC."
>>>> because the result company has the extra keyword "NEW".
>>>> 
>>>> I can't use exact match because the sequence of tokens may differ.
>>>> Basically I need to find results where the tokens are the same in any
>>>> order and the number of tokens match.
>>>> 
>>>> I have no idea whether it's possible to include the number of tokens in
>>>> the query and have a Solr field that holds the same count to match
>>>> against.
>>>> 
>>>> Thanks for your help
>>>> Sergio
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
