Hello solr-users! I'm a bit stumped and after some days of trial-and-error, I've come to the conclusion that I cannot figure this out by myself.
Where I'm at:

Solr 7.7 in cloud mode:
- 3 shards,
- replication factor 1,
- 1 shard per node,
- 3 nodes,
- coordinated with an external ZooKeeper,
- running on three different VMs.

What I do:

I'm building a search backend for academic citations; among the most important data are the authors. They are stored as

.. managed-schema:
.
.  <fieldType name="important_strings" class="solr.StrField"
.             sortMissingLast="true" docValues="true" indexed="true" stored="true"
.             multiValued="true"/>
.
.  <field name="author" type="important_strings"/>
.

and a random sample from the relevant data:

  "author":["Stefan Diepenbrock", "Timo Ropinski", "Klaus H. Hinrichs"],

What I'd like to achieve:

I'd like to provide (auto-complete) suggestions based on the names.

Starting with the easy case:
----------------------------

Someone sends a query for 'diepen'. I want a case-insensitive match on all authors having 'diepen' as a prefix of one of their (sur-)names. In this example, that matches 'Stefan [Diepen]brock'.

I got this working by defining a new field type for the suggester

.. managed-schema:
.
.  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="1000">
.    <analyzer type="index">
.      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
.      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
.    </analyzer>
.    <analyzer type="query">
.      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
.    </analyzer>
.  </fieldType>
.

and using that in the searchComponent

.. solrconfig.xml:
.
.  <searchComponent name="authorsuggest" class="solr.SuggestComponent">
.    <lst name="suggester">
.      <str name="name">default</str>
.      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
.      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
.      <str name="field">author</str>
.      <str name="allTermsRequired">true</str>
.      <str name="highlight">true</str>
.      <str name="minPrefixChars">4</str>
.      <str name="suggestAnalyzerFieldType">text_prefix</str>
.      <str name="buildOnStartup">false</str>
.    </lst>
.  </searchComponent>
.
.  <requestHandler name="/authors" class="solr.SearchHandler" startup="lazy">
.    <lst name="defaults">
.      <str name="suggest">true</str>
.      <str name="suggest.count">10</str>
.    </lst>
.    <arr name="components">
.      <str>authorsuggest</str>
.    </arr>
.  </requestHandler>
.

After building with

  curl 'http://localhost:8983/solr/dblp/authors?suggest.build=true'

this will yield something along the lines of

.. curl 'http://localhost:8983/solr/dblp/authors?suggest.q=Diepen'
. {
.   "suggest":{"default":{
.     "Diepen":{
.       "numFound":10,
.       "suggestions":[{
.           "term":"M. Diepenhorst",
.           "weight":0,
.           "payload":""},
.         {
.           "term":"Sjoerd Diepen",
.           "weight":0,
.           "payload":""},
.         {
.           "term":"Stefan Diepenbrock",
.           "weight":0,
.           "payload":""},
.         {
.         /* abbreviated */
.

This might all have worked out by accident. So if you see something weird: this is what I ended up with after running against this wall and trying out different things.

Now the tricky part:
--------------------

If someone types two prefixes of an author's name, 'Stef Diep' or 'Diep Stef', I want to match these whitespace-separated prefixes against all names of the author and rank the results where *both* prefixes match before the others. With the current setup, however, curl yields:

.. curl 'http://localhost:8983/solr/dblp/authors?suggest.q=Stef%20Diep'
. {
.   "suggest":{"default":{
.     "Stef Diepen":{
.       "numFound":10,
.       "suggestions":[{
.           "term":"J. Gregory Steffan",
.           "weight":0,
.           "payload":""},
.         {
.           "term":"Stefano Spaccapietra",
.           "weight":0,
.           "payload":""},
.         {
.         /* abbreviated */
.

even when providing the full name as "suggest.q=Stefan%20Diepenbrock".

Other stuff that's weird:
- I'm getting duplicates, like ten times the same name
- Suggester results are non-deterministic

These are not as important, and I guess they are due to running in cloud mode.
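Coming back to the tricky part: to pin down the ranking I'm after, here is a rough sketch in plain Python. It is not Solr code, the function names are made up for illustration, and the tokenization only approximates what LowerCaseTokenizerFactory does (lowercase, split at non-letters). It just states the behaviour I'd like the suggester to have.

```python
import re


def tokens(text):
    # Lowercase, then keep runs of letters only (assumed approximation
    # of LowerCaseTokenizerFactory; digits and punctuation split tokens).
    return re.findall(r"[^\W\d_]+", text.lower())


def matched_prefixes(query, name):
    # Count how many query prefixes are a prefix of at least one name token.
    name_tokens = tokens(name)
    return sum(
        1
        for prefix in tokens(query)
        if any(tok.startswith(prefix) for tok in name_tokens)
    )


def rank(query, names):
    # Drop names that match no prefix at all; names matching more
    # prefixes come first. sorted() is stable, so ties keep input order.
    scored = [(matched_prefixes(query, n), n) for n in names]
    hits = [(score, n) for score, n in scored if score > 0]
    return [n for score, n in sorted(hits, key=lambda pair: -pair[0])]


names = [
    "J. Gregory Steffan",
    "Stefano Spaccapietra",
    "Stefan Diepenbrock",
    "Timo Ropinski",
]
print(rank("Stef Diep", names))
```

With the sample names above, 'Stef Diep' and 'Diep Stef' should both put "Stefan Diepenbrock" first, followed by the names that only match 'Stef'.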
I've tried:
- reading
  - through some of the Lucene JavaDocs, since the solr-ref-guide is a bit sparse on information about the variables,
  - the ref-guide, over and over,
  - many blogs based on old Solr versions (ab)using spellcheck for suggestions,
  - and several other pages I found,
- other combinations of analyzers, tokenizers and filters,
- other Dict and Lookup implementations (the wrong ones?),

but no such luck.

I hope I did not leave anything relevant out.

regards,
-1