... But this is not really an elegant solution: we would need to do a find-and-replace on the whole search string, and depending on which tokenizer was used, we would need to replace different characters.
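The find-and-replace workaround described above could be sketched as follows. This is only a minimal client-side illustration, not Solr code: the delimiter set and the function name are hypothetical, and a field with a different tokenizer would need a different delimiter set.

```python
import re

# Hypothetical set of special characters that the index-side tokenizer
# splits on; a different analyzer would require a different set.
DELIMITERS = r"[-@/]"

def to_wildcard_query(user_input: str) -> str:
    """Mimic the index-side tokenization in the client: lowercase (since
    wildcard terms bypass the analyzer's lowercase filter), replace the
    special characters with spaces, then append "*" to the last token so
    the still-unfinished word matches as a prefix."""
    normalized = re.sub(DELIMITERS, " ", user_input.lower())
    tokens = normalized.split()
    if not tokens:
        return ""
    tokens[-1] += "*"
    return "content:(%s)" % " ".join(tokens)

print(to_wildcard_query("world-wid"))   # content:(world wid*)
```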
On 2024/01/24 08:10:41 [email protected] wrote:
> Hello all
>
> We have fields on documents that contain special characters (e.g. "-", "@"
> or "/"). Now when we search on such a field we encounter some weird
> behavior. Hopefully you guys can help.
>
> Let's suppose we have the following document:
>
> {
>   "id": "id_number",
>   "content": "Hello World-Wide-Web"
> }
>
> Now we want to search for substrings in "content".
> When a user types "Hello World-Wide-Web" we obviously want to find the
> document. This works flawlessly.
>
> Now we could just use a field of type string. But a typical user obviously
> wants to find the document without typing in the whole string, and they
> also want the search to be case-insensitive.
>
> * Whenever there is a query written out, that's not necessarily the query
> that the user typed themselves, but the query that is generated for them.
>
> So instead of type string we use type text. This will tokenize our field,
> so we now have the tokens: Hello, World, Wide and Web.
> Now a user can search for "woRLd hELlo" (tokens are found in any order,
> and in random case) and they will still find the document.
> That's pretty good already. But in our case we want to allow incomplete
> searches to find documents as well.
> For that we thought that the way to go is to use the wildcard "*".
> This normally works just fine:
>
> "content":(Hello wor*)
>
> The user types in "hello wor" and, before finishing their sentence, the
> document is found.
>
> But now the user searches for world-wide:
>
> "content":world-wide
>
> They will obviously find the document. But before the user finishes typing
> the second word, they would also expect the document to be found. However,
>
> "content":world-wid*
>
> does not work. Our supposition is that the query uses the same tokenizer
> that was used when indexing the content (that's why "world-wide" works the
> same as "world wide"). But the moment we use wildcards this assumption
> fails. Somehow Solr is no longer able to tokenize the query.
> Now of course we could just replace all special characters with a space.
> But this
>
> What are we doing wrong? Is there a gotcha that we need to take into
> account when using wildcards with special characters?
>
> With kind regards
>
> Dario Viva
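The behavior described in the quoted mail can be reproduced with a toy simulation. This is not Solr's actual code; a simple letter tokenizer stands in for the field's analyzer. The point it illustrates: a plain query term is analyzed just like the indexed content, but a wildcard term is passed through as a single unanalyzed pattern, so "world-wid*" is compared against whole tokens and never matches.

```python
import fnmatch
import re

def analyze(text: str) -> list[str]:
    """Stand-in for the field's analyzer: split on non-letters, lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

# Tokens produced when the example document was indexed.
index_tokens = analyze("Hello World-Wide-Web")  # ['hello','world','wide','web']

def matches(query: str) -> bool:
    if "*" in query:
        # Wildcard terms skip analysis: the raw pattern (only lowercased)
        # is matched against each indexed token as a whole.
        pattern = query.lower()
        return any(fnmatch.fnmatchcase(tok, pattern) for tok in index_tokens)
    # Plain queries are analyzed like the content: every resulting token
    # must be present, in any order.
    return all(t in index_tokens for t in analyze(query))

print(matches("world-wide"))   # True:  analyzed into ['world', 'wide']
print(matches("wor*"))         # True:  pattern 'wor*' matches token 'world'
print(matches("world-wid*"))   # False: the unanalyzed pattern 'world-wid*'
                               #        matches no single token
```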
