... But this is not really an elegant solution: we would need to do a find-and-replace on the whole search string, and depending on which tokenizer was used, we would need to replace different characters.
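The find-and-replace workaround described above could be sketched as follows. This is only a minimal client-side illustration, not Solr code: the delimiter set and the function name are hypothetical, and a field with a different tokenizer would need a different delimiter set.

```python
import re

# Hypothetical set of special characters that the index-side tokenizer
# splits on; a different analyzer would require a different set.
DELIMITERS = r"[-@/]"

def to_wildcard_query(user_input: str) -> str:
    """Mimic the index-side tokenization in the client: lowercase (since
    wildcard terms bypass the analyzer's lowercase filter), replace the
    special characters with spaces, then append "*" to the last token so
    the still-unfinished word matches as a prefix."""
    normalized = re.sub(DELIMITERS, " ", user_input.lower())
    tokens = normalized.split()
    if not tokens:
        return ""
    tokens[-1] += "*"
    return "content:(%s)" % " ".join(tokens)

print(to_wildcard_query("world-wid"))   # content:(world wid*)
```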
On 2024/01/24 08:10:41 [email protected] wrote:
> Hello all
>
> We have fields on documents that contain special characters (e.g. "-", "@"
> or "/"). Now when we search on such a field we encounter some weird
> behavior. Hopefully you guys can help.
>
> Let's suppose we have the following document:
>
> {
>   "id": "id_number",
>   "content": "Hello World-Wide-Web"
> }
>
> Now we want to search for substrings in "content".
> When a user types "Hello World-Wide-Web" we obviously want to find the
> document. This works flawlessly.
>
> Now we could just use a field of type string. But a typical user obviously
> wants to find the document without typing in the whole string, and they
> also want the search to be case-insensitive.
>
> * Whenever there is a query written out, that's not necessarily the query
> that the user typed themselves, but the query that is generated for them.
>
> So instead of type string we use type text. This will tokenize our field,
> so we now have the tokens: Hello, World, Wide and Web.
> Now a user can search for "woRLd hELlo" (tokens are found in any order,
> and in random case) and they will still find the document.
> That's pretty good already. But in our case we want to allow incomplete
> searches to find documents as well.
> For that we thought that the way to go is to use the wildcard "*".
> This normally works just fine:
>
> "content":(Hello wor*)
>
> The user types in "hello wor" and, before finishing their sentence, the
> document is found.
>
> But now the user searches for world-wide:
>
> "content":world-wide
>
> They will obviously find the document. But before the user finishes typing
> the second word, they would also expect the document to be found. However,
>
> "content":world-wid*
>
> does not work. Our supposition is that the query uses the same tokenizer
> that was used when indexing the content (that's why "world-wide" works the
> same as "world wide"). But the moment we use wildcards this assumption
> fails. Somehow Solr is no longer able to tokenize the query.
> Now of course we could just replace all special characters with a space.
> But this
>
> What are we doing wrong? Is there a gotcha that we need to take into
> account when using wildcards with special characters?
>
> With kind regards
>
> Dario Viva
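The behavior described in the quoted mail can be reproduced with a toy simulation. This is not Solr's actual code; a simple letter tokenizer stands in for the field's analyzer. The point it illustrates: a plain query term is analyzed just like the indexed content, but a wildcard term is passed through as a single unanalyzed pattern, so "world-wid*" is compared against whole tokens and never matches.

```python
import fnmatch
import re

def analyze(text: str) -> list[str]:
    """Stand-in for the field's analyzer: split on non-letters, lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

# Tokens produced when the example document was indexed.
index_tokens = analyze("Hello World-Wide-Web")  # ['hello','world','wide','web']

def matches(query: str) -> bool:
    if "*" in query:
        # Wildcard terms skip analysis: the raw pattern (only lowercased)
        # is matched against each indexed token as a whole.
        pattern = query.lower()
        return any(fnmatch.fnmatchcase(tok, pattern) for tok in index_tokens)
    # Plain queries are analyzed like the content: every resulting token
    # must be present, in any order.
    return all(t in index_tokens for t in analyze(query))

print(matches("world-wide"))   # True:  analyzed into ['world', 'wide']
print(matches("wor*"))         # True:  pattern 'wor*' matches token 'world'
print(matches("world-wid*"))   # False: the unanalyzed pattern 'world-wid*'
                               #        matches no single token
```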
