Hi Dario,

String types are good for exact match and faceting; text types do some tokenization. Either way, star searches are expensive and don't give good results, as you have found. Your supposition is right, by the way: a wildcard term mostly bypasses the query-time analysis chain, which is why "world-wid*" is not split on the hyphen the way "world-wide" is.

Since you're adding the star on the search side only anyway (it's like a poor man's autocomplete), it sounds like you want an actual autocomplete solution. Try ngrams. This should get you started: https://blog.andornot.com/blog/advanced-autocomplete-with-solr-ngrams/
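As a rough sketch of the idea (the field type name and gram sizes below are made up, not taken from that post; adapt them to your schema), you apply an edge-ngram filter at index time only, so that typed prefixes match indexed terms directly with no "*" at all:

```xml
<!-- Illustrative sketch of an autocomplete field type in schema.xml. -->
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- StandardTokenizer splits "World-Wide-Web" into world / wide / web -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every prefix of each token: "wide" -> "wi", "wid", "wide" -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No ngram filter at query time: the user's prefix "wid" matches
         the indexed gram "wid" as an ordinary term. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field like this, a query such as content:(world-wid), with no wildcard, goes through the normal query analyzer and becomes the terms "world" and "wid", both of which exist in the index as grams, so the document is found mid-word.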
best,
Robi

On Wed, Jan 24, 2024 at 12:11 AM <[email protected]> wrote:

> Hello all,
>
> We have fields on documents that contain special characters (e.g. "-",
> "@" or "/").
> When we search on such a field we encounter some weird behavior.
> Hopefully you can help.
>
> Suppose we have the following document:
>
> {
>   "id": "id_number",
>   "content": "Hello World-Wide-Web"
> }
>
> Now we want to search for substrings in "content".
> When a user types "Hello World-Wide-Web" we obviously want to find the
> document. This works flawlessly.
>
> We could simply use a field of type string.
> But a typical user also wants to find the document without typing in
> the whole string, and they want the search to be case insensitive.
>
> * Whenever a query is written out below, it is not necessarily the
> query that the user typed themselves, but the query that is generated for
> them.
>
> So instead of type string we use type text.
> This will tokenize our field, giving us the tokens: Hello, World, Wide
> and Web.
> Now a user can search for "woRLd hELlo" (tokens in any order, in any
> case) and they will still find the document.
> That's pretty good already. But in our case we want incomplete
> searches to find documents as well.
> For that we thought the way to go is the wildcard "*".
> This normally works just fine:
>
> "content":(Hello wor*)
>
> The user types "hello wor" and, before finishing their sentence, the
> document is found.
> But now the user searches for world-wide:
>
> "content":world-wide
>
> They will obviously find the document. But before the user has finished
> typing the second word, they would also expect the document to be found.
> Yet
>
> "content":world-wid*
>
> does not work. Our supposition is that the query uses the same
> tokenizer that was used when indexing the content.
> (That's why "world-wide" works the same as "world wide".)
> But the moment we use wildcards this assumption fails. Somehow Solr is no
> longer able to tokenize the query.
> Now of course we could just replace all special characters with a space.
> But this
>
> What are we doing wrong? Is there a gotcha that we need to take into
> account when using wildcards with special characters?
>
> With kind regards,
>
> Dario Viva
