Hello all

We have document fields that contain special characters (e.g. "-", "@" or
"/").
When we search on such a field, we encounter some unexpected behavior.
Hopefully you can help.

Let's suppose we have the following document:
{
      "id":"id_number",
      "content":"Hello World-Wide-Web"
}

Now we want to search for substrings in content.
When a user types "Hello World-Wide-Web" we obviously want to find the 
document. This works flawlessly.

For that alone we could just use a field of type string.
But a typical user also wants to find the document without typing in the
whole string, and they want the search to be case insensitive.


Note: whenever a query is written out below, it is not necessarily what the
user typed themselves, but the query that is generated for them.

So instead of type string we use type text.
This will tokenize our field, so we now have the tokens: Hello, World, Wide
and Web.
A user can now search for "woRLd hELlo" (tokens in any order and in any
case) and they will still find the document.
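
To make the tokenized behavior concrete, here is a minimal Python sketch. It
is only a simplified model, not Solr's actual analyzer: it assumes the text
field lowercases its input, splits on every character that is not an ASCII
letter or digit, and that every query term must be present in the document.
The names analyze and matches_all_terms are made up for this illustration.

```python
import re

def analyze(text):
    """Simplified stand-in for a text analyzer: lowercase, then split on
    anything that is not an ASCII letter or digit (assumption)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def matches_all_terms(query, document):
    """True if every analyzed query term is one of the document's tokens."""
    doc_tokens = set(analyze(document))
    return all(term in doc_tokens for term in analyze(query))

document = "Hello World-Wide-Web"
print(analyze(document))                         # tokens stored in the index
print(matches_all_terms("woRLd hELlo", document))  # order and case ignored
print(matches_all_terms("world-wide", document))   # hyphen acts like a space
```

In this model the query goes through the same analysis as the document, which
is why order, case and the hyphen do not matter.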
That's pretty good already. But in our case we want to allow incomplete 
searches to find documents as well.
For that we thought the way to go is to use the wildcard "*".
This normally works just fine:
"content":(Hello wor*)
The user types in "hello wor" and before finishing their sentence, the document 
is found.
But now the user searches for world-wide.
"content":world-wide
They will obviously find the document. But they would also expect the
document to be found before they have finished typing the second word.
But
"content":world-wid*
does not work. Our assumption is that the query is run through the same
tokenizer that was used when indexing the content (which is why "world-wide"
works the same as "world wide").
But the moment we use wildcards, this assumption fails. Somehow Solr is no
longer able to tokenize the query.
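
One way to picture what we think is happening is the following Python sketch.
It rests on an assumption on our side, not on confirmed Solr internals: the
wildcard term is lowercased but NOT split into tokens, and is then compared as
a single prefix against each indexed token. The token list and the function
prefix_matches are hypothetical and only serve the illustration.

```python
# Tokens we assume the index holds for "Hello World-Wide-Web".
INDEXED_TOKENS = ["hello", "world", "wide", "web"]

def prefix_matches(wildcard_term, tokens):
    """Treat 'abc*' as one prefix: lowercase it, but do not tokenize it
    (assumed behavior for wildcard terms)."""
    prefix = wildcard_term.rstrip("*").lower()
    return [t for t in tokens if t.startswith(prefix)]

print(prefix_matches("wor*", INDEXED_TOKENS))        # finds the token "world"
print(prefix_matches("world-wid*", INDEXED_TOKENS))  # no token has a hyphen
```

Under this assumption no indexed token can ever start with "world-wid",
because the hyphen was removed at index time, which would explain the empty
result.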
Now of course we could just replace all special characters with a space. But 
this
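
For reference, the replacement idea mentioned above could be sketched roughly
like this in Python. This is only an illustration of that idea, not something
we consider a finished solution: the function name terms_with_trailing_wildcard
is made up, it treats every character that is not an ASCII letter or digit as
a separator, and it appends the wildcard only to the last piece.

```python
import re

def terms_with_trailing_wildcard(user_input):
    """Split the input on special characters and put '*' on the last piece
    only (hypothetical helper, ASCII letters and digits assumed)."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", user_input) if p]
    if not parts:
        return ""
    return " ".join(parts[:-1] + [parts[-1] + "*"])

print(terms_with_trailing_wildcard("world-wid"))  # world wid*
print(terms_with_trailing_wildcard("Hello wor"))  # Hello wor*
```

The resulting terms would then go inside the parentheses of the generated
query, in the same way as "content":(Hello wor*).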

What are we doing wrong? Is there a gotcha that we need to take into account 
when using wildcards with special characters?

With kind regards,

Dario Viva
