Re: Howto search for § character

Erick Erickson Thu, 07 Dec 2017 07:56:02 -0800

The Analysis page in the admin UI (select your core first) will show
you exactly what happens to the text at each stage. The Schema Browser
will also show you what's actually in the index, i.e. the terms as they
appear after the entire analysis chain has run. Between them, those two
pages will tell you definitively what happens to that character.


Best,
Erick

On Thu, Dec 7, 2017 at 7:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 12/6/2017 9:09 AM, Bernd Schmidt wrote:
>> we have defined a field named "_text_" for a full text search based on 
>> field-type "text_general":
>> <field name="_text_" type="text_general" multiValued="true" indexed="true" 
>> stored="false"/>
>>
>> When trying to search for the "§" character, we have strange behaviour:
>>
>> q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes 
>> where entityClass:StructureNodeImpl)
>> q=_text_:§ => numFound:0
>>
>> How can we search for the occurrence of the § character?
>
> We can't see how your "text_general" type is defined, but if it is
> anything like the same type included in Solr examples, then it probably
> is using StandardTokenizerFactory.  It appears that this tokenizer
> treats the § character as a word break and removes it from the token
> stream.  Most likely, the reason the search with the extra clause works
> is that the part with that character is removed, and the query ends up
> ONLY being the extra clause.
>
> You will need a fieldType with an analysis chain that doesn't remove the
> § character, and it's almost guaranteed that you'll need to reindex.
> Unless you do that, searching for that character is not going to be
> possible.
>
> Also keep in mind that searching for a single character may not do what
> you expect if that character is not a single word in the text, and that
> certain filters can end up trimming out really short terms like that.
>
> Thanks,
> Shawn
>
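
To illustrate the kind of fieldType Shawn describes above, here is a
minimal sketch, not a tested configuration. It assumes that splitting on
whitespace only is acceptable for your data. The type name
"text_ws_symbols" is made up for this example; the tokenizer and filter
classes are the stock Solr ones.

```xml
<!-- Hypothetical type name. WhitespaceTokenizerFactory splits on
     whitespace only, so "§" is kept in the token stream. -->
<fieldType name="text_ws_symbols" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The existing field, pointed at the new type; a full reindex is
     needed after this change. -->
<field name="_text_" type="text_ws_symbols" multiValued="true" indexed="true"
       stored="false"/>
```

With whitespace-only tokenization, punctuation stays attached to its
neighbours, so "§5," is indexed as the single term "§5,". A query for a
bare § would then match only where the character stands alone between
spaces, which is the caveat Shawn raises about single-character
searches.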
