Re: Howto search for § character

Tim Casey Thu, 07 Dec 2017 17:34:05 -0800

My last company we ended up writing a custom analyzer to handle
punctuation.  But this was for lucent 2 or 3.  That analyzer was carried
forward as we updated and was used for all human derived text.


Although now there are way better analyzers and way better ways to hook
them up, as noted above by Erick, We really cared about how this was done
and all of the work put into the analyzer paid off.

I would expect there to be an analyzer which would maintain punctuation
tokens for search.  One of the issues which comes up is if you want
multiple-runs of punctuation to be a single token or separate tokens.  So
what happens to "§!"  or "§?" or "?§", and in the case of things like
text/email what happens to "§!!!!".

In any event, my 2 pence worth....

tim

On Thu, Dec 7, 2017 at 10:00 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 12/7/2017 9:37 AM, Bernd Schmidt wrote:
> > Indeed, I saw in the analysis tab of the solr admin that the § char will
> be removed when using type text_general.
> > But in this use case we want to make a full text search like
> "_text_:§45" or "_text_:§*" to find words starting with §.
> > We need a text field here, not a string field!
> > What is your recommended way to deal with it?
> > Is it possible to remove the word break behaviour for the  § char?
> > Or is the best way to encode all § chars when indexing and searching?
>
> This character is classified by Unicode as punctuation:
>
> http://www.fileformat.info/info/unicode/char/00a7/index.htm
>
> Almost any example field type for full-text search that you're likely to
> encounter is going to be designed to split on punctuation and remove it
> from the token stream.  That's one of the most common things that
> full-text search engines do.
>
> You're going to need to design a new analysis chain that *doesn't* do
> this, apply the fieldType containing that analysis to your field,
> restart/reload, and reindex.
>
> Designing analysis chains is an art form, and tends to be one of the
> hardest parts of setting up a production Solr install.  It took me at
> least a month of almost constant work to settle on the schema design for
> the indexes that I maintain.  All of the "solr.TextField" types in my
> schema are completely custom -- none of the analysis chains in Solr
> examples are in that schema.
>
> Thanks,
> Shawn
>
>

Re: Howto search for § character

Reply via email to