Re: Case sensitivity on hostnames and email addresses

Yonik Seeley Wed, 13 Dec 2006 21:57:06 -0800

Oh, and yet another way to get around it (with its own trade-offs) is
to use something like fieldtype textTight in the example schema.xml,
which catenates all word parts in both the index analyzer and query
analyzer.


This would index as "upanddownmysitecom" and allow the following
queries to match:
"[EMAIL PROTECTED]", "[EMAIL PROTECTED]/com", "[EMAIL PROTECTED]"

The downside is that it would *not* allow "upanddown" or "UpAndDown" to match.
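
A rough Python model of the "catenate on both sides" idea, for illustration only. It is not Solr code: the helper name tight_token is made up, and the sample input is the StudlyCaps.org hostname from the original question rather than the address above.

```python
import re

def tight_token(text):
    """Join every alphanumeric run into one lowercase token.

    Hypothetical stand-in for an analyzer that catenates all word
    parts; the same function is applied at index and query time.
    """
    return "".join(re.findall(r"[A-Za-z0-9]+", text)).lower()

indexed = tight_token("StudlyCaps.org")

# Queries that collapse to the same single token match.
for query in ("studlycaps.org", "StudlyCaps/org", "studlycapsorg"):
    print(query, tight_token(query) == indexed)

# A query for only part of the name collapses to a different token.
print("studlycaps", tight_token("studlycaps") == indexed)
```

Since only one token is indexed per name, a partial query such as "studlycaps" has nothing to match against, which is the downside described above.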

-Yonik

On 12/14/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 12/13/06, Wade Leftwich <[EMAIL PROTECTED]> wrote:
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
>
> If you index a text field containing this sentence:
>
> A sentence containing CamelCase words by [EMAIL PROTECTED] is found
> at StudlyCaps.org
>
> The document will be found by searching for "camelcase" but not for
> "[EMAIL PROTECTED]" or "studlycaps.org".
>
> This happens with the Standard or the DisMax query handler.
>
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.

It's your text analysis configuration.
The WordDelimiterFilter is doing this... it's so "CamelCase" can be
found searching for "camelcase", "camel-case" or "camel case".
It does this by detecting all the word parts and then indexing them
separately as well as all catenated.  So "CamelCase" is indexed as
both "camelcase" and "camel case".
When searching, the WordDelimiterFilter is configured to split only,
so "camelcase", "camel-case", and "camel case" will all match.

When it hits something like [EMAIL PROTECTED], it would index it as
"upanddownmysitecom" and "up and down mysite com"
On the search side, a search of "[EMAIL PROTECTED]" is broken into
"upanddown mysite com" which doesn't match anything indexed.

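The index/query asymmetry can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the WordDelimiterFilter itself: the function names are hypothetical, token positions are ignored, and the inputs are the CamelCase and StudlyCaps.org examples from the original question.

```python
import re

# Simplified word-part detection (an assumption, not Solr's rules):
# a lowercase run optionally led by one capital, a run of capitals,
# or a run of digits. Everything else acts as a delimiter.
PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

def split_parts(text):
    """Return the lowercased word parts of text."""
    return [p.lower() for p in PART.findall(text)]

def index_tokens(text):
    """Index side: each part separately, plus all parts catenated."""
    parts = split_parts(text)
    return set(parts) | {"".join(parts)}

def query_tokens(text):
    """Query side: split only, no catenation."""
    return split_parts(text)

def matches(indexed_text, query):
    """True if every query token was indexed (positions ignored)."""
    indexed = index_tokens(indexed_text)
    return all(tok in indexed for tok in query_tokens(query))

for q in ("camelcase", "camel-case", "camel case"):
    print(q, matches("CamelCase", q))
print("studlycaps.org", matches("StudlyCaps.org", "studlycaps.org"))
```

In this model the three "CamelCase" queries match, while "studlycaps.org" does not: the query yields the token "studlycaps", but the index holds only "studly", "caps", "org" and "studlycapsorg".
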
There are a number of options, not limited to:
 - Create a new fieldtype and throw out the WordDelimiterFilter... the
   current "text" field type is for demonstration purposes only anyway.
   Solr, like Lucene, is meant to be customized.
 - If you want to keep the camel-case flexibility, but not across "."
   and "-", then try using a letter tokenizer to throw away the
   non-letter characters first.
 - Create a specific filter for email or website addresses if no
   combination of existing filters does what you want.

Play around with the analysis tool on the admin page; it will help you
understand what's going on.

-Yonik
