On 12/13/06, Wade Leftwich <[EMAIL PROTECTED]> wrote:
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
> If you index a text field containing this sentence:
>   A sentence containing CamelCase words by [EMAIL PROTECTED] is found
>   at StudlyCaps.org
> The document will be found by searching for "camelcase" but not for
> "[EMAIL PROTECTED]" or "studlycaps.org".
> This happens with the Standard or the DisMax query handler.
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.

It's your text analysis configuration.
The WordDelimiterFilter is doing this... it's so "CamelCase" can be
found searching for "camelcase", "camel-case" or "camel case".
It does this by detecting all the word parts and then indexing them
separately as well as catenated together. So "CamelCase" is indexed as
both "camelcase" and "camel case".
When searching, the WordDelimiterFilter is configured to split only,
so "camelcase", "camel-case", and "camel case" will all match.
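
To make the index-side versus search-side difference concrete, here is a
rough toy model in Python. It is not Solr's code: the function names, the
regular expressions, and the "every query token must be indexed" matching
rule are all simplifying assumptions (real Solr also tracks token positions
and builds phrase queries). It only shows why a lowercase query that still
contains a delimiter, like the "studlycaps.org" case from the question,
can miss.

```python
import re

def split_parts(token):
    """Toy word-delimiter split: break on non-alphanumerics, then on
    lower-to-upper case changes, and lowercase the pieces."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        parts.extend(re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", chunk))
    return [p.lower() for p in parts]

def index_tokens(text):
    """Index side (assumed): keep each part AND the catenation of all parts."""
    tokens = set()
    for word in text.split():
        parts = split_parts(word)
        tokens.update(parts)
        if len(parts) > 1:
            tokens.add("".join(parts))
    return tokens

def query_tokens(query):
    """Search side (assumed): split only, no catenation."""
    return [p for word in query.split() for p in split_parts(word)]

def matches(text, query):
    """Simplified match: every query token must exist in the index."""
    indexed = index_tokens(text)
    wanted = query_tokens(query)
    return bool(wanted) and all(t in indexed for t in wanted)

DOC = "CamelCase words at StudlyCaps.org"

print(matches(DOC, "camelcase"))       # True: catenated form was indexed
print(matches(DOC, "camel-case"))      # True: splits into camel + case
print(matches(DOC, "camel case"))      # True
print(matches(DOC, "StudlyCaps.org"))  # True: splits into studly + caps + org
print(matches(DOC, "studlycaps.org"))  # False: "studlycaps" was never indexed
```

In this model the index holds "studly", "caps", "org" and "studlycapsorg",
but the lowercase query only splits at the "." and asks for "studlycaps",
which is none of those.
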
When it hits something like [EMAIL PROTECTED], it would index it as
"upanddownmysitecom" and "up and down mysite com"
On the search side, a search of "[EMAIL PROTECTED]" is broken into
"upanddown mysite com" which doesn't match anything indexed.
There are a number of options, not limited to:
- Create a new fieldtype and throw out the WordDelimiterFilter. The
  current "text" field type is for demonstration purposes only anyway.
  Solr, like Lucene, is meant to be customized.
- If you want to keep the camel-case flexibility, but not across "."
  and "-", then try using a letter tokenizer to throw away the
  non-letter characters first.
- Create a specific filter for email or website addresses if no
  combination of existing filters does what you want.
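
As a rough illustration of the second option, the toy Python model below
tokenizes on non-letters first and only then applies camel-case splitting
and catenation within each token. Again, this is a sketch under assumptions,
not Solr's implementation: the function names and regular expressions are
made up for the example, matching ignores positions, and a letter-only
tokenizer drops digits entirely.

```python
import re

def letter_tokenize(text):
    """Keep only runs of letters; anything else is a token boundary."""
    return re.findall(r"[A-Za-z]+", text)

def camel_parts(token):
    """Split a letters-only token on lower-to-upper case changes."""
    return [p.lower() for p in re.findall(r"[A-Z]+[a-z]*|[a-z]+", token)]

def index_tokens(text):
    """Index side (assumed): parts plus their catenation, per letter token."""
    tokens = set()
    for tok in letter_tokenize(text):
        parts = camel_parts(tok)
        tokens.update(parts)
        tokens.add("".join(parts))
    return tokens

def query_tokens(query):
    """Search side (assumed): same tokenizer, split only."""
    return [p for tok in letter_tokenize(query) for p in camel_parts(tok)]

def matches(text, query):
    """Simplified match: every query token must exist in the index."""
    indexed = index_tokens(text)
    wanted = query_tokens(query)
    return bool(wanted) and all(t in indexed for t in wanted)

DOC = "found at StudlyCaps.org"

print(matches(DOC, "studlycaps.org"))  # True: "studlycaps" and "org" are indexed
print(matches(DOC, "StudlyCaps.org"))  # True
print(matches(DOC, "studly caps"))     # True: camel-case flexibility is kept
print(matches(DOC, "studlycapsorg"))   # False: no catenation across "."
```

The last line is the trade-off: catenation no longer crosses "." or "-",
so only the per-word camel-case variants match.
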
Play around with the analysis tool on the admin page; it will help you
understand what's going on.
-Yonik