Oh, and yet another way to get around it (with its own trade-offs) is to use something like the textTight fieldtype in the example schema.xml, which catenates all word parts in both the index analyzer and the query analyzer.
This would index as "upanddownmysitecom" and allow the following queries to match: "[EMAIL PROTECTED]", "[EMAIL PROTECTED]/com", "[EMAIL PROTECTED]".

The downside is that it would *not* allow "upanddown" or "UpAndDown" to match.

-Yonik

On 12/14/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
On 12/13/06, Wade Leftwich <[EMAIL PROTECTED]> wrote:
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
>
> If you index a text field containing this sentence:
>
> A sentence containing CamelCase words by [EMAIL PROTECTED] is found
> at StudlyCaps.org
>
> The document will be found by searching for "camelcase" but not for
> "[EMAIL PROTECTED]" or "studlycaps.org".
>
> This happens with the Standard or the DisMax query handler.
>
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.

It's your text analysis configuration. The WordDelimiterFilter is doing this... it's so "CamelCase" can be found searching for "camelcase", "camel-case" or "camel case".

It does this by detecting all the word parts and then indexing them separately as well as all catenated. So "CamelCase" is indexed as both "camelcase" and "camel case". When searching, the WordDelimiterFilter is configured to split only, so "camelcase", "camel-case", and "camel case" will all match.

When it hits something like [EMAIL PROTECTED], it would index it as "upanddownmysitecom" and "up and down mysite com". On the search side, a search of "[EMAIL PROTECTED]" is broken into "upanddown mysite com", which doesn't match anything indexed.

There are a number of options, not limited to:
- Create a new fieldtype and throw out the WordDelimiterFilter... the current "text" field type is for demonstration purposes only anyway. Solr, like Lucene, is meant to be customized.
- If you want to keep the camel-case flexibility, but not across "." and "-", then try using a letter tokenizer to throw away the non-letter characters first.
- Create a specific filter for email or website addresses if no combination of existing filters does what you want.

Play around with the analysis tool on the admin page; it will help you understand what's going on.

-Yonik
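
For anyone who wants to see the index/query mismatch described in this thread without firing up Solr, here is a toy Python model of it. It is only a sketch and not Solr's WordDelimiterFilter: the function names (word_parts, index_terms, query_terms, tight_terms, matches) are made up for illustration, the splitter only looks at case changes and non-alphanumeric characters, and matching ignores token positions. It uses the unredacted "StudlyCaps.org" example from the original question.

```python
import re


def word_parts(text):
    """Split on non-alphanumerics and on lower-to-upper case changes, then lowercase."""
    # Insert a space at each lower-to-upper boundary: "CamelCase" -> "Camel Case".
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return [p.lower() for p in re.split(r"[^A-Za-z0-9]+", spaced) if p]


def index_terms(text):
    """Index side of the toy 'text' type: every word part plus all parts catenated."""
    parts = word_parts(text)
    return set(parts) | {"".join(parts)}


def query_terms(text):
    """Query side of the toy 'text' type: split only, no catenation."""
    return word_parts(text)


def tight_terms(text):
    """Toy 'textTight' type: catenate all parts, used for both index and query."""
    return ["".join(word_parts(text))]


def matches(indexed, query, index_fn, query_fn):
    """Assumed matching rule: every query term must be present in the index terms."""
    index = set(index_fn(indexed))
    return all(term in index for term in query_fn(query))


if __name__ == "__main__":
    for q in ("camelcase", "camel-case", "camel case"):
        print("text     ", q, matches("CamelCase", q, index_terms, query_terms))
    print("text     ", "studlycaps.org",
          matches("StudlyCaps.org", "studlycaps.org", index_terms, query_terms))
    for q in ("studlycaps.org", "studlycaps"):
        print("textTight", q,
              matches("StudlyCaps.org", q, tight_terms, tight_terms))
```

In this model the query "studlycaps.org" splits into "studlycaps" and "org", and "studlycaps" was never indexed on its own, which is the same reason the search fails in the thread. Catenating on both sides fixes that query but loses the partial match on "studlycaps", which is the trade-off noted for textTight.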