Hi there,
You should use LowerCaseTokenizerFactory as you point out yourself. As far
as I know, the StandardTokenizer "recognizes email addresses and internet
hostnames as one token". In your case, I guess you want an email address, say
"[EMAIL PROTECTED]", to be split into four tokens (average, joe, apache,
org, or something like that), which would allow a search for "joe" or
"average j*" to match. To do so, you could use the
WordDelimiterFilterFactory and split on intra-word delimiters (I think the
defaults here are non-alphanumeric chars).
Take a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for more info on tokenizers and filters.
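
To make the idea concrete, here is a rough Python sketch of the effect described above: split a value on non-alphanumeric characters, lowercase the parts, and check whether any part starts with a given prefix. This only simulates the behaviour and is not Solr code; the function names and the example address are made up for illustration.

```python
import re


def split_and_lowercase(text):
    """Split on runs of non-alphanumeric characters and lowercase each part."""
    return [part.lower() for part in re.split(r"[^0-9A-Za-z]+", text) if part]


def matches_prefix(tokens, prefix):
    """Return True if any token starts with the lowercased prefix (like 'jo*')."""
    prefix = prefix.lower()
    return any(token.startswith(prefix) for token in tokens)


# Hypothetical address, used only for illustration.
tokens = split_and_lowercase("Average.Joe@example.org")
print(tokens)
print(matches_prefix(tokens, "joe"))
print(matches_prefix(tokens, "aver"))
```

With the value split this way, a search for a single word or a word prefix can hit any part of the address, not just its beginning.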
cheers,
Aleks
On Tue, 18 Nov 2008 08:35:31 +0100, Carsten L <[EMAIL PROTECTED]> wrote:
Hello.
The data:
I have a dataset containing ~500.000 documents.
In each document there is an email, a name and a user ID.
The problem:
I would like to be able to search it, and the search should behave like
MySQL's "LIKE".
So when a user enters the search term: "carsten", then the query looks
like:
"name:(carsten) OR name:(carsten*) OR email:(carsten) OR
email:(carsten*) OR userid:(carsten) OR userid:(carsten*)"
Then it should match:
carsten l
carsten larsen
Carsten Larsen
Carsten
CARSTEN
etc.
And when the user enters the term: "carsten l" the query looks like:
"name:(carsten l) OR name:(carsten l*) OR email:(carsten l) OR
email:(carsten l*) OR userid:(carsten l) OR userid:(carsten l*)"
Then it should match:
carsten l
carsten larsen
Carsten Larsen
Or, written in MySQL syntax: "... WHERE `name` LIKE 'carsten%' OR
`email` LIKE 'carsten%' OR `userid` LIKE 'carsten%'..."
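
The query strings quoted above follow a fixed pattern: for each field, one exact clause and one clause with a trailing wildcard. A minimal Python sketch of a helper that builds such a string is shown below. The name build_like_query is hypothetical, the field names are the ones from the queries above, and the sketch assumes the term contains no characters that need escaping in the query syntax. It lowercases the term on the client side as a precaution, in case wildcard terms are not run through the field's analyzer.

```python
def build_like_query(term, fields=("name", "email", "userid")):
    """Build an OR query with an exact and a trailing-wildcard clause per field."""
    # Assumption: the term needs no escaping of query-syntax characters.
    term = term.strip().lower()
    clauses = []
    for field in fields:
        clauses.append(f"{field}:({term})")
        clauses.append(f"{field}:({term}*)")
    return " OR ".join(clauses)


print(build_like_query("Carsten"))
```

This only reproduces the query text; whether "carsten l*" behaves like LIKE 'carsten l%' still depends on how the fields are tokenized.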
I know that I need to use the "solr.LowerCaseTokenizerFactory" on my name
and email fields to ensure case-insensitive behavior.
The problem seems to be the wildcards and the whitespace.
--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no