Hi there,

You should use LowerCaseTokenizerFactory, as you point out yourself. As far as I know, the StandardTokenizer "recognizes email addresses and internet hostnames as one token", which is not what you want here. I guess you want an email address, say "[EMAIL PROTECTED]", to be split into four tokens (average, joe, apache, org, or something like that), so that a search for "joe" or "average j*" would match. To do so, you could use the WordDelimiterFilterFactory and split on intra-word delimiters (I think the default is to split on non-alphanumeric characters).
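To illustrate the kind of splitting I mean, here is a rough Python sketch. It is not Solr code and does not reproduce the exact behaviour of WordDelimiterFilterFactory; it only mimics "lowercase, then split on non-alphanumeric characters". The address and the function names are made up for the example.

```python
import re


def split_on_delimiters(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    # Empty strings can appear at the edges of the split, so filter them out.
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def matches_prefix(tokens, prefix):
    """Return True if any token starts with the (lowercased) prefix."""
    prefix = prefix.lower()
    return any(tok.startswith(prefix) for tok in tokens)


# Hypothetical address, used only for illustration.
tokens = split_on_delimiters("Jane.Doe@example.com")
print(tokens)
print(matches_prefix(tokens, "doe"))
```

Once the address is indexed as separate tokens like this, a term or prefix query on any single part of it can match, which is not possible when the whole address is kept as one token.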

Take a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more info on tokenizers and filters.

cheers,
 Aleks

On Tue, 18 Nov 2008 08:35:31 +0100, Carsten L <[EMAIL PROTECTED]> wrote:


Hello.

The data:
I have a dataset containing ~500,000 documents.
In each document there is an email, a name and a user ID.

The problem:
I would like to be able to search in it, but the search should behave like
the MySQL "LIKE" operator.

So when a user enters the search term: "carsten", then the query looks like:
        "name:(carsten) OR name:(carsten*) OR email:(carsten) OR
email:(carsten*) OR userid:(carsten) OR userid:(carsten*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen
Carsten
CARSTEN
etc.

And when the user enters the term: "carsten l" the query looks like:
        "name:(carsten l) OR name:(carsten l*) OR email:(carsten l) OR
email:(carsten l*) OR userid:(carsten l) OR userid:(carsten l*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen

Or, written in MySQL syntax: "... WHERE `name` LIKE 'carsten%' OR
`email` LIKE 'carsten%' OR `userid` LIKE 'carsten%'..."
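
For reference, a query of the shape shown above could be assembled roughly like this. It is only a Python sketch: the function name build_query and the FIELDS tuple are illustrative, and no escaping of special query characters is attempted.

```python
# Fields searched for every term, in the order used in the examples above.
FIELDS = ("name", "email", "userid")


def build_query(term, fields=FIELDS):
    """Build 'field:(term) OR field:(term*)' clauses for each field."""
    clauses = []
    for field in fields:
        clauses.append(f"{field}:({term})")   # exact match on the term
        clauses.append(f"{field}:({term}*)")  # prefix match on the term
    return " OR ".join(clauses)


print(build_query("carsten"))
```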

I know that I need to use the "solr.LowerCaseTokenizerFactory" on my name
and email fields, to ensure case-insensitive behavior.
The problem seems to be the wildcards and the whitespace.



--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
