Queries with wildcards
Hi,

I figure I'm probably being stupid, but I can't seem to get queries (using the standard request handler) using wildcards to work.

For example, using the latest build (Aug 18) and the example documents, a search for Enterprise matches the SOLR1000 document, but a search for Enter* does not.

I guess I'd kind of assumed that wildcards would work (being part of the linked-to Lucene query syntax), but I'd never actually tested it (or not with anything other than brain-dead searches that probably worked due to stemming rather than wildcards).

So, does it work, and I'm just totally hopeless/incompetent (it is, after all, a Friday afternoon)? Or is there some good reason why wildcards don't work?

Thanks,
Andrew
Re: Queries with wildcards
: For example, using the latest build (Aug 18) and the example documents, a search for
: Enterprise matches the SOLR1000 document, but a search for Enter* does not.

try searching for: enter*

...this is a somewhat long-standing annoyance with Lucene, that exists because there's really no good way to deal with it -- when using wildcards, the Lucene QueryParser does not analyze the input -- if you ask for a wildcard search on Enter*, a PrefixQuery is constructed with that exact prefix, case and all. But in this case, the default search field is "text", which uses the LowerCaseFilter -- so you'll never get a match on a prefix with an uppercase character.

The reason that the QueryParser doesn't attempt to analyze the input you give it when doing a PrefixQuery is that it might get analyzed in a completely different way than the words that prefix "logically" matches on. Consider for example using a Porter stemmer on "enterprise" -- that produces "enterpris" ... but if you asked for a prefix search for "enterpris*", and the query parser analyzed "enterpris", then the PorterStemmer would produce "enterpri".

The problem gets even worse when dealing with mid-word wildcards like "Ent*prise" ... how can the QueryParser even approach trying to analyze that input -- the * certainly isn't part of the text; should it split it up into two words, analyze them separately, and then rejoin them with a star in the middle?

In general, wildcard queries are "hard" and only make sense on fields that have very simplistic index-time analyzers (like WhitespaceAnalyzer) -- even then you might want to use the LowerCaseFilter and override the QueryParser's getPrefixQuery and getWildcardQuery methods to do things like lowercase the input string for certain fields, so you don't get annoying situations like enter* not matching Enterprise.

-Hoss
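The case mismatch described above can be illustrated with a small Python sketch. This is not Lucene code: it assumes a toy index whose analyzer only splits on whitespace and lowercases, and a prefix lookup that uses the prefix exactly as typed. The function names (analyze, build_index, prefix_search) and the document text are made up for illustration.

```python
from collections import defaultdict


def analyze(text):
    """Toy index-time analyzer: whitespace split plus lowercasing."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def prefix_search(index, prefix):
    """Match indexed terms against the prefix exactly as given (no analysis)."""
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


# Hypothetical document text, used only for this illustration.
docs = {"SOLR1000": "Solr, the Enterprise Search Server"}
index = build_index(docs)

print(prefix_search(index, "Enter"))  # no hits: the index only holds "enterprise"
print(prefix_search(index, "enter"))  # hits SOLR1000
```

Because the indexed term is "enterprise", the unanalyzed prefix "Enter" can never match it, while "enter" does.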
Re: Queries with wildcards
Chris Hostetter wrote:

: : For example, using the latest build (Aug 18) and the example documents, a search for
: : Enterprise matches the SOLR1000 document, but a search for Enter* does not.
:
: try searching for: enter*

Ah-ha!

: ...this is a somewhat long-standing annoyance with Lucene, that exists because there's really no good way to deal with it -- when using wildcards, the Lucene QueryParser does not analyze the input -- if you ask for a wildcard search on Enter*, a PrefixQuery is constructed with that exact prefix, case and all. But in this case, the default search field is "text", which uses the LowerCaseFilter -- so you'll never get a match on a prefix with an uppercase character.
:
: In general, wildcard queries are "hard" and only make sense on fields that have very simplistic index-time analyzers (like WhitespaceAnalyzer) -- even then you might want to use the LowerCaseFilter and override the QueryParser's getPrefixQuery and getWildcardQuery methods to do things like lowercase the input string for certain fields, so you don't get annoying situations like enter* not matching Enterprise.

Thanks, I think I understand now. Given that I'm already doing some processing of the user input before passing the query on to Solr, I can convert the query to lowercase before passing it on.

-Andrew
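A minimal sketch of that kind of client-side preprocessing, in Python, might look like the following. It is an assumption about how the step could be done, not what was actually implemented: the helper name lowercase_wildcard_terms is hypothetical, and it only lowercases whitespace-delimited tokens that contain * or ?, leaving any field name before a colon untouched.

```python
import re

# A whitespace-delimited token that contains a wildcard character (* or ?).
WILDCARD_TOKEN = re.compile(r"\S*[*?]\S*")


def _lower_term(match):
    """Lowercase the term part of a token, keeping any 'field:' prefix as is."""
    token = match.group(0)
    field, sep, term = token.rpartition(":")
    return field + sep + term.lower()


def lowercase_wildcard_terms(query):
    """Lowercase only the wildcard terms in a query string.

    Limitation of this sketch: it does not understand quoted phrases,
    so a '?' inside a quoted phrase is treated like a wildcard token.
    """
    return WILDCARD_TOKEN.sub(_lower_term, query)


print(lowercase_wildcard_terms("Enter* AND Search"))
print(lowercase_wildcard_terms("Name:Ent*prise"))
```

Lowercasing only the wildcard terms, rather than the whole query string, keeps operators such as AND, OR and NOT in upper case, which the Lucene query syntax requires for them to be treated as operators.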