Filtering out special characters sounds like a good idea, or possibly escaping some of them. I definitely want to avoid brittleness.
Right now I'm passing the query relatively "as is", which means users can type "title:foo" to find documents that have "foo" in the "title" field. But a query for just a colon (":") throws an error (org.apache.solr.search.SyntaxError: Cannot parse ':'), so obviously I need to do more processing of the query before I pass it to Solr. At a minimum I need to escape that colon.

Is there some general advice on doing sanity checks or escaping special characters in user-supplied queries before you pass them to Solr? Is it documented in the wiki?

I'm using Solrj but I imagine the advice applies to everyone.

Phil

p.s. I noticed a note saying "These characters are part of the query syntax and must be escaped" at https://github.com/apache/lucene-solr/blob/lucene_solr_4_7_0/solr/solrj/src/java/org/apache/solr/client/solrj/util/ClientUtils.java#L231 and learned of this part of the code from http://lucene.472066.n3.nabble.com/What-is-the-full-list-of-Solr-Special-Characters-td4094053.html

On Tue, Apr 8, 2014 at 10:14 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> I'd seriously consider filtering these characters out when you index
> and search, this is quite likely very brittle. The same item, say from
> two different vendors, might have D (E & F) or D E & F. If you just
> stripped all of the non alpha-num characters you'd likely get less
> brittle results.
>
> You know your problem domain better than I do though, so whatever
> makes most sense.
>
> Best,
> Erick
>
> On Tue, Apr 8, 2014 at 6:55 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> Hi Peter,
>>
>> TermQueryParser is useful in your case.
>> q={!term f=categories_string}A|B|D (E & F)
>>
>>
>>
>> On Tuesday, April 8, 2014 4:37 PM, Peter Kirk <p...@alpha-solutions.dk>
>> wrote:
>> Hi
>>
>> How to search for Solr special characters like '(' and '&'?
>>
>> I am trying to execute searches for "products" in my Solr (3.6.1) index,
>> based on the "categories" to which these products belong.
>> The categories are stored in a multistring field for the products, and are
>> hierarchical, and are fed to the index like:
>> A
>> A|B
>> A|B|C
>>
>> So this product would actually belong to category named "C", which is a
>> child of "B", which is a child of "A".
>>
>> I am able to execute queries for simple category names like this (eg.
>> fq=categories_string:A|B|C).
>>
>> But some categories have Solr special characters in their names, like: "D (E
>> & F)"
>> (Real example: "Power supplies (Battery and Solar)").
>>
>> A query like fq=categories_string:A|B|D (E & F) simply fails.
>> But even if I try
>> fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
>> (where I try to escape the special characters) does not find the products in
>> this category, and actually finds other unrelated categories.
>>
>> What am I doing wrong?
>>
>> Thanks,
>> Peter
>>

--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin
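
For reference, here is a rough sketch of the escaping approach asked about above: put a backslash in front of every character that is part of the query syntax. SolrJ ships a ClientUtils.escapeQueryChars(String) method at the link in the p.s.; the class below is a hypothetical stand-alone helper (the name QueryEscaper and the SPECIAL constant are made up for illustration), and its character list is an assumption that should be checked against the Solr version in use.

```java
// Hypothetical helper, not Solr's own code: a stand-alone sketch of
// backslash-escaping query-syntax characters in user input.
final class QueryEscaper {

    // Characters treated as query syntax in this sketch. This list is an
    // assumption; verify it against the Solr version you run.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    private QueryEscaper() {}

    /** Returns the input with each special or whitespace character preceded by a backslash. */
    static String escapeQueryChars(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With this sketch, a lone ":" becomes "\:" and "A|B|D (E & F)" becomes "A\|B\|D\ \(E\ \&\ F\)". Note the trade-off: escaping the whole input also escapes the colon in "title:foo", so fielded searches typed by users stop working. Escaping only the value part, or only falling back to escaping after a parse error, are possible alternatives.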