Filtering out special characters sounds like a good idea, or possibly
escaping some of them. I definitely want to avoid brittleness.

Right now I'm passing the query more or less "as is", which means users
can type "title:foo" to find documents that have "foo" in the "title"
field. But a query consisting of just a colon (":") throws an error
(org.apache.solr.search.SyntaxError: Cannot parse ':'), so I clearly
need to process the query further before I pass it to Solr, for example
by escaping that colon.

Is there any general advice on sanity-checking user-supplied queries,
or escaping special characters in them, before passing them to Solr?
Is it documented in the wiki? I'm using SolrJ but I imagine the
advice applies to everyone.

Phil

p.s. I noticed a note saying "These characters are part of the query
syntax and must be escaped" at
https://github.com/apache/lucene-solr/blob/lucene_solr_4_7_0/solr/solrj/src/java/org/apache/solr/client/solrj/util/ClientUtils.java#L231
and learned of this part of the code from
http://lucene.472066.n3.nabble.com/What-is-the-full-list-of-Solr-Special-Characters-td4094053.html
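
For illustration, here is a minimal standalone sketch of the kind of
escaping that note describes. The class name QueryEscaper and the exact
character list are my assumptions, not copied from SolrJ; the
authoritative list is in the ClientUtils source linked above, and SolrJ
users would normally call ClientUtils.escapeQueryChars rather than
maintain their own copy.

```java
public final class QueryEscaper {
    // Assumed set of query-syntax characters (hypothetical list for this
    // sketch); verify it against the ClientUtils source for your Solr version.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    private QueryEscaper() {}

    /**
     * Prefixes every special or whitespace character with a backslash so
     * the query parser treats the whole input as literal text.
     */
    public static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(":"));             // prints \:
        System.out.println(escape("A|B|D (E & F)")); // prints A\|B\|D\ \(E\ \&\ F\)
    }
}
```

Note the trade-off: escaping everything also escapes the colon in
"title:foo", so fielded searches typed by users would stop working
unless the field prefix is handled separately before escaping.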

On Tue, Apr 8, 2014 at 10:14 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> I'd seriously consider filtering these characters out when you index
> and search, this is quite likely very brittle. The same item, say from
> two different vendors, might have D (E & F) or D E & F. If you just
> stripped all of the non alpha-num characters you'd likely get less
> brittle results.
>
> You know your problem domain better than I do though, so whatever
> makes most sense.
>
> Best,
> Erick
>
> On Tue, Apr 8, 2014 at 6:55 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> Hi Peter,
>>
>> TermQueryParser is useful in your case.
>> q={!term f=categories_string}A|B|D (E & F)
>>
>>
>>
>> On Tuesday, April 8, 2014 4:37 PM, Peter Kirk <p...@alpha-solutions.dk> 
>> wrote:
>> Hi
>>
>> How to search for Solr special characters like '(' and '&'?
>>
>> I am trying to execute searches for "products" in my Solr (3.6.1) index, 
>> based on the "categories" to which these products belong.
>> The categories are stored in a multistring field for the products, and are 
>> hierarchical, and are fed to the index like:
>> A
>> A|B
>> A|B|C
>>
>> So this product would actually belong to the category named "C", which is a 
>> child of "B", which is a child of "A".
>>
>> I am able to execute queries for simple category names like this (eg. 
>> fq=categories_string:A|B|C).
>>
>> But some categories have Solr special characters in their names, like: "D (E 
>> & F)"
>> (Real example: "Power supplies (Battery and Solar)").
>>
>> A query like fq=categories_string:A|B|D (E & F) simply fails.
>> But even when I try
>> fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
>> (where I try to escape the special characters), it does not find the products in 
>> this category, and actually finds other unrelated categories.
>>
>> What am I doing wrong?
>>
>> Thanks,
>> Peter
>>



-- 
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin
