I’ve been wanting a “free text” query parser for a while. We could build some cool stuff on that: auto-phrasing, entity extraction and weighting, CJK tokenization, …
For reference, here are some real-world user queries I have needed to deal with. These have exactly matched content.

* +/-
* .hack//Roots
* p=mv

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Apr 20, 2015, at 5:52 PM, Steven White <swhite4...@gmail.com> wrote:

> Hi Erick,
>
> I think you missed my point. My request is, Solr support a new URL
> parameter. If this parameter is set, then EVERYTHING in q is treated as
> raw text (i.e.: Solr will do the escaping vs. the client).
>
> Thanks
>
> Steve
>
> On Mon, Apr 20, 2015 at 1:08 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> How does that address the example query I gave?
>>
>> q=field1:whatever AND (a AND field:b) OR (field2:c AND "d: is a letter
>> followed by a colon (:)").
>>
>> bq: "Solr will treat everything in the search string by first passing
>> it to ClientUtils.escapeQueryChars()."
>>
>> would incorrectly escape the colons after field1, field, field2 and
>> correctly escape the colon after d and in parens. And parens are a
>> reserved character too, so it would incorrectly escape _all_ the
>> parens except the ones surrounding the colon.
>>
>> The list of reserved characters is pretty unchanging, so I don't think
>> it's too much to ask the app layer, which knows (at least it better
>> know) which bits of the query were user entered, what rules apply as
>> to whether the user can enter field-qualified searches etc. Only armed
>> with that knowledge can the right thing be done, and Solr has no
>> knowledge of those rules.
>>
>> If you insist that the client shouldn't deal with that, you could
>> always write a custom component that enforces the rules that are
>> particular to your setup. For instance, you may have a rule that you
>> can never field-qualify any term, in which case escaping on the Solr
>> side would work in _your_ situation. But the general case just doesn't
>> fit into the "escape on the Solr side" paradigm.
>>
>> Best,
>> Erick
>>
>>
>> On Mon, Apr 20, 2015 at 9:55 AM, Steven White <swhite4...@gmail.com>
>> wrote:
>>> Hi Erick,
>>>
>>> I didn't know about ClientUtils.escapeQueryChars(), this is good to know.
>>> Unfortunately I cannot use it because it means I have to import Solr
>>> classes with my client application. I want to avoid that and create a
>>> loose coupling between my application and Solr (just rely on REST).
>>>
>>> My suggestion is to add a new URL parameter to Solr, such as
>>> "q.ignoreOperators=[true | false]" (or some other name). If this parameter
>>> is set to "false" or is missing, then the current behavior takes effect; if
>>> it is set to "true", then Solr will treat everything in the search string by
>>> first passing it to ClientUtils.escapeQueryChars(). This way, the client
>>> application doesn't have to: a) be tightly coupled with Solr (required to
>>> link with Solr JARs to use escapeQueryChars), and b) keep up with Solr when
>>> new operators are added.
>>>
>>> What do you think?
>>>
>>> Steve
>>>
>>> On Mon, Apr 20, 2015 at 12:41 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Steve:
>>>>
>>>> In short, no. There's no good way for Solr to solve this problem in
>>>> the _general_ case. Well, actually we could create parsers with rules
>>>> like "if the colon is inside a paren, escape it". Which would
>>>> completely break someone who wants to form queries like
>>>>
>>>> q=field1:whatever AND (a AND field:b) OR (field2:c AND "d: is a letter
>>>> followed by a colon (:)").
>>>>
>>>> You say: "A better solution would be to have Solr support a new
>>>> parameter that I can pass to Solr as part of the URL."
>>>>
>>>> How would Solr know _which_ parts of the URL to escape in the case above?
>>>>
>>>> You have to do this at the app layer as that's the only place that has
>>>> a clue what the peculiarities of the situation are.
>>>>
>>>> But if you're using SolrJ in your app layer, you can use
>>>> ClientUtils.escapeQueryChars() for user-entered data to do the
>>>> escaping without you having to maintain a separate list.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Apr 20, 2015 at 8:39 AM, Steven White <swhite4...@gmail.com>
>>>> wrote:
>>>>> Hi Shawn,
>>>>>
>>>>> If the user types "title:(Apache: Solr Notes)" (without quotes) then I want
>>>>> Solr to treat the whole string as a raw text string, as if I escaped ":", "("
>>>>> and ")" and any other reserved Solr keywords / tokens. Using dismax it
>>>>> worked for the ":" case, but I still get SyntaxError if I pass it the
>>>>> following "title:(Apache: Solr Notes) AND" (here is the full URL):
>>>>>
>>>>> http://localhost:8983/solr/db/select?q=title:(Apache:%20Solr%20Notes)%20AND&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND&defType=dismax&qf=title
>>>>>
>>>>> So far, the only solution I can find is for my application to escape all
>>>>> Solr operators before sending the string to Solr. This is fine, but it
>>>>> means my application will have to adapt to Solr's reserved operators as
>>>>> Solr grows (if Solr 5.x / 6.x adds a new operator, I have to add that to my
>>>>> application's escape list). A better solution would be to have Solr support
>>>>> a new parameter that I can pass to Solr as part of the URL.
>>>>> This parameter will tell Solr to do the escaping for me or not (missing
>>>>> means the same as don't do the escaping).
>>>>>
>>>>> Thanks
>>>>>
>>>>> Steve
>>>>>
>>>>> On Mon, Apr 20, 2015 at 10:05 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>>>>
>>>>>> On 4/20/2015 7:41 AM, Steven White wrote:
>>>>>>> In my application, a user types "Apache Solr Notes". I take that text and
>>>>>>> send it over to Solr like so:
>>>>>>>
>>>>>>> http://localhost:8983/solr/db/select?q=title:(Apache%20Solr%20Notes)&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND
>>>>>>>
>>>>>>> And I get a hit on "Apache Solr Release Notes". This is all good.
>>>>>>>
>>>>>>> Now if the same user types "Apache: Solr Notes" (notice the ":" after
>>>>>>> "Apache") I will get a SyntaxError. The fix is to escape ":" before I send
>>>>>>> it to Solr. What I want to figure out is how can I tell Solr / Lucene to
>>>>>>> ignore ":" and escape it for me? In this example, I used ":" but my need
>>>>>>> is for all other operators and reserved Solr / Lucene characters.
>>>>>>
>>>>>> If we assume that what you did for the first query is what you will do
>>>>>> for the second query, then this is what you would have sent:
>>>>>>
>>>>>> q=title:(Apache: Solr Notes)
>>>>>>
>>>>>> How is the parser supposed to know that only the second colon should be
>>>>>> escaped, and not the first one? If you escape them both (or treat the
>>>>>> entire query string as query text), then the fact that you are searching
>>>>>> the "title" field is lost. The text "title" becomes an actual part of
>>>>>> the query, and may not match, depending on what you have done with other
>>>>>> parameters, such as the default operator.
>>>>>>
>>>>>> If you use the dismax parser (*NOT* the edismax parser, which parses
>>>>>> field:value queries and boolean operator syntax just like the lucene
>>>>>> parser), you may be able to achieve what you're after.
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
>>>>>> https://wiki.apache.org/solr/DisMaxQParserPlugin
>>>>>>
>>>>>> With dismax, you would use the qf and possibly the pf parameter to tell
>>>>>> it which fields to search and send this as the query:
>>>>>>
>>>>>> q=Apache: Solr Notes
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
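
For anyone who wants the app-layer escaping discussed in this thread without linking against SolrJ, here is a rough Python sketch. It is not Solr code. The function name escape_query_chars is made up for illustration, and the character set is my assumption about what ClientUtils.escapeQueryChars() covers, so check it against the Solr version you actually run. It only escapes the user-entered text; the application still decides which field to put in front of it.

```python
# Assumption: this set mirrors the characters that SolrJ's
# ClientUtils.escapeQueryChars() escapes. Verify against your Solr version.
RESERVED = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text: str) -> str:
    """Backslash-escape reserved query characters and whitespace.

    Hypothetical helper: apply it only to the user-entered part of a
    query, never to syntax the application adds itself (such as "title:").
    """
    out = []
    for ch in text:
        if ch in RESERVED or ch.isspace():
            out.append("\\")
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # User input taken from the messages in this thread.
    for raw in ["Apache: Solr Notes", "+/-", ".hack//Roots", "p=mv"]:
        print(repr(raw), "->", escape_query_chars(raw))
```

The application would then build something like "title:" + escape_query_chars(user_text) and URL-encode the result before sending it. Note that whitespace is escaped too, so the input reaches the field as one chunk of text rather than as separate clauses.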