Re: Interesting issue with "special characters" in a string field value

Jack Park Sun, 24 Feb 2013 14:50:33 -0800

I did attempt queries with and without escaping in the admin query
browser; it made no difference. I seem to recall that the system did not
work without escaping, but it does seem worth disabling escaping and
testing again.


Many thanks
Jack

On Sun, Feb 24, 2013 at 1:16 PM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> Hello Jack,
>
> I'm not sure if this is an option for you, but if you submit and
> retrieve your documents using only SolrJ, you won't have to worry
> about escaping them for encoding into a particular document format.
> SolrJ would handle that for you.
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn’t a Game
>
>
> On Sun, Feb 24, 2013 at 12:29 AM, Jack Park <jackp...@topicquests.org> wrote:
>> Ok. I have revisited this issue as deeply as possible using simplistic
>> unit tests, tossing out indexes, and starting fresh.
>>
>> A typical Solr document might have a label, e.g. the string inside the
>> quotes: "Node Type".  That would be queried, according to what I've
>> been able to read, as a Phrase Query, which means, include the quotes
>> around the text.
>>
>> When I use the admin query panel with this query:
>> label:"Node Type"
>> A fragment of the full document is returned. It is this:
>>
>>   <doc>
>>     <str name="locator">NodeType</str>
>>     <arr name="label">
>>       <str>Node Type</str>
>>     </arr>
>>
>> In my code using SolrJ, I have printlines just as the "escaped" query
>> string comes in, and one which shows what the SolrQuery looks like
>> after setting it up to go online. I then show what came back:
>>
>> Solr3Client.runQuery- label:"Node Type" 0 10
>> Solr3Client.runQuery-1 q=label%3A%22Node+Type%22&start=0&rows=10
>> ZZZZ {numFound=1,start=0,docs=[SolrDocument{locator=NodeType,
>> smallIcon=cogwheel.png, subOf=ClassType, details=The TopicQuests
>> typology node type., isPrivate=false, creatorId=SystemUser, label=Node
>> Type, largeIcon=cogwheel.png, lastEditDate=Sat Feb 23 20:43:22 PST
>> 2013, createdDate=Sat Feb 23 20:43:22 PST 2013,
>> _version_=1427826019119661056}]}
>>
>> What that says is that SolrQuery inserted a + inside the query string,
>> and that it found 1 document, but did not return it.
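
The `+` in the logged parameter string is most likely not a change to the
query itself: in URL form encoding a space is written as `+`, and the server
decodes it back to a space. A minimal sketch using only the JDK reproduces the
logged value. The class name QueryParamEncoding is hypothetical, and this is
not SolrJ's own code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, JDK only: shows how a query string looks once it is
// form-encoded for use as a URL parameter, and that decoding restores it.
class QueryParamEncoding {

    // Form-encode a raw query string (space becomes '+', ':' becomes %3A, '"' becomes %22).
    static String encode(String rawQuery) {
        try {
            return URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM.
            throw new IllegalStateException(e);
        }
    }

    // Reverse the form encoding, as the receiving server would.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "label:\"Node Type\"";
        String encoded = encode(raw);
        System.out.println(encoded);         // label%3A%22Node+Type%22
        System.out.println(decode(encoded)); // label:"Node Type"
    }
}
```

Under that reading, the phrase query that reaches the query parser is still
label:"Node Type", with a plain space between the two words.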
>>
>> In the largest picture, I have returned to using XMLResponseParser on
>> the theory that I will now be able to take advantage of partialUpdates
>> on multi-valued fields (List<String>) but haven't tested that yet. I
>> am not yet escaping such things as "<" or ">" but just escaping those
>> things mentioned in the Solr documents which are reserved characters.
>>
>> So, the current update is this: learning about phrase queries, and
>> judicious escaping of reserved characters seems to be helping. Next up
>> entails two issues: more robust testing of escaped characters, and
>> trying to discover what is the best approach to dealing with
>> characters that must be escaped to get past XML, e.g. '<', '>', and
>> others.
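
For the XML side, a rough sketch of what escaping and unescaping the five
predefined XML entities looks like is given here. It assumes documents are
being assembled as XML text by hand; the class XmlTextEscaping is
hypothetical, and it does not handle numeric character references. When
documents go through SolrJ or a real XML library, this step should already be
handled for you.

```java
// Hypothetical helper: escapes and unescapes the five predefined XML entities.
class XmlTextEscaping {

    // Replace characters that are meaningful to an XML parser with entities.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse escape(). "&amp;" is handled last so that an escaped ampersand
    // followed by "lt;" is not mistaken for an escaped '<'.
    static String unescape(String xml) {
        return xml.replace("&lt;", "<")
                  .replace("&gt;", ">")
                  .replace("&quot;", "\"")
                  .replace("&apos;", "'")
                  .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "<Something to use as a source node>";
        String escaped = escape(raw);
        System.out.println(escaped);           // &lt;Something to use as a source node&gt;
        System.out.println(unescape(escaped)); // <Something to use as a source node>
    }
}
```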
>>
>> Many thanks
>> Jack
>>
>>
>> On Fri, Feb 22, 2013 at 2:44 PM, Jack Park <jackp...@topicquests.org> wrote:
>>> Michael,
>>> I don't think you misunderstood. I will soon give a full response here, but
>>> am on the road at the moment.
>>>
>>> Many thanks
>>> Jack
>>>
>>>
>>> On Friday, February 22, 2013, Michael Della Bitta
>>> <michael.della.bi...@appinions.com> wrote:
>>>> My mistake, I misunderstood the problem.
>>>>
>>>> Michael Della Bitta
>>>>
>>>>
>>>> On Fri, Feb 22, 2013 at 3:55 PM, Chris Hostetter
>>>> <hossman_luc...@fucit.org> wrote:
>>>>>
>>>>> : If you're submitting documents as XML, you're always going to have to
>>>>> : escape meaningful XML characters going in. If you ask for them back as
>>>>> : XML, you should be prepared to unescape special XML characters as
>>>>>
>>>>> that still wouldn't explain the discrepancy he's claiming to see between
>>>>> the json & xml responses (the json containing an empty string)
>>>>>
>>>>> Jack: please elaborate with specifics about your solr version, field,
>>>>> field type, how you indexed your doc, and what the request urls & raw
>>>>> responses that you get are (ie: don't trust the XML you see in your
>>>>> browser, it may be unescaping escaped sequences in element text to be
>>>>> "helpful" .. use something like curl)
>>>>>
>>>>> For example...
>>>>>
>>>>> ----BEGIN GOOD EXAMPLE OF SPECIFICS---
>>>>>
>>>>> I'm using Solr 4.x with the 4.x example schema which has the following
>>>>> field...
>>>>>
>>>>>    <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
>>>>>    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
>>>>>
>>>>> I indexed a doc like this...
>>>>>
>>>>> $ curl "http://localhost:8983/solr/update?commit=true" \
>>>>>     -H 'Content-type:application/json' \
>>>>>     -d '[{"id":"hoss", "cat":"<Something to use as a source node>" } ]'
>>>>>
>>>>> And this is what i get from the following requests...
>>>>>
>>>>> $ curl "http://localhost:8983/solr/select?q=id:hoss&wt=xml&indent=true&omitHeader=true"
>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>> <response>
>>>>>
>>>>> <result name="response" numFound="1" start="0">
>>>>>   <doc>
>>>>>     <str name="id">hoss</str>
>>>>>     <arr name="cat">
>>>>>       <str>&lt;Something to use as a source node&gt;</str>
>>>>>     </arr>
>>>>>     <long name="_version_">1427705631375097856</long></doc>
>>>>> </result>
>>>>> </response>
>>>>>
>>>>> $ curl "http://localhost:8983/solr/select?q=id:hoss&wt=json&indent=true&omitHeader=true"
>>>>> {
>>>>>   "response":{"numFound":1,"start":0,"docs":[
>>>>>       {
>>>>>         "id":"hoss",
>>>>>         "cat":["<Something to use as a source node>"],
>>>>>         "_version_":1427705631375097856}]
>>>>>   }}
>>>>>
>>>>> $ curl "http://localhost:8983/solr/select?q=cat:%22<Something+to+use+as+a+source+node>%22&wt=json&indent=true&omitHeader=true"
>>>>> {
>>>>>   "response":{"numFound":1,"start":0,"docs":[
>>>>>       {
>>>>>         "id":"hoss",
>>>>>         "cat":["<Something to use as a source node>"],
>>>>>         "_version_":1427705631375097856}]
>>>>>   }}
>>>>>
>>>>> ----END GOOD EXAMPLE OF SPECIFICS---
>>>>>
>>>>> : > Even more curious, if I use this query at the console:
>>>>> : >
>>>>> : > details:<Something to use as a source node>
>>>>> : >
>>>>> : > I get nothing back.
>>>>>
>>>>> note in my last example above the importance of using quotes (or the
>>>>> {!term} qparser) to query string fields that contain special characters
>>>>> like whitespace -- whitespace is syntactically meaningful to the lucene query
>>>>> parser, it separates clauses of a boolean query.
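
As an illustration of the two options mentioned above (quotes, or the {!term}
qparser), here is a small sketch of building such query strings in client
code. The class StringFieldQueries and its method names are hypothetical. The
sketch assumes that inside a quoted phrase only the backslash and the double
quote need escaping, and it builds plain strings only; it does not call SolrJ.

```java
// Hypothetical helper: builds query strings for exact matches on string fields.
class StringFieldQueries {

    // Wrap the value in double quotes so whitespace is not treated as a
    // clause separator; escape backslashes first, then embedded quotes.
    static String phraseQuery(String field, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    // Use the {!term} query parser, which takes the rest of the string as a
    // single raw term for the given field.
    static String termQuery(String field, String value) {
        return "{!term f=" + field + "}" + value;
    }

    public static void main(String[] args) {
        String value = "<Something to use as a source node>";
        System.out.println(phraseQuery("cat", value)); // cat:"<Something to use as a source node>"
        System.out.println(termQuery("cat", value));   // {!term f=cat}<Something to use as a source node>
    }
}
```

Either string would still need URL encoding before being placed in a request
URL by hand, as in the curl example above.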
>>>>>
>>>>>
>>>>> -Hoss
>>>>
