Re: Atomic update wrongly deletes child documents
Hi, I was able to work around the issue. I'm now using a custom UpdateRequestProcessor that removes undefined fields, so that I was able to remove the catch-all dynamic field "ignored" from my schema. Of course, one has to be careful not to remove fields that are used for nested documents in the URP. I think it would still make sense to fix the original issue, or at least document it as a caveat. I'm going to create a JIRA ticket for this soon, if that's okay. Regards, Andreas -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Use stream result like a query (alternative to innerJoin)
Fetch would work for my specific case (since I’m working with IDs, there’s no one-to-many relationship), if I were able to restrict fetch’s target domain with a query. I would first get all possible deleted IDs, then use fetch against the items collection. But the current fetch implementation would find all deleted items, not something like “deleted items with these names” or “deleted items between these times”. I came upon your video while researching this: https://www.youtube.com/watch?v=kTNe3TaqFvo I’m trying to use the “let” expression to feed one stream’s result to another as a query, using the string concat function and the eval stream. So far I haven’t been able to write a working example, but it’s an idea that I’m playing with. Sent from Mail for Windows 10 From: Joel Bernstein Sent: 23 November 2020 23:23 To: solr-user@lucene.apache.org Subject: Re: Use stream result like a query (alternative to innerJoin) H
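For illustration only, the "feed one stream's result to another as a query" idea can be approximated on the client side: collect the IDs from the first result, then concatenate them into a query string restricted by an extra clause. This is a minimal Python sketch, not a streaming expression; the function name, the field names (`id`, `name`) and the assumption that IDs contain no double quotes are all hypothetical.

```python
def build_restricted_query(ids, extra_filter=None):
    """Turn a list of IDs (e.g. from a first 'deleted ids' search) into a query string.

    Hypothetical helper: assumes the unique key field is called "id" and that
    the ID values contain no double quotes.
    """
    # Quote each ID and OR them together: id:("1" OR "2")
    id_clause = "id:(" + " OR ".join(f'"{doc_id}"' for doc_id in ids) + ")"
    if extra_filter:
        # Narrow the target domain, e.g. "deleted items with these names"
        return f"{id_clause} AND ({extra_filter})"
    return id_clause


deleted_ids = ["1", "2"]
print(build_restricted_query(deleted_ids, "name:foo"))
```

The resulting string would then be sent as the `q` (or `fq`) of a second request against the items collection, which is the part the built-in fetch does not let you restrict.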
Re: Atomic update wrongly deletes child documents
Sure, raise a JIRA. Thanks for the update...
Re: Query generation is different for search terms with and without "-"
This is a common point of confusion. There are two phases for creating a query, query _parsing_ first, then the analysis chain for the parsed result. So what e-dismax sees in the two cases is: Name_enUS:“high tech” -> two tokens, since there are two of them pf2 comes into play. Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply, splitting it on the hyphen comes later. It’s especially confusing since the field analysis then breaks up “high-tech” into two tokens that look the same as “high tech” in the debug response, just without the phrase query. Name_enUS:high Name_enUS:tech Best, Erick > On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez > wrote: > > I am troubleshooting an issue with ranking for search terms that contain a > "-" vs the same query that does not contain the dash e.g. "high-tech" vs > "high tech". The field that I am querying is using the standard tokenizer, > so I would expect that the underlying lucene query should be the same for > both versions of the query, however when printing the debug, it appears > they are generated differently. I know "-" must be escaped as it has > special meaning in lucene, however escaping does not fix the problem. It > appears that with the "-" present, the pf2 edismax parameter is not > respected and omitted from the final query. We use sow=false as we have > multiterm synonyms and need to ensure they are included in the final lucene > query. My expectation is that the final underlying lucene query should be > based on the output of the field analyzer, however after briefly looking > at the code for ExtendedDismaxQParser, it appears that there is some string > processing happening outside of the analysis step which causes the > unexpected lucene query. 
> > > Solr Debug for "high tech": > > parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4) > DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2 > DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4) > DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)", > parsedquery_toString: "+(((Name_enUS:high)~0.4 > (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4 > (Name_enUS:"high tech"~4)~0.4", > > > Solr Debug for "high-tech" > > parsedquery: "+DisjunctionMaxQueryName_enUS:high > Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high > tech"~5)~0.4)", > parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4 > (Name_enUS:"high tech"~5)~0.4" > > SolrConfig: > > > > true > true > json > 3<75% > Name_enUS > Name_enUS > 5 > Name_enUS > 4 > 3 > 0.4 > explicit > 100 > false > > > edismax > > > > Schema: > > > > > > > > > > > > Using Solr 8.6.3 > > -- > *The information contained in this message is the sole and exclusive > property of ***iHerb Inc.*** and may be privileged and confidential. It may > not be disseminated or distributed to persons or entities other than the > ones intended without the written authority of ***iHerb Inc.** *If you have > received this e-mail in error or are not the intended recipient, you may > not use, copy, disseminate or distribute it. Do not open any attachments. > Please delete it immediately from your system and notify the sender > promptly by e-mail that you have done so.*
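The two-phase behaviour Erick describes (query parsing first, field analysis second) can be illustrated with a toy model. This is a deliberately simplified Python sketch, not the ExtendedDismaxQParser code: the function name is made up, parsing is reduced to whitespace splitting, and analysis is reduced to splitting on non-alphanumeric characters.

```python
import re


def toy_edismax(user_query):
    """Toy model of 'parse first, analyze later' (hypothetical, heavily simplified)."""
    # Phase 1: parsing. The parser only sees whitespace-separated clauses,
    # so "high-tech" is still a single clause at this point.
    clauses = user_query.split()

    # pf2 builds phrases from adjacent pairs of parsed clauses,
    # so it needs at least two clauses to produce anything.
    pf2_phrases = [f'"{a} {b}"' for a, b in zip(clauses, clauses[1:])]

    # Phase 2: analysis. Each clause is tokenized on its own; only here
    # does the hyphen split "high-tech" into "high" and "tech".
    terms = [
        token.lower()
        for clause in clauses
        for token in re.split(r"[^A-Za-z0-9]+", clause)
        if token
    ]
    return terms, pf2_phrases


print(toy_edismax("high tech"))
print(toy_edismax("high-tech"))
```

Both inputs end up with the same analyzed terms, but only the whitespace-separated form yields a pf2 phrase, which matches the difference visible in the two debug outputs above.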
Re: Query generation is different for search terms with and without "-"
Is the normal/standard solution here to regex remove the '-'s and combine them into a single token?
Re: disallowing delete through security.json
Hey Craig, I think this will be tricky to do with the current Rule-Based Authorization support. As you pointed out in your initial post - there are lots of ways to delete documents. The Rule-Based Auth code doesn't inspect request bodies (AFAIK), so it's going to have trouble differentiating between traditional "/update" requests with method=POST that are request-body driven. But to zoom out a bit, does it really make sense to lock down deletes, but not updates more broadly? After all, "updates" can remove and add fields. Users might submit an update that strips everything but "id" from your documents. In many/most use cases that'd be equally concerning. Just wondering what your use case is - if it's generally applicable this is probably worth a JIRA ticket. Best, Jason On Thu, Nov 19, 2020 at 10:34 AM Oakley, Craig (NIH/NLM/NCBI) [C] wrote: > > Having not heard back, I thought I would ask again whether anyone else has > been able to use security.json to disallow deletes, and/or if anyone has > examples of using the "method" section in > lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html > > -Original Message- > From: Oakley, Craig (NIH/NLM/NCBI) [C] > Sent: Monday, October 26, 2020 6:23 PM > To: solr-user@lucene.apache.org > Subject: disallowing delete through security.json > > I am interested in disallowing delete through security.json > > After seeing the "method" section in > lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html my > first attempt was as follows: > > {"set-permission":{ > "name":"NO_delete", > "path":["/update/*","/update"], > "collection":col_name, > "role":"NoSuchRole", > "method":"DELETE", > "before":4}} > > I found, however, that this did not disallow deletes: I could still run > curl -u ... 
"http://.../solr/col_name/update?commit=true" --data > "id:11" > > After further experimentation, I seemed to have success with > {"set-permission": > {"name":"NO_delete6", > "path":"/update/*", > "collection":"col_name", > "role":"NoSuchRole", > "method":["REGEX:(?i)DELETE"], > "before":4}} > > My initial impression was that this did what I wanted; but now I find that > this disallows *any* updates to this collection (which had previously been > allowed). Other attempts to tweak this strategy, such as granting permissions > for "/update/*" for methods other than DELETE to a role which is granted to > the desired user, have not yet been successful. > > Does anyone have an example of security.json disallowing a delete while still > allowing an update? > > Thanks
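To make Jason's point concrete: a delete and an add sent to "/update" differ only in the request body, not in the HTTP method or path, so a rule keyed on "method":"DELETE" has nothing to match. The following is a minimal Python sketch that builds (but never sends) two such requests; the host, port and document fields are hypothetical, and the payloads assume Solr's JSON update format.

```python
import json
import urllib.request

# Hypothetical endpoint; only the collection name comes from the thread.
UPDATE_URL = "http://localhost:8983/solr/col_name/update?commit=true"


def update_request(payload):
    """Build (without sending) a JSON update request for the /update handler."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


add_req = update_request({"add": {"doc": {"id": "11", "title": "example"}}})
delete_req = update_request({"delete": {"query": "id:11"}})

# Same method, same URL; only the body says "add" or "delete".
print(add_req.get_method(), delete_req.get_method())
print(add_req.full_url == delete_req.full_url)
```

A permission that filters on path and method alone therefore either lets both requests through or blocks both, which is consistent with the behaviour Craig observed.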
Re: Query generation is different for search terms with and without "-"
Are there any good workarounds/parameters we can use to fix this so it doesn't have to be solved client side?
RE: disallowing delete through security.json
Thank you for the response. The use case I have in mind is trying to approximate incremental updates (as are available in Sybase or MSSQL, to which I am more accustomed). We want to upgrade a large collection from Solr7.4 to Solr8.5. It turns out that Solr8.5 cannot run against the current data, because the collection was created under Solr6.6. We want to migrate in such a way that, in a year or so, we will be able to migrate to Solr9 without worrying about Solr7.4 let alone Solr6.6. We want to create a new collection (of the same name) in a brand new Solr8.5 SolrCloud, and then to select everything from the current Solr7.4 collection in JSON format and load it into the new Solr8.5 collection. All of the fields have stored="true", with the exception of fields populated by copyField. The select will be done by ranges of id values, so as to avoid OutOfMemory errors. That process will take several days; and in the meanwhile, users will be continuing to add data. Once all the data has been copied (including that which is described below), we can switch port numbers so that the new Solr8.5 SolrCloud takes the place of the old Solr7.4 SolrCloud. The plan is to find a value of _version_ (call it V1) which was in the Solr7.4 collection when we started the first select, but which is greater than almost all values of _version_ in the collection (we are fine with having an overlap of _version_ values, but we want to avoid losing anything by having a gap in _version_ values). After the initial selects are complete, we can run other selects by ranges of id with the additional criterion that the _version_ be no lower than the V1 value. As we have seen in test runs, this will involve less data and will run faster. We will also keep note of a new value of _version_ (call it V2) which was in the Solr7.4 collection when we start the V1 select, but which is greater than almost all values of _version_ in the V1 select.
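As a rough illustration of the watermark scheme described so far, here is a small in-memory Python simulation. It is only a sketch under simplifying assumptions: the two collections are plain dictionaries, the helper names are invented, and the watermark is simply the highest _version_ present when a round starts, so consecutive rounds overlap rather than leave a gap.

```python
def max_version(source):
    """Watermark: the highest _version_ present when a round starts."""
    return max(doc["_version_"] for doc in source.values())


def copy_round(source, target, min_version=0):
    """Copy every doc whose _version_ is at least min_version.

    Returns the watermark to use as min_version in the next round.
    """
    # Record the watermark before selecting anything.
    watermark = max_version(source)
    for doc_id, doc in source.items():
        if doc["_version_"] >= min_version:
            target[doc_id] = dict(doc)
    return watermark


# Old collection (stand-in for Solr7.4) and new, empty one (stand-in for Solr8.5).
source = {
    "a": {"id": "a", "_version_": 10},
    "b": {"id": "b", "_version_": 11},
}
target = {}

v1 = copy_round(source, target)              # initial full copy
source["c"] = {"id": "c", "_version_": 12}   # a user adds a document meanwhile
source["a"] = {"id": "a", "_version_": 13}   # a user updates an existing one
v2 = copy_round(source, target, v1)          # only _version_ >= V1 is re-selected

print(v1, v2, target == source)
```

In a real run each round would be split into id ranges and would take long enough for more changes to arrive, which is why the rounds are repeated until the last one is small.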
Following this procedure through various iterations (V3, V4, however many it takes), we can load the V1 set of selects when we will have completed the loading of the initial set of selects. We can then load the V2 set of selects when we will have completed the loading of the V1 set of selects. The plan is that the selecting and loading of the last Vn set of selects will involve a maintenance window measured in minutes rather than in days. The users claim that they never do deletes: which is good, because a delete would be something which would be missed by this plan. If (as you describe) the users were to update a record so that only the id field (and the _version_ field) are left, that update would get picked up by one of these incremental selects and would be applied to the new collection. A delete, however, would not be noticed: and the new Solr8.5 collection would still have the record which had been deleted from the old Solr7.4 collection. The users claim that they never do deletes: but it would seem safer to actually disallow deletes during the maintenance. Let me know if you have any suggestions. Thank you again for your reply. -Original Message- From: Jason Gerlowski Sent: Tuesday, November 24, 2020 12:35 PM To: solr-user@lucene.apache.org Subject: Re: disallowing delete through security.json Hey Craig, I think this will be tricky to do with the current Rule-Based Authorization support. As you pointed out in your initial post - there are lots of ways to delete documents. The Rule-Based Auth code doesn't inspect request bodies (AFAIK), so it's going to have trouble differentiating between traditional "/update" requests with method=POST that are request-body driven. But to zoom out a bit, does it really make sense to lock down deletes, but not updates more broadly? After all, "updates" can remove and add fields. Users might submit an update that strips everything but "id" from your documents. In many/most usecases that'd be equally concerning. 
Just wondering what your usecase is - if it's generally applicable this is probably worth a JIRA ticket. Best, Jason On Thu, Nov 19, 2020 at 10:34 AM Oakley, Craig (NIH/NLM/NCBI) [C] wrote: > > Having not heard back, I thought I would ask again whether anyone else has > been able to use security.json to disallow deletes, and/or if anyone has > examples of using the "method" section in > lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html > > -Original Message- > From: Oakley, Craig (NIH/NLM/NCBI) [C] > Sent: Monday, October 26, 2020 6:23 PM > To: solr-user@lucene.apache.org > Subject: disallowing delete through security.json > > I am interested in disallowing delete through security.json > > After seeing the "method" section in > lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html my > first attempt was as follows: > > {"set-permission":{ > "name":"NO_delete", > "path":["/update/*","/update"], > "collection":col_name, > "role":"NoSuchR