[jira] [Commented] (SOLR-14944) solr metrics should remove "spins" references
[ https://issues.apache.org/jira/browse/SOLR-14944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219616#comment-17219616 ] ASF subversion and git services commented on SOLR-14944: Commit 97551dd644b94390f696c907d94ed602657844db in lucene-solr's branch refs/heads/branch_8x from Andrzej Bialecki [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=97551dd ] SOLR-14944: Fix the description to reflect the fact that this is not removed in 8.7. > solr metrics should remove "spins" references > - > > Key: SOLR-14944 > URL: https://issues.apache.org/jira/browse/SOLR-14944 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: metrics >Affects Versions: master (9.0) >Reporter: Robert Muir >Assignee: Andrzej Bialecki >Priority: Major > Fix For: master (9.0), 8.7 > > > The lucene internal IOUtils.spins stuff was exposed in various ways here, in > order to not break stuff in LUCENE-9576 I simply wired these apis to > {{false}}, but they should probably be removed.
[jira] [Commented] (SOLR-14944) solr metrics should remove "spins" references
[ https://issues.apache.org/jira/browse/SOLR-14944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219615#comment-17219615 ] ASF subversion and git services commented on SOLR-14944: Commit b8a3d11c47d22f5b61ceacb6b289ab19ee69dfdc in lucene-solr's branch refs/heads/branch_8_7 from Andrzej Bialecki [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b8a3d11 ] SOLR-14944: Fix the description to reflect the fact that this is not removed in 8.7. > solr metrics should remove "spins" references > - > > Key: SOLR-14944 > URL: https://issues.apache.org/jira/browse/SOLR-14944 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: metrics >Affects Versions: master (9.0) >Reporter: Robert Muir >Assignee: Andrzej Bialecki >Priority: Major > Fix For: master (9.0), 8.7 > > > The lucene internal IOUtils.spins stuff was exposed in various ways here, in > order to not break stuff in LUCENE-9576 I simply wired these apis to > {{false}}, but they should probably be removed.
[jira] [Commented] (SOLR-13973) Deprecate Tika
[ https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219642#comment-17219642 ] Jan Høydahl commented on SOLR-13973: I'm aware of that design doc, and my comment was mostly referring to the "slim" distro, where 1st party packages need to be released outside of the solr tarball, but still as an official release from the project. > Deprecate Tika > -- > > Key: SOLR-13973 > URL: https://issues.apache.org/jira/browse/SOLR-13973 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Priority: Major > Fix For: 8.7 > > Time Spent: 10m > Remaining Estimate: 0h > > Solr's primary responsibility should be to focus on search and scalability. > Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us > down. I propose that we deprecate them going forward. > Tika can be run outside Solr. Going forward, if someone wants to use these, > it should be possible to bring them into third party packages and install > them via the package manager. > The plan is just to throw warnings in logs and add deprecation notes in the > reference guide for now. Removal can be done in 9.0.
[GitHub] [lucene-solr] dsmiley commented on pull request #1993: .gitignore clean up
dsmiley commented on pull request #1993: URL: https://github.com/apache/lucene-solr/pull/1993#issuecomment-715309491 @msokolov you added `.#*` -- what comment should I use in this file to explain what this is?
[jira] [Commented] (SOLR-14354) HttpShardHandler send requests in async
[ https://issues.apache.org/jira/browse/SOLR-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219649#comment-17219649 ] David Smiley commented on SOLR-14354: - FYI I did last night and +1'ed it. BUILD SUCCESSFUL Total time: 42 minutes 55 seconds

> HttpShardHandler send requests in async
> ---------------------------------------
>
>                 Key: SOLR-14354
>                 URL: https://issues.apache.org/jira/browse/SOLR-14354
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Blocker
>             Fix For: master (9.0), 8.7
>
>      Attachments: image-2020-03-23-10-04-08-399.png, image-2020-03-23-10-09-10-221.png, image-2020-03-23-10-12-00-661.png
>
>       Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> h2. 1. Current approach (problem) of Solr
> Below is a diagram describing how a request is currently handled.
> !image-2020-03-23-10-04-08-399.png!
> The main thread that handles the search request will submit n requests (n equals the number of shards) to an executor, so each request corresponds to a thread. After sending a request, that thread basically does nothing but wait for the response from the other side. That thread will be swapped out and the CPU will try to handle another thread (this is called a context switch: the CPU saves the context of the current thread and switches to another one). When some data (not all) comes back, that thread is woken up to parse it, then it waits until more data comes back. So there will be lots of context switching in the CPU, which is quite an inefficient use of threads. Basically we want fewer threads, and most of them should be busy all the time, because neither threads nor context switches are free. That is the main idea behind things like executors.
> h2. 2. Async call of Jetty HttpClient
> Jetty HttpClient offers an async API like this.
> {code:java}
> httpClient.newRequest("http://domain.com/path")
>     // Add request hooks
>     .onRequestQueued(request -> { ... })
>     .onRequestBegin(request -> { ... })
>     // Add response hooks
>     .onResponseBegin(response -> { ... })
>     .onResponseHeaders(response -> { ... })
>     .onResponseContent((response, buffer) -> { ... })
>     .send(result -> { ... }); {code}
> Therefore after calling {{send()}} the thread returns immediately without blocking. When the client receives the headers from the other side, it calls the {{onResponseHeaders()}} listeners. When the client receives some {{byte[]}} (not the whole response), it calls the {{onResponseContent(buffer)}} listeners. When everything is finished it calls the {{onComplete}} listeners. One main thing to notice here is that all listeners should finish quickly; if a listener blocks, no further data for that request will be handled until the listener finishes.
> h2. 3. Solution 1: Sending requests async but spinning one thread per response
> Jetty HttpClient already provides several listeners, one of which is InputStreamResponseListener. This is how it is used
> {code:java}
> InputStreamResponseListener listener = new InputStreamResponseListener();
> client.newRequest(...).send(listener);
> // Wait for the response headers to arrive
> Response response = listener.get(5, TimeUnit.SECONDS);
> if (response.getStatus() == 200) {
>     // Obtain the input stream on the response content
>     try (InputStream input = listener.getInputStream()) {
>         // Read the response content
>     }
> } {code}
> In this case, there will be two threads:
> * one thread trying to read the response content from the InputStream
> * one thread (a short-lived task) feeding content to the above InputStream whenever some byte[] is available. Note that if this thread is unable to feed data into the InputStream, it will wait.
> By using this, the model of HttpShardHandler can be rewritten into something like this
> {code:java}
> handler.sendReq(req, (is) -> {
>     executor.submit(() ->
>         try (is) {
>             // Read the content from InputStream
>         }
>     )
> }) {code}
> The first diagram then changes into this
> !image-2020-03-23-10-09-10-221.png!
> Notice that although “sending req to shard1” is wide, it won’t take a long time since sending a request is a very quick operation. With this approach, handling threads won’t be spun up until the first bytes are sent back. Notice that in this approach we still have active threads waiting for more data from the InputStream.
> h2. 4. Solution 2: Buffering data and handling it inside Jetty’s thread.
> Jetty has another listener called BufferingResponseListener. This is how it is used
> {code:java}
> client.newRequest(...).send(new BufferingResponseListener() {
>     public void onComplete(Result result) {
>         try {
>             byte[
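The quoted message is cut off above in the middle of the BufferingResponseListener example. For orientation, here is a minimal, self-contained sketch of the pattern it describes, written against Jetty 9's public client API; the class name, the {{sendBuffered}} method, and the parsing placeholder are illustrative, not the actual HttpShardHandler code:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Result;
import org.eclipse.jetty.client.util.BufferingResponseListener;

public class BufferedAsyncSketch {
  // Send a request asynchronously; Jetty buffers the whole response body and
  // invokes onComplete() on one of its own worker threads.
  static void sendBuffered(HttpClient client, String url) {
    client.newRequest(url).send(new BufferingResponseListener() {
      @Override
      public void onComplete(Result result) {
        if (result.isSucceeded()) {
          byte[] content = getContent(); // the fully buffered response body
          try (InputStream in = new ByteArrayInputStream(content)) {
            // Parse the shard response here. Keep this quick: it runs on a
            // Jetty worker thread, and blocking it stalls other responses.
          } catch (Exception e) {
            // handle parse failures
          }
        }
      }
    });
  }
}
{code}

The trade-off relative to Solution 1 is memory: the whole response is buffered before parsing begins, but no extra thread sits idle waiting on an InputStream.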
[GitHub] [lucene-solr] dsmiley closed pull request #1109: More pervasive use of PackageLoader / PluginInfo
dsmiley closed pull request #1109: URL: https://github.com/apache/lucene-solr/pull/1109
[GitHub] [lucene-solr] gus-asf commented on a change in pull request #1995: LUCENE-9575 Add PatternTypingFilter to annotate tokens with flags and types
gus-asf commented on a change in pull request #1995: URL: https://github.com/apache/lucene-solr/pull/1995#discussion_r510861146

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
## @@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.pattern;
+
+import java.io.IOException;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
+import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
+
+/**
+ * Set a type attribute to a parameterized value when tokens are matched by any of several regex patterns. The
+ * value set in the type attribute is parameterized with the match groups of the regex used for matching.
+ * In combination with TypeAsSynonymFilter and DropIfFlagged filter this can supply complex synonym patterns
+ * that are protected from subsequent analysis, and optionally drop the original term based on the flag
+ * set in this filter. See {@link PatternTypingFilterFactory} for full documentation.
+ *
+ * @since 8.8.0
+ * @see PatternTypingFilterFactory
+ */
+public class PatternTypingFilter extends TokenFilter {
+
+  private final Map<Pattern, String> patterns;
+  private final Map<Pattern, Integer> flags;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+  private final FlagsAttribute flagAtt = addAttribute(FlagsAttribute.class);
+  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
+
+  public PatternTypingFilter(TokenStream input, LinkedHashMap<Pattern, String> patterns, Map<Pattern, Integer> flags) {
+    super(input);
+    this.patterns = patterns;
+    this.flags = flags;
+  }
+
+  @Override
+  public final boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      if (hasAttribute(CharTermAttribute.class)) {
+        String termText = termAtt.toString();
+        for (Map.Entry<Pattern, String> patRep : patterns.entrySet()) {
+          Pattern pattern = patRep.getKey();
+          Matcher matcher = pattern.matcher(termText);
+          String replaced = matcher.replaceFirst(patRep.getValue());
+          // N.B. Does not support producing a synonym identical to the original term.
+          // Avoids having to match() then replace() which performs a second find().
+          if (!replaced.equals(termText)) {

Review comment:
@uschindler any further thoughts? If you agree with the above, I think all would be resolved.
[GitHub] [lucene-solr] shalinmangar commented on pull request #2004: SOLR-14942: Reduce leader election time on node shutdown
shalinmangar commented on pull request #2004: URL: https://github.com/apache/lucene-solr/pull/2004#issuecomment-715322784 @madrob would you like to make a final pass at this PR? I wish to merge to master and backport to 8x today.
[GitHub] [lucene-solr] uschindler commented on a change in pull request #1995: LUCENE-9575 Add PatternTypingFilter to annotate tokens with flags and types
uschindler commented on a change in pull request #1995: URL: https://github.com/apache/lucene-solr/pull/1995#discussion_r510864013

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
## @@ -0,0 +1,78 @@
+          if (!replaced.equals(termText)) {

Review comment:
I think that's fine.
[GitHub] [lucene-solr] uschindler commented on a change in pull request #1995: LUCENE-9575 Add PatternTypingFilter to annotate tokens with flags and types
uschindler commented on a change in pull request #1995: URL: https://github.com/apache/lucene-solr/pull/1995#discussion_r510869235

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
## @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.pattern;
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
+import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
+
+import java.io.IOException;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Set a type attribute to a parameterized value when tokens are matched by any of several regex patterns. The
+ * value set in the type attribute is parameterized with the match groups of the regex used for matching.
+ * In combination with TypeAsSynonymFilter and DropIfFlagged filter this can supply complex synonym patterns
+ * that are protected from subsequent analysis, and optionally drop the original term based on the flag
+ * set in this filter. See {@link PatternTypingFilterFactory} for full documentation.
+ *
+ * @see PatternTypingFilterFactory
+ * @since 8.8.0
+ */
+public class PatternTypingFilter extends TokenFilter {
+
+  private final Map<Pattern, Map.Entry<String, Integer>> replacementAndFlagByPattern;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+  private final FlagsAttribute flagAtt = addAttribute(FlagsAttribute.class);
+  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
+
+  public PatternTypingFilter(TokenStream input, LinkedHashMap<Pattern, Map.Entry<String, Integer>> replacementAndFlagByPattern) {

Review comment:
FYI, I was not aware that the Map of Patterns and Elements is exposed as public API.
[GitHub] [lucene-solr] uschindler commented on a change in pull request #1995: LUCENE-9575 Add PatternTypingFilter to annotate tokens with flags and types
uschindler commented on a change in pull request #1995: URL: https://github.com/apache/lucene-solr/pull/1995#discussion_r510870382

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
## @@ -0,0 +1,71 @@
+  public PatternTypingFilter(TokenStream input, LinkedHashMap<Pattern, Map.Entry<String, Integer>> replacementAndFlagByPattern) {

Review comment:
If we have a class `PatternTypingRule` we can make constructors to create those: `new PatternTypingRule(Pattern pattern, String replacement, int flag)` but also `new PatternTypingRule(String pattern, String replacement, int flag)`. I would strongly prefer to not misuse maps. We should also add a varargs constructor `PatternTypingFilter(TokenStream input, PatternTypingRule... rules)`
[GitHub] [lucene-solr] uschindler commented on a change in pull request #1995: LUCENE-9575 Add PatternTypingFilter to annotate tokens with flags and types
uschindler commented on a change in pull request #1995: URL: https://github.com/apache/lucene-solr/pull/1995#discussion_r510868759

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
## @@ -0,0 +1,71 @@
+  public PatternTypingFilter(TokenStream input, LinkedHashMap<Pattern, Map.Entry<String, Integer>> replacementAndFlagByPattern) {

Review comment:
Explicitly saying LinkedHashMap sounds strange. I know the entries must be ordered, so it's more a list. I would like it to use a record. This would be a classical example of the new Java 16 record types! This is a public API, any client code may call this - also non-Solr users. So maybe the constructor argument should be a `List<PatternTypingRule>` (a new class, which may actually be a Record in Java 16/17).
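To make the suggestion concrete, here is a minimal sketch of such a `PatternTypingRule` holder, using only the constructor shapes named in the review above; the accessor names are illustrative, and the class that eventually lands in Lucene may look different (e.g. a record on newer Java versions):

{code:java}
import java.util.regex.Pattern;

/** Sketch: one rule for PatternTypingFilter - a pattern, a replacement template, and a flags value. */
public class PatternTypingRule {
  private final Pattern pattern;
  private final String replacement;
  private final int flag;

  public PatternTypingRule(Pattern pattern, String replacement, int flag) {
    this.pattern = pattern;
    this.replacement = replacement;
    this.flag = flag;
  }

  public PatternTypingRule(String pattern, String replacement, int flag) {
    this(Pattern.compile(pattern), replacement, flag);
  }

  public Pattern getPattern() { return pattern; }
  public String getReplacement() { return replacement; }
  public int getFlag() { return flag; }
}
{code}

The filter constructor could then become the varargs form suggested above, `PatternTypingFilter(TokenStream input, PatternTypingRule... rules)`, instead of taking a LinkedHashMap.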
[GitHub] [lucene-solr] msokolov commented on a change in pull request #2018: LUCENE-9582: rename VectorValues.ScoreFunction to SearchStrategy
msokolov commented on a change in pull request #2018: URL: https://github.com/apache/lucene-solr/pull/2018#discussion_r510895391

## File path: lucene/core/src/java/org/apache/lucene/util/VectorUtil.java
## @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+/**
+ * Utilities for computations with numeric arrays
+ */
+public final class VectorUtil {

Review comment:
oh, good call. I will add
[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219733#comment-17219733 ] Shalin Shekhar Mangar commented on SOLR-14942: -- Thanks Hoss. I have updated the PR with code comments. Mike Drob also gave some feedback on the PR which has been incorporated as well. I intend to merge to master over the weekend.

> Reduce leader election time on node shutdown
> --------------------------------------------
>
>                 Key: SOLR-14942
>                 URL: https://issues.apache.org/jira/browse/SOLR-14942
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public(Default Security Level. Issues are Public)
>       Components: SolrCloud
> Affects Versions: 7.7.3, 8.6.3
>         Reporter: Shalin Shekhar Mangar
>         Assignee: Shalin Shekhar Mangar
>         Priority: Major
>       Time Spent: 2h
> Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before removing the election node of the leader. This can take some time due to index flush or merge activities on the leader cores and delays new leaders from being elected.
> This process happens in CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change the states of all cores on this node to DOWN. Assuming the current node hosts the leader of a shard, the shard becomes leaderless after calling this method, since the state of the leader is now DOWN. The leader election process is not triggered for the shard because the election node is still held by the current node.
> # Wait for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK, including the election nodes created by this node. From this point other replicas in the shard can take part in the leader election.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive the SIGTERM signal. On receiving SIGTERM, Jetty will also stop accepting new connections and new requests. This is a very important factor: even if the leader replica is ACTIVE and its node is in live_nodes, the shard is considered leaderless if no one can index to that shard. Therefore shards become leaderless as soon as the node (which contains the shard's leader) receives SIGTERM.
> Therefore the longer steps 1, 2 and 3 take to finish, the longer shards remain leaderless. The time needed for step 3 scales with the number of cores, so the more cores a node has, the worse. This time is spent in IndexWriter.close(), where the system will
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flushes or merges no longer block new leader elections.
[GitHub] [lucene-solr] epugh commented on pull request #2016: SOLR-14067 v2 Move Stateless Scripting Update Process to /contrib
epugh commented on pull request #2016: URL: https://github.com/apache/lucene-solr/pull/2016#issuecomment-715416443 I'm getting there, and I wanted to specifically mention that @chatman's other PR was critical for me in following the chain of changes required to move this to /contrib. Thanks @chatman for chasing down all the touchpoints.
[jira] [Created] (LUCENE-9583) How should we expose VectorValues.RandomAccess?
Michael Sokolov created LUCENE-9583: --- Summary: How should we expose VectorValues.RandomAccess? Key: LUCENE-9583 URL: https://issues.apache.org/jira/browse/LUCENE-9583 Project: Lucene - Core Issue Type: Improvement Reporter: Michael Sokolov In the newly-added VectorValues API, we have a RandomAccess sub-interface. [[~jtibshirani] pointed out this is not needed by some vector-indexing strategies which can operate solely using a forward-iterator (it is needed by HNSW), and so in the interest of simplifying the public API we should not expose this internal detail (which by the way surfaces internal ordinals that are somewhat uninteresting outside the random access API). I looked into how to move this inside the HNSW-specific code and remembered that we do also currently make use of the RA API when merging vector fields over sorted indexes. Without it, we would need to load all vectors into RAM while flushing/merging, as we currently do in BinaryDocValuesWriter.BinaryDVs. I wonder if it's worth paying this cost for the simpler API. Another thing I noticed while reviewing this is that I moved the KNN `search(float[] target, int topK, int fanout)` method from `VectorValues` to `VectorValues.RandomAccess`. This I think we could move back, and handle the HNSW requirements for search elsewhere. I wonder if that would alleviate the major concern here?
[jira] [Updated] (LUCENE-9583) How should we expose VectorValues.RandomAccess?
[ https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Sokolov updated LUCENE-9583: Description: In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} sub-interface. [~jtibshirani] pointed out this is not needed by some vector-indexing strategies which can operate solely using a forward-iterator (it is needed by HNSW), and so in the interest of simplifying the public API we should not expose this internal detail (which by the way surfaces internal ordinals that are somewhat uninteresting outside the random access API). I looked into how to move this inside the HNSW-specific code and remembered that we do also currently make use of the RA API when merging vector fields over sorted indexes. Without it, we would need to load all vectors into RAM while flushing/merging, as we currently do in {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost for the simpler API. Another thing I noticed while reviewing this is that I moved the KNN {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} to {{VectorValues.RandomAccess}}. This I think we could move back, and handle the HNSW requirements for search elsewhere. I wonder if that would alleviate the major concern here? was: In the newly-added VectorValues API, we have a RandomAccess sub-interface. [[~jtibshirani] pointed out this is not needed by some vector-indexing strategies which can operate solely using a forward-iterator (it is needed by HNSW), and so in the interest of simplifying the public API we should not expose this internal detail (which by the way surfaces internal ordinals that are somewhat uninteresting outside the random access API). I looked into how to move this inside the HNSW-specific code and remembered that we do also currently make use of the RA API when merging vector fields over sorted indexes. Without it, we would need to load all vectors into RAM while flushing/merging, as we currently do in BinaryDocValuesWriter.BinaryDVs. I wonder if it's worth paying this cost for the simpler API. Another thing I noticed while reviewing this is that I moved the KNN `search(float[] target, int topK, int fanout)` method from `VectorValues` to `VectorValues.RandomAccess`. This I think we could move back, and handle the HNSW requirements for search elsewhere. I wonder if that would alleviate the major concern here?

> How should we expose VectorValues.RandomAccess?
> -----------------------------------------------
>
>                 Key: LUCENE-9583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9583
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
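For readers without the code at hand, here is a rough sketch of the API split under discussion, assembled only from the description above; the `vectorValue(int ord)` accessor name and the TopDocs return type are assumptions for illustration, not the actual Lucene source:

{code:java}
import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TopDocs;

// Forward iteration is enough for some vector-indexing strategies...
public abstract class VectorValues extends DocIdSetIterator {
  /** Vector for the current document of the iterator. */
  public abstract float[] vectorValue() throws IOException;

  // ...while HNSW (and merging over sorted indexes) wants random access
  // by internal ordinal, which is what the description proposes hiding.
  public interface RandomAccess {
    /** Look up a vector by its internal ordinal. */
    float[] vectorValue(int ord) throws IOException;

    /** KNN search; the description suggests moving this back up to VectorValues. */
    TopDocs search(float[] target, int topK, int fanout) throws IOException;
  }
}
{code}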
[GitHub] [lucene-solr] madrob commented on a change in pull request #2020: SOLR-14949: Ability to customize Solr Docker build
madrob commented on a change in pull request #2020: URL: https://github.com/apache/lucene-solr/pull/2020#discussion_r510986901 ## File path: .github/workflows/docker-test.yml ## @@ -17,6 +17,10 @@ jobs: runs-on: ubuntu-latest +env: + DOCKER_SOLR_IMAGE_REPO: github-pr/solr Review comment: Does this mean we will publish all of our PR docker images somewhere? ## File path: help/docker.txt ## @@ -0,0 +1,53 @@ +Docker Images for Solr +== + +Solr docker images are built using Palantir's Docker Gradle plugin, https://github.com/palantir/gradle-docker. + +Common Inputs +- + +The docker image and it's tag can be customized via the following options, all accepted via both Environment Variables and Gradle Properties. + +Docker Image Repository: + Default: "apache/solr" + EnvVar: DOCKER_SOLR_IMAGE_REPO + Gradle Property: -Pdocker.solr.imageRepo Review comment: Should this be solr.docker instead of docker.solr? ## File path: solr/docker/tests/cases/version/test.sh ## @@ -1,45 +0,0 @@ -#!/bin/bash Review comment: Deleted because we can set the tag and version to be anything now? ## File path: help/docker.txt ## @@ -0,0 +1,53 @@ +Docker Images for Solr +== + +Solr docker images are built using Palantir's Docker Gradle plugin, https://github.com/palantir/gradle-docker. + +Common Inputs +- + +The docker image and it's tag can be customized via the following options, all accepted via both Environment Variables and Gradle Properties. Review comment: s/it's/its ## File path: help/docker.txt ## @@ -0,0 +1,53 @@ +Docker Images for Solr +== + +Solr docker images are built using Palantir's Docker Gradle plugin, https://github.com/palantir/gradle-docker. + +Common Inputs +- + +The docker image and it's tag can be customized via the following options, all accepted via both Environment Variables and Gradle Properties. + +Docker Image Repository: + Default: "apache/solr" + EnvVar: DOCKER_SOLR_IMAGE_REPO + Gradle Property: -Pdocker.solr.imageRepo + +Docker Image Tag: + Default: the Solr version, e.g. "9.0.0-SNAPSHOT" + EnvVar: DOCKER_SOLR_IMAGE_TAG + Gradle Property: -Pdocker.solr.imageTag + +Docker Image Name: (Use this to explicitly set a whole image name. If given, the image repo and image version options above are ignored.) + Default: {image_repo}/{image_tag} (both options provided above, with defaults) + EnvVar: DOCKER_SOLR_IMAGE_NAME + Gradle Property: -Pdocker.solr.imageName + +Building + + +In order to build the Solr Docker image, run: + +gradlew docker + +The docker build task (`gradlew docker`) accepts the following inputs, in addition to the common inputs listed above: Review comment: This is displayed as plain text, not markdown.
[GitHub] [lucene-solr] HoustonPutman commented on a change in pull request #2020: SOLR-14949: Ability to customize Solr Docker build
HoustonPutman commented on a change in pull request #2020: URL: https://github.com/apache/lucene-solr/pull/2020#discussion_r510994999 ## File path: solr/docker/tests/cases/version/test.sh ## @@ -1,45 +0,0 @@ -#!/bin/bash Review comment: Yeah, this is something that might be useful when we start creating release artifacts, but the test we want will probably look quite different. This is pretty shallow and doesn't actually provide a whole lot.
[GitHub] [lucene-solr] HoustonPutman commented on a change in pull request #2020: SOLR-14949: Ability to customize Solr Docker build
HoustonPutman commented on a change in pull request #2020: URL: https://github.com/apache/lucene-solr/pull/2020#discussion_r510995608 ## File path: .github/workflows/docker-test.yml ## @@ -17,6 +17,10 @@ jobs: runs-on: ubuntu-latest +env: + DOCKER_SOLR_IMAGE_REPO: github-pr/solr Review comment: nah this doesn't actually push. It's just testing that the custom repo/tag works. I'm pretty sure it would fail if it tried, because github-pr isn't a docker hub repo.
[GitHub] [lucene-solr] HoustonPutman commented on a change in pull request #2020: SOLR-14949: Ability to customize Solr Docker build
HoustonPutman commented on a change in pull request #2020: URL: https://github.com/apache/lucene-solr/pull/2020#discussion_r510997092 ## File path: help/docker.txt ## @@ -0,0 +1,53 @@ +Docker Images for Solr +== + +Solr docker images are built using Palantir's Docker Gradle plugin, https://github.com/palantir/gradle-docker. + +Common Inputs +- + +The docker image and it's tag can be customized via the following options, all accepted via both Environment Variables and Gradle Properties. + +Docker Image Repository: + Default: "apache/solr" + EnvVar: DOCKER_SOLR_IMAGE_REPO + Gradle Property: -Pdocker.solr.imageRepo Review comment: I could go either way. Say we add a prometheus exporter. Which would we prefer? - `docker.solr.imageRepo` and `docker.prometheusExporter.imageRepo` - `solr.docker.imageRepo` and `prometheusExporter.docker.imageRepo`
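For reference, with the inputs quoted from help/docker.txt above, a customized build could be invoked in either of these ways (the repo and tag values are made-up examples, not project defaults):

    gradlew docker -Pdocker.solr.imageRepo=myorg/solr -Pdocker.solr.imageTag=local-test

    DOCKER_SOLR_IMAGE_REPO=myorg/solr DOCKER_SOLR_IMAGE_TAG=local-test gradlew docker

Both forms come straight from the option names in the quoted help text; whichever prefix convention the naming discussion above settles on would change the Gradle property names accordingly.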
[jira] [Commented] (LUCENE-9581) Clarify discardCompoundToken behavior in the JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219808#comment-17219808 ] Kazuaki Hiraga commented on LUCENE-9581: Thank you for your input, [~jimczi]. My patch was just showing a super easy approach to fix the issue as a short-term solution (and I have tried to remember the discussion of the somewhat confusing options, which is a different story from this issue, though). The reason I modified the minimal length for the penalty is that it is similar to the way we can specify the behavior of MeCab's unknown-word processing for known words (in MeCab, not only Kanji characters but also others can be targeted, and this can be configured by a configuration file), and I think `>=` is better in some cases (but that can be another discussion). Anyway, this is a different story and I think your approach is appropriate for resolving the issue. So, I agree with your approach. {quote}I am also unsure that we should make discardCompoundToken true by default in Lucene 9 {quote} As we have discussed in LUCENE-9123, we want to change the default behavior of the current search mode so that the tokenization results will be the same as with `discardCompoundToken=true`. If I understand correctly, the result of the discussion is that 1) search mode will not return the compound tokens along with the decomposed tokens in Lucene 9 (the Tokenizer won't return the compound tokens unless `discardCompoundToken=false` is explicitly specified), and 2) we merge the normal mode and search mode to only return the decomposed tokens, and remove the mode and related parameters in Lucene 10(?). Any opinions / suggestions? > Clarify discardCompoundToken behavior in the JapaneseTokenizer > -- > > Key: LUCENE-9581 > URL: https://issues.apache.org/jira/browse/LUCENE-9581 > Project: Lucene - Core > Issue Type: Bug >Reporter: Jim Ferenczi >Priority: Minor > Attachments: LUCENE-9581.patch, LUCENE-9581.patch > > > At first sight, the discardCompoundToken option added in LUCENE-9123 seems > redundant with the NORMAL mode of the Japanese tokenizer. When set to true, > the current behavior is to disable the decomposition for compounds, that's > exactly what the NORMAL mode does. > So I wonder if the right semantic of the option would be to keep only the > decomposition of the compound or if it's really needed. If the goal is to > make the output compatible with a graph token filter, the current workaround > to set the mode to NORMAL should be enough. > That's consistent with the mode that should be used to preserve positions in > the index since we don't handle position length on the indexing side. > Am I missing something regarding the new option ? Is there a compelling case > where it differs from the NORMAL mode ?
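For readers following the thread, a small sketch of the configurations being compared; the four-argument constructor carrying discardCompoundToken is the one added in LUCENE-9123, and its exact signature is assumed here from that issue, so treat this as illustrative rather than authoritative:

{code:java}
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;

// NORMAL mode: compounds are never decomposed.
Tokenizer normal = new JapaneseTokenizer(null, true, Mode.NORMAL);

// SEARCH mode: decomposes compounds and also emits the original
// compound token alongside the decomposed tokens.
Tokenizer search = new JapaneseTokenizer(null, true, Mode.SEARCH);

// SEARCH mode with discardCompoundToken=true (assumed LUCENE-9123 signature):
// emits only the decomposed tokens, the behavior proposed as the
// Lucene 9 default in the comment above.
Tokenizer searchNoCompound = new JapaneseTokenizer(null, true, true, Mode.SEARCH);
{code}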
[GitHub] [lucene-solr] msokolov commented on pull request #1993: .gitignore clean up
msokolov commented on pull request #1993: URL: https://github.com/apache/lucene-solr/pull/1993#issuecomment-715478148 > @msokolov you added .#* -- what comment should I use in this file to explain what this is? It's more cruft emacs sometimes leaves behind. I think in this case it's an autosave backup file left behind if you exited while editing. On second thought, we probably don't need to list this here - it shouldn't arise as a normal thing. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] samuelgmartinez opened a new pull request #2021: SOLR-14844: upgrade jetty to 9.4.32.v20200930
samuelgmartinez opened a new pull request #2021: URL: https://github.com/apache/lucene-solr/pull/2021 # Description Upgrades Jetty to 9.4.32.v20200930 as described in the JIRA ticket. # Solution After the upgrade, the compression-related tests started to fail, so some of the broken unit tests were modified as well. The reasons behind the broken unit tests are described in the original ticket. Also, I created SOLR-14945 to track the need to improve HttpSolrClient compression handling to avoid problems like this in the future. # Tests BasicHttpSolrClientTest had to be modified in order to match Jetty's new behaviour for empty responses. # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [ ] I have developed this patch against the `master` branch. - [x] I have run `./gradlew check`. - [x] I have added tests for my changes. - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
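For background on the compression handling mentioned above: SolrJ's Apache HttpClient based client opts into response compression through its builder, which is the code path the modified tests exercise. A minimal sketch (the URL and collection are placeholders):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CompressionClientSketch {
  public static void main(String[] args) throws Exception {
    // allowCompression(true) makes the client send Accept-Encoding and
    // transparently decompress responses -- the behavior whose edge cases
    // (e.g. empty responses) changed with the Jetty upgrade.
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts")
                 .allowCompression(true)
                 .build()) {
      client.query(new SolrQuery("*:*"));
    }
  }
}
{code}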
[GitHub] [lucene-solr] epugh commented on pull request #2016: SOLR-14067 v2 Move Stateless Scripting Update Process to /contrib
epugh commented on pull request #2016: URL: https://github.com/apache/lucene-solr/pull/2016#issuecomment-715500720 Okay, this PR is kind of "ready". I've migrated the content out of the Cwiki page and into the ref guide. One thing I want to highlight: I'd like to be able to easily demonstrate the power of the ScriptingUpdateProcessor with the techproducts example; however, to do that, I had to set `enableStreamBody` to true to make it work. I hope the fact that it is enabled when you do `bin/solr start -e techproducts` is okay. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14939) JSON facets: range faceting to support cache=false parameter
[ https://issues.apache.org/jira/browse/SOLR-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219868#comment-17219868 ] Joel Bernstein commented on SOLR-14939: --- Interesting, I would not have suspected that range queries would have this effect. > JSON facets: range faceting to support cache=false parameter > > > Key: SOLR-14939 > URL: https://issues.apache.org/jira/browse/SOLR-14939 > Project: Solr > Issue Type: Bug >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > The {{cache}} parameter, if set to {{false}}, is intended to support > non-caching of the search results: > [https://lucene.apache.org/solr/guide/8_6/common-query-parameters.html#cache-parameter] > Based on inspection of > {code:java} > curl > "http://localhost:8983/solr/admin/metrics?prefix=CACHE.searcher.filterCache"; > {code} > metrics before and after a search we can see that range JSON facet queries > currently do not support a {{cache=false}} parameter. > Using the {{techproducts}} example collection as an illustration, if a 1 > MONTH {{gap}} value is used then 12 {{filterCache}} entries are added for a > one year {{start/end}} time range. > {code:java} > curl "http://localhost:8983/solr/techproducts/query?q=*:*&rows=0&cache=false"; > -d 'json.facet={ > manufacturedate_dt_ranges : { > type : range, > field : manufacturedate_dt, > mincount : 1, > gap : "%2B1MONTH", > start : "2005-01-01T00:00:00.000Z", > end : "2005-12-31T23:59:59.999Z", > } > }' > {code} > Similarly, if a 1 DAY {{gap}} value is used then 365 {{filterCache}} entries > are added for a one year {{start/end}} time range and if a 1 HOUR {{gap}} > value were to be used that would equate to 365 x 24 = 8,760 entries. This > means that a single search potentially displaces many or all existing > {{filterCache}} entries. > This ticket proposes to support the {{cache}} parameter for JSON range facet > queries: > * the current and default behaviour would remain {{cache=true}} and > * via {{cache=false}} users would be able run an 'uncommon' search with many > range buckets without impact on the 'common' searches with fewer range > buckets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-14939) JSON facets: range faceting to support cache=false parameter
[ https://issues.apache.org/jira/browse/SOLR-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219868#comment-17219868 ] Joel Bernstein edited comment on SOLR-14939 at 10/23/20, 6:44 PM: -- Interesting, I would not have suspected that range facets would have this effect. was (Author: joel.bernstein): Interesting, I would not have suspected that range queries would have this effect. > JSON facets: range faceting to support cache=false parameter > > > Key: SOLR-14939 > URL: https://issues.apache.org/jira/browse/SOLR-14939 > Project: Solr > Issue Type: Bug >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > The {{cache}} parameter, if set to {{false}}, is intended to support > non-caching of the search results: > [https://lucene.apache.org/solr/guide/8_6/common-query-parameters.html#cache-parameter] > Based on inspection of > {code:java} > curl > "http://localhost:8983/solr/admin/metrics?prefix=CACHE.searcher.filterCache"; > {code} > metrics before and after a search we can see that range JSON facet queries > currently do not support a {{cache=false}} parameter. > Using the {{techproducts}} example collection as an illustration, if a 1 > MONTH {{gap}} value is used then 12 {{filterCache}} entries are added for a > one year {{start/end}} time range. > {code:java} > curl "http://localhost:8983/solr/techproducts/query?q=*:*&rows=0&cache=false"; > -d 'json.facet={ > manufacturedate_dt_ranges : { > type : range, > field : manufacturedate_dt, > mincount : 1, > gap : "%2B1MONTH", > start : "2005-01-01T00:00:00.000Z", > end : "2005-12-31T23:59:59.999Z", > } > }' > {code} > Similarly, if a 1 DAY {{gap}} value is used then 365 {{filterCache}} entries > are added for a one year {{start/end}} time range and if a 1 HOUR {{gap}} > value were to be used that would equate to 365 x 24 = 8,760 entries. This > means that a single search potentially displaces many or all existing > {{filterCache}} entries. > This ticket proposes to support the {{cache}} parameter for JSON range facet > queries: > * the current and default behaviour would remain {{cache=true}} and > * via {{cache=false}} users would be able run an 'uncommon' search with many > range buckets without impact on the 'common' searches with fewer range > buckets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
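For illustration, a SolrJ equivalent of the curl example in the issue description, passing the proposed cache=false parameter alongside the JSON range facet (a sketch of the behavior this ticket proposes, not of released functionality):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class RangeFacetCacheSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);
      q.set("cache", "false"); // proposed: skip the filterCache for the range buckets
      q.set("json.facet",
          "{manufacturedate_dt_ranges:{type:range,field:manufacturedate_dt,"
              + "mincount:1,gap:'+1MONTH',"
              + "start:'2005-01-01T00:00:00.000Z',end:'2005-12-31T23:59:59.999Z'}}");
      System.out.println(client.query(q));
    }
  }
}
{code}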
[jira] [Commented] (SOLR-14067) Move StatelessScriptUpdateProcessor to a contrib
[ https://issues.apache.org/jira/browse/SOLR-14067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219869#comment-17219869 ] David Eric Pugh commented on SOLR-14067: [~tflobbe] do you have an example of where we have done the rename? I like that idea; that way I can go through and use the cleaned-up "ScriptingUpdateRequestProcessor" everywhere, and have it still be backwards compatible. [~dsmiley] I like the idea that this is only in 9, with no back-porting. Should we update the wiki page to say it is moved in 9, and not touched in 8.x? https://cwiki.apache.org/confluence/display/SOLR/Deprecations > Move StatelessScriptUpdateProcessor to a contrib > > > Key: SOLR-14067 > URL: https://issues.apache.org/jira/browse/SOLR-14067 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: David Eric Pugh >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > Move server-side scripting out of core and into a new contrib. This is > better for security. > Former description: > > We should eliminate all scripting capabilities within Solr. Let us start with > the StatelessScriptUpdateProcessor deprecation/removal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani commented on a change in pull request #2018: LUCENE-9582: rename VectorValues.ScoreFunction to SearchStrategy
jtibshirani commented on a change in pull request #2018: URL: https://github.com/apache/lucene-solr/pull/2018#discussion_r511087483 ## File path: lucene/queries/src/java/org/apache/lucene/queries/intervals/IntervalQuery.java ## @@ -99,7 +99,7 @@ public IntervalQuery(String field, IntervalsSource intervalsSource, float pivot, private IntervalQuery(String field, IntervalsSource intervalsSource, IntervalScoreFunction scoreFunction) { Objects.requireNonNull(field, "null field aren't accepted"); Objects.requireNonNull(intervalsSource, "null intervalsSource aren't accepted"); -Objects.requireNonNull(scoreFunction, "null scoreFunction aren't accepted"); +Objects.requireNonNull(scoreFunction, "null searchStrategy aren't accepted"); Review comment: Just noticed this rename seems accidental. ## File path: lucene/core/src/test/org/apache/lucene/util/TestVectorUtil.java ## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util; + +public class TestVectorUtil extends LuceneTestCase { + + public void testBasicDotProduct() { +assertEquals(5, VectorUtil.dotProduct(new float[]{1, 2, 3}, new float[]{-10, 0, 5}), 0); + } + + public void testSelfDotProduct() { +// the dot product of a vector with itself is equal to the sum of the squares of its components +float[] v = randomVector(); +assertEquals(l2(v), VectorUtil.dotProduct(v, v), 1e-5); + } + + public void testOrthogonalDotProduct() { +// the dot product of two perpendicular vectors is 0 +float[] v = new float[2]; +v[0] = random().nextInt(100); +v[1] = random().nextInt(100); +float[] u = new float[2]; +u[0] = v[1]; +u[1] = -v[0]; +assertEquals(0, VectorUtil.dotProduct(u, v), 1e-5); + } + + public void testSelfSquareSum() { +// the l2 distance of a vector with itself is zero +float[] v = randomVector(); +assertEquals(0, VectorUtil.squareSum(v, v), 1e-5); + } + + public void testBasicSquareSum() { +assertEquals(12, VectorUtil.squareSum(new float[]{1, 2, 3}, new float[]{-1, 0, 5}), 0); + } + + public void testRandomSquareSum() { +// the MSE of a vector with its inverse is equal to four times the sum of squares of its components Review comment: Small comment: I don't think you mean MSE here since it's not a mean, it could be 'squared distance' ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14844) Upgrade Jetty to 9.4.32.v20200930
[ https://issues.apache.org/jira/browse/SOLR-14844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219892#comment-17219892 ] Samuel García Martínez commented on SOLR-14844: --- branch_8x pull request: https://github.com/apache/lucene-solr/pull/2003 master pull request: https://github.com/apache/lucene-solr/pull/2021 The master pull request fails because, after upgrading Jetty, junit changes to 4.12 for some reason, so the checksum fails on precommit; master is currently using 4.13.1. I need some help creating the checksums for that dependency (and for others that may have changed as well). Also, I've opened SOLR-14945 to address the problems with the interceptors and to refactor the SolrJ client to avoid this kind of issue in the future (relying on HttpClient directly, instead of writing custom classes to handle compression and whatnot). > Upgrade Jetty to 9.4.32.v20200930 > - > > Key: SOLR-14844 > URL: https://issues.apache.org/jira/browse/SOLR-14844 > Project: Solr > Issue Type: Improvement >Affects Versions: 8.6 >Reporter: Cassandra Targett >Assignee: Erick Erickson >Priority: Major > Attachments: SOLR-14844-master.patch, SOLR-14884-8x.patch > > Time Spent: 20m > Remaining Estimate: 0h > > A CVE was found in Jetty 9.4.27-9.4.29 that has some security scanning tools > raising red flags > ([https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-17638]). > Here's the Jetty issue: > [https://bugs.eclipse.org/bugs/show_bug.cgi?id=564984]. It's fixed in > 9.4.30+, so we should upgrade to that for 8.7 > -It has a simple mitigation (raise Jetty's responseHeaderSize to higher than > requestHeaderSize), but I don't know how Solr uses Jetty well enough to a) > know if this problem is even exploitable in Solr, or b) if the workaround > suggested is even possible in Solr.- > In normal Solr installs, w/o jetty optimizations, this issue is largely > mitigated in 8.6.3: see SOLR-14896 (and linked bug fixes) for details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14844) Upgrade Jetty to 9.4.32.v20200930
[ https://issues.apache.org/jira/browse/SOLR-14844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219908#comment-17219908 ] Erick Erickson commented on SOLR-14844: --- I'll take a look sometime today; it can be a rat-hole to straighten out. Thanks! > Upgrade Jetty to 9.4.32.v20200930 > - > > Key: SOLR-14844 > URL: https://issues.apache.org/jira/browse/SOLR-14844 > Project: Solr > Issue Type: Improvement >Affects Versions: 8.6 >Reporter: Cassandra Targett >Assignee: Erick Erickson >Priority: Major > Attachments: SOLR-14844-master.patch, SOLR-14884-8x.patch > > Time Spent: 20m > Remaining Estimate: 0h > > A CVE was found in Jetty 9.4.27-9.4.29 that has some security scanning tools > raising red flags > ([https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-17638]). > Here's the Jetty issue: > [https://bugs.eclipse.org/bugs/show_bug.cgi?id=564984]. It's fixed in > 9.4.30+, so we should upgrade to that for 8.7 > -It has a simple mitigation (raise Jetty's responseHeaderSize to higher than > requestHeaderSize), but I don't know how Solr uses Jetty well enough to a) > know if this problem is even exploitable in Solr, or b) if the workaround > suggested is even possible in Solr.- > In normal Solr installs, w/o jetty optimizations, this issue is largely > mitigated in 8.6.3: see SOLR-14896 (and linked bug fixes) for details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14067) Move StatelessScriptUpdateProcessor to a contrib
[ https://issues.apache.org/jira/browse/SOLR-14067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219926#comment-17219926 ] David Smiley commented on SOLR-14067: - BTW, RE XSLT, it'd clearly be another issue. If the change is in 9.0 but not 8.x, no need to handle back-compat. Users already need to know to use the package. I don't see why this component needs to be mentioned at all on a page named "Deprecations". Moving within the project is not a deprecation. It does need to be mentioned on {{major-changes-in-solr-9.adoc}}. > Move StatelessScriptUpdateProcessor to a contrib > > > Key: SOLR-14067 > URL: https://issues.apache.org/jira/browse/SOLR-14067 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: David Eric Pugh >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > Move server-side scripting out of core and into a new contrib. This is > better for security. > Former description: > > We should eliminate all scripting capabilities within Solr. Let us start with > the StatelessScriptUpdateProcessor deprecation/removal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14413) allow timeAllowed and cursorMark parameters
[ https://issues.apache.org/jira/browse/SOLR-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219927#comment-17219927 ] Yevhen Tienkaiev commented on SOLR-14413: - Hello, is there any status update on this? This is pretty critical; could someone please push this forward? > allow timeAllowed and cursorMark parameters > --- > > Key: SOLR-14413 > URL: https://issues.apache.org/jira/browse/SOLR-14413 > Project: Solr > Issue Type: Improvement > Components: search >Reporter: John Gallagher >Priority: Minor > Attachments: SOLR-14413-bram.patch, SOLR-14413-jg-update1.patch, > SOLR-14413-jg-update2.patch, SOLR-14413.patch, > image-2020-08-18-16-56-41-736.png, image-2020-08-18-16-56-59-178.png, > image-2020-08-21-14-18-36-229.png, timeallowed_cursormarks_results.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ever since cursorMarks were introduced in SOLR-5463 in 2014, cursorMark and > timeAllowed parameters were not allowed in combination ("Can not search using > both cursorMark and timeAllowed") > , from [QueryComponent.java|#L359]]: > > {code:java} > > if (null != rb.getCursorMark() && 0 < timeAllowed) { > // fundamentally incompatible > throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Can not > search using both " + CursorMarkParams.CURSOR_MARK_PARAM + " and " + > CommonParams.TIME_ALLOWED); > } {code} > While theoretically impure to use them in combination, it is often desirable > to support cursormarks-style deep paging and attempt to protect Solr nodes > from runaway queries using timeAllowed, in the hopes that most of the time, > the query completes in the allotted time, and there is no conflict. > > However if the query takes too long, it may be preferable to end the query > and protect the Solr node and provide the user with a somewhat inaccurate > sorted list. As noted in SOLR-6930, SOLR-5986 and others, timeAllowed is > frequently used to prevent runaway load. In fact, cursorMark and > shards.tolerant are allowed in combination, so any argument in favor of > purity would be a bit muddied in my opinion. > > This was discussed once in the mailing list that I can find: > [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201506.mbox/%3c5591740b.4080...@elyograg.org%3E] > It did not look like there was strong support for preventing the combination. > > I have tested cursorMark and timeAllowed combination together, and even when > partial results are returned because the timeAllowed is exceeded, the > cursorMark response value is still valid and reasonable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
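For reference, a sketch of how a client would combine the two parameters once the guard is removed (as the attached patches propose). These are standard SolrJ APIs; the URL and collection are placeholders:

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorTimeAllowedSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setSort(SolrQuery.SortClause.asc("id")); // cursors require a total sort order
      q.set(CommonParams.TIME_ALLOWED, 1000);    // cap per-page search time (ms)
      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query(q);
        // With the patch, a page may be partial if timeAllowed was exceeded,
        // but the returned cursor remains valid for the next request.
        String next = rsp.getNextCursorMark();
        if (cursor.equals(next)) break; // no more results
        cursor = next;
      }
    }
  }
}
{code}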
[GitHub] [lucene-solr] dsmiley commented on pull request #1993: .gitignore clean up
dsmiley commented on pull request #1993: URL: https://github.com/apache/lucene-solr/pull/1993#issuecomment-715572444 Finally, I think it's ready: a much simpler file than before, and more organized. Outdated items from the 8x branch are gone. I'll merge Monday unless I get an approving review sooner. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] msokolov commented on pull request #2018: LUCENE-9582: rename VectorValues.ScoreFunction to SearchStrategy
msokolov commented on pull request #2018: URL: https://github.com/apache/lucene-solr/pull/2018#issuecomment-715591773 Thanks for the review! I'll fix these when merging. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] msokolov merged pull request #2018: LUCENE-9582: rename VectorValues.ScoreFunction to SearchStrategy
msokolov merged pull request #2018: URL: https://github.com/apache/lucene-solr/pull/2018 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9582) Rename VectorValues.ScoreFunction to SearchStrategy
[ https://issues.apache.org/jira/browse/LUCENE-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219948#comment-17219948 ] ASF subversion and git services commented on LUCENE-9582: - Commit 840a353bc7062c1ab8fc0ab7ebeaa68ccf97fac1 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=840a353 ] LUCENE-9582: rename VectorValues.ScoreFunction to SearchStrategy (#2018) Co-authored-by: Julie Tibshirani > Rename VectorValues.ScoreFunction to SearchStrategy > > > Key: LUCENE-9582 > URL: https://issues.apache.org/jira/browse/LUCENE-9582 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael Sokolov >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > This is an issue to apply some of the feedback from LUCENE-9322 that came > after it was pushed; we want to: > 1. rename VectorValues.ScoreFunction -> SearchStrategy (and all of the > references to that terminology), and make it a simple enum with no > implementation > 2. rename the strategies to indicate the ANN implementation that backs them, > so we can represent more than one such implementation/algorithm. > 3. Move scoring implementation to a utility class > I'll open a separate issue for exploring how to hide the > VectorValues.RandomAccess API, which is probably specific to HNSW > FYI [~jtibshirani] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] msokolov opened a new pull request #2022: LUCENE-9004: KNN vector search using NSW graphs
msokolov opened a new pull request #2022: URL: https://github.com/apache/lucene-solr/pull/2022 Phew, this has been a long time coming, but I think it is in good shape now. We started with a scratchy prototype about a year ago, then @mocobeta got it on a better footing by adding a new codec and also implemented the full hierarchical algorithm, making the graph search faithful to the published literature. Then we took a step back to add the underlying vector format as a separate patch, now landed. This patch builds on the new vector format, providing KNN search with NSW graphs. It's the simplest implementation I could tease out (single-layer graph, simple neighbor selection, no max fanout control), but I think it will be a good foundation. I've done some pretty extensive performance testing and hyperparameter exploration using the (included) KnnGraphTester with some proprietary data, and I get good results. I will follow up later with specifics, but single-threaded latencies of a few ms on my i7 laptop over a 1M x 256-dim dataset seem pretty good. Follow-ups will include repeatable benchmarks on public datasets. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
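For readers new to the approach: NSW search is a greedy best-first walk over a proximity graph, keeping a bounded set of best candidates found so far. A minimal, self-contained sketch of that traversal follows; it illustrates the general technique only, and all names here are invented rather than Lucene's actual API:

{code:java}
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Minimal sketch of greedy best-first search over a single-layer NSW graph. */
public class NswSearchSketch {

  static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  /** A graph node paired with its similarity score to the query. */
  static final class Candidate {
    final int node;
    final float score;
    Candidate(int node, float score) { this.node = node; this.score = score; }
  }

  /**
   * @param vectors   vectors[i] is the vector stored at graph node i
   * @param neighbors neighbors[i] lists the graph neighbors of node i
   * @param beamWidth how many best-so-far results to keep while searching
   * @return the nodes of the top beamWidth candidates (unordered)
   */
  static int[] search(float[] query, float[][] vectors, int[][] neighbors,
                      int entryPoint, int beamWidth) {
    // Candidates still to expand, best (highest similarity) first.
    PriorityQueue<Candidate> frontier =
        new PriorityQueue<>(Comparator.comparingDouble((Candidate c) -> -c.score));
    // Current best results, worst first, so eviction is cheap.
    PriorityQueue<Candidate> results =
        new PriorityQueue<>(Comparator.comparingDouble((Candidate c) -> c.score));
    Set<Integer> visited = new HashSet<>();

    Candidate entry = new Candidate(entryPoint, dotProduct(query, vectors[entryPoint]));
    frontier.add(entry);
    results.add(entry);
    visited.add(entryPoint);

    while (!frontier.isEmpty()) {
      Candidate c = frontier.poll();
      // Stop when the best unexpanded candidate cannot improve the results.
      if (results.size() >= beamWidth && c.score < results.peek().score) {
        break;
      }
      for (int n : neighbors[c.node]) {
        if (visited.add(n)) {
          Candidate cand = new Candidate(n, dotProduct(query, vectors[n]));
          if (results.size() < beamWidth || cand.score > results.peek().score) {
            frontier.add(cand);
            results.add(cand);
            if (results.size() > beamWidth) {
              results.poll(); // evict the current worst result
            }
          }
        }
      }
    }
    return results.stream().mapToInt(r -> r.node).toArray();
  }
}
{code}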
[GitHub] [lucene-solr] madrob commented on pull request #2004: SOLR-14942: Reduce leader election time on node shutdown
madrob commented on pull request #2004: URL: https://github.com/apache/lucene-solr/pull/2004#issuecomment-715622888 Please add a CHANGES entry, and credit @CaoManhDat as well, if this work was based on initial work done by him. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14067) Move StatelessScriptUpdateProcessor to a contrib
[ https://issues.apache.org/jira/browse/SOLR-14067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219962#comment-17219962 ] Tomas Eduardo Fernandez Lobbe commented on SOLR-14067: -- bq. Tomas Eduardo Fernandez Lobbe do you have an example of where we have done the rename? I like that idea, that way I can go through and use the cleaned up "ScriptingUpdateRequestProcessor" everywhere, and have it still be backwards compatible. I'm sure this has been done in other places, but see for example the SolrServer -> SolrClient rename: https://issues.apache.org/jira/browse/SOLR-6895 > Move StatelessScriptUpdateProcessor to a contrib > > > Key: SOLR-14067 > URL: https://issues.apache.org/jira/browse/SOLR-14067 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: David Eric Pugh >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > Move server-side scripting out of core and into a new contrib. This is > better for security. > Former description: > > We should eliminate all scripting capabilities within Solr. Let us start with > the StatelessScriptUpdateProcessor deprecation/removal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14067) Move StatelessScriptUpdateProcessor to a contrib
[ https://issues.apache.org/jira/browse/SOLR-14067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219964#comment-17219964 ] Tomas Eduardo Fernandez Lobbe commented on SOLR-14067: -- Here is another, more recent, example: https://github.com/apache/lucene-solr/commit/0836ea5/ > Move StatelessScriptUpdateProcessor to a contrib > > > Key: SOLR-14067 > URL: https://issues.apache.org/jira/browse/SOLR-14067 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: David Eric Pugh >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > Move server-side scripting out of core and into a new contrib. This is > better for security. > Former description: > > We should eliminate all scripting capabilities within Solr. Let us start with > the StatelessScriptUpdateProcessor deprecation/removal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
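Both of those examples follow the same pattern: move the implementation to the new name and keep the old name as a deprecated alias, so existing references keep working. A minimal sketch, using hypothetical class names based on the rename proposed above (two files shown in one block):

{code:java}
// File: ScriptingUpdateRequestProcessorFactory.java -- the new, preferred name.
public class ScriptingUpdateRequestProcessorFactory {
  // ... the real implementation lives here ...
}

// File: StatelessScriptUpdateProcessorFactory.java -- deprecated alias kept for
// back-compat; existing solrconfig.xml references continue to resolve and
// inherit all behavior from the new class.
@Deprecated
public class StatelessScriptUpdateProcessorFactory
    extends ScriptingUpdateRequestProcessorFactory {
}
{code}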
[jira] [Updated] (SOLR-14354) HttpShardHandler send requests in async
[ https://issues.apache.org/jira/browse/SOLR-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Thacker updated SOLR-14354: - Attachment: image-2020-10-23-16-45-21-789.png > HttpShardHandler send requests in async > --- > > Key: SOLR-14354 > URL: https://issues.apache.org/jira/browse/SOLR-14354 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Blocker > Fix For: master (9.0), 8.7 > > Attachments: image-2020-03-23-10-04-08-399.png, > image-2020-03-23-10-09-10-221.png, image-2020-03-23-10-12-00-661.png, > image-2020-10-23-16-45-20-034.png, image-2020-10-23-16-45-21-789.png, > image-2020-10-23-16-45-37-628.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > h2. 1. Current approach (problem) of Solr > Below is the diagram describe the model on how currently handling a request. > !image-2020-03-23-10-04-08-399.png! > The main-thread that handles the search requests, will submit n requests (n > equals to number of shards) to an executor. So each request will correspond > to a thread, after sending a request that thread basically do nothing just > waiting for response from other side. That thread will be swapped out and CPU > will try to handle another thread (this is called context switch, CPU will > save the context of the current thread and switch to another one). When some > data (not all) come back, that thread will be called to parsing these data, > then it will wait until more data come back. So there will be lots of context > switching in CPU. That is quite inefficient on using threads.Basically we > want less threads and most of them must busy all the time, because threads > are not free as well as context switching. That is the main idea behind > everything, like executor > h2. 2. Async call of Jetty HttpClient > Jetty HttpClient offers async API like this. > {code:java} > httpClient.newRequest("http://domain.com/path";) > // Add request hooks > .onRequestQueued(request -> { ... }) > .onRequestBegin(request -> { ... }) > // Add response hooks > .onResponseBegin(response -> { ... }) > .onResponseHeaders(response -> { ... }) > .onResponseContent((response, buffer) -> { ... }) > .send(result -> { ... }); {code} > Therefore after calling {{send()}} the thread will return immediately without > any block. Then when the client received the header from other side, it will > call {{onHeaders()}} listeners. When the client received some {{byte[]}} (not > all response) from the data it will call {{onContent(buffer)}} listeners. > When everything finished it will call {{onComplete}} listeners. One main > thing that will must notice here is all listeners should finish quick, if the > listener block, all further data of that request won’t be handled until the > listener finish. > h2. 3. Solution 1: Sending requests async but spin one thread per response > Jetty HttpClient already provides several listeners, one of them is > InputStreamResponseListener. 
This is how it is get used > {code:java} > InputStreamResponseListener listener = new InputStreamResponseListener(); > client.newRequest(...).send(listener); > // Wait for the response headers to arrive > Response response = listener.get(5, TimeUnit.SECONDS); > if (response.getStatus() == 200) { > // Obtain the input stream on the response content > try (InputStream input = listener.getInputStream()) { > // Read the response content > } > } {code} > In this case, there will be 2 thread > * one thread trying to read the response content from InputStream > * one thread (this is a short-live task) feeding content to above > InputStream whenever some byte[] is available. Note that if this thread > unable to feed data into InputStream, this thread will wait. > By using this one, the model of HttpShardHandler can be written into > something like this > {code:java} > handler.sendReq(req, (is) -> { > executor.submit(() -> > try (is) { > // Read the content from InputStream > } > ) > }) {code} > The first diagram will be changed into this > !image-2020-03-23-10-09-10-221.png! > Notice that although “sending req to shard1” is wide, it won’t take long time > since sending req is a very quick operation. With this operation, handling > threads won’t be spin up until first bytes are sent back. Notice that in this > approach we still have active threads waiting for more data from InputStream > h2. 4. Solution 2: Buffering data and handle it inside jetty’s thread. > Jetty have another listener called BufferingResponseListener. This is how it > is get used > {code:java} > client.newRequest(...).send(new BufferingResponseListener() { > public void onComplete(Result result) { >
[jira] [Updated] (SOLR-14354) HttpShardHandler send requests in async
[ https://issues.apache.org/jira/browse/SOLR-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Thacker updated SOLR-14354: - Attachment: image-2020-10-23-16-45-37-628.png > HttpShardHandler send requests in async > --- > > Key: SOLR-14354 > URL: https://issues.apache.org/jira/browse/SOLR-14354 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Blocker > Fix For: master (9.0), 8.7 > > Attachments: image-2020-03-23-10-04-08-399.png, > image-2020-03-23-10-09-10-221.png, image-2020-03-23-10-12-00-661.png, > image-2020-10-23-16-45-20-034.png, image-2020-10-23-16-45-21-789.png, > image-2020-10-23-16-45-37-628.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > h2. 1. Current approach (problem) of Solr > Below is the diagram describe the model on how currently handling a request. > !image-2020-03-23-10-04-08-399.png! > The main-thread that handles the search requests, will submit n requests (n > equals to number of shards) to an executor. So each request will correspond > to a thread, after sending a request that thread basically do nothing just > waiting for response from other side. That thread will be swapped out and CPU > will try to handle another thread (this is called context switch, CPU will > save the context of the current thread and switch to another one). When some > data (not all) come back, that thread will be called to parsing these data, > then it will wait until more data come back. So there will be lots of context > switching in CPU. That is quite inefficient on using threads.Basically we > want less threads and most of them must busy all the time, because threads > are not free as well as context switching. That is the main idea behind > everything, like executor > h2. 2. Async call of Jetty HttpClient > Jetty HttpClient offers async API like this. > {code:java} > httpClient.newRequest("http://domain.com/path";) > // Add request hooks > .onRequestQueued(request -> { ... }) > .onRequestBegin(request -> { ... }) > // Add response hooks > .onResponseBegin(response -> { ... }) > .onResponseHeaders(response -> { ... }) > .onResponseContent((response, buffer) -> { ... }) > .send(result -> { ... }); {code} > Therefore after calling {{send()}} the thread will return immediately without > any block. Then when the client received the header from other side, it will > call {{onHeaders()}} listeners. When the client received some {{byte[]}} (not > all response) from the data it will call {{onContent(buffer)}} listeners. > When everything finished it will call {{onComplete}} listeners. One main > thing that will must notice here is all listeners should finish quick, if the > listener block, all further data of that request won’t be handled until the > listener finish. > h2. 3. Solution 1: Sending requests async but spin one thread per response > Jetty HttpClient already provides several listeners, one of them is > InputStreamResponseListener. 
This is how it is get used > {code:java} > InputStreamResponseListener listener = new InputStreamResponseListener(); > client.newRequest(...).send(listener); > // Wait for the response headers to arrive > Response response = listener.get(5, TimeUnit.SECONDS); > if (response.getStatus() == 200) { > // Obtain the input stream on the response content > try (InputStream input = listener.getInputStream()) { > // Read the response content > } > } {code} > In this case, there will be 2 thread > * one thread trying to read the response content from InputStream > * one thread (this is a short-live task) feeding content to above > InputStream whenever some byte[] is available. Note that if this thread > unable to feed data into InputStream, this thread will wait. > By using this one, the model of HttpShardHandler can be written into > something like this > {code:java} > handler.sendReq(req, (is) -> { > executor.submit(() -> > try (is) { > // Read the content from InputStream > } > ) > }) {code} > The first diagram will be changed into this > !image-2020-03-23-10-09-10-221.png! > Notice that although “sending req to shard1” is wide, it won’t take long time > since sending req is a very quick operation. With this operation, handling > threads won’t be spin up until first bytes are sent back. Notice that in this > approach we still have active threads waiting for more data from InputStream > h2. 4. Solution 2: Buffering data and handle it inside jetty’s thread. > Jetty have another listener called BufferingResponseListener. This is how it > is get used > {code:java} > client.newRequest(...).send(new BufferingResponseListener() { > public void onComplete(Result result) { >
[jira] [Updated] (SOLR-14354) HttpShardHandler send requests in async
[ https://issues.apache.org/jira/browse/SOLR-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Thacker updated SOLR-14354: - Attachment: image-2020-10-23-16-45-20-034.png > HttpShardHandler send requests in async > --- > > Key: SOLR-14354 > URL: https://issues.apache.org/jira/browse/SOLR-14354 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Blocker > Fix For: master (9.0), 8.7 > > Attachments: image-2020-03-23-10-04-08-399.png, > image-2020-03-23-10-09-10-221.png, image-2020-03-23-10-12-00-661.png, > image-2020-10-23-16-45-20-034.png, image-2020-10-23-16-45-21-789.png, > image-2020-10-23-16-45-37-628.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > h2. 1. Current approach (problem) of Solr > Below is the diagram describe the model on how currently handling a request. > !image-2020-03-23-10-04-08-399.png! > The main-thread that handles the search requests, will submit n requests (n > equals to number of shards) to an executor. So each request will correspond > to a thread, after sending a request that thread basically do nothing just > waiting for response from other side. That thread will be swapped out and CPU > will try to handle another thread (this is called context switch, CPU will > save the context of the current thread and switch to another one). When some > data (not all) come back, that thread will be called to parsing these data, > then it will wait until more data come back. So there will be lots of context > switching in CPU. That is quite inefficient on using threads.Basically we > want less threads and most of them must busy all the time, because threads > are not free as well as context switching. That is the main idea behind > everything, like executor > h2. 2. Async call of Jetty HttpClient > Jetty HttpClient offers async API like this. > {code:java} > httpClient.newRequest("http://domain.com/path";) > // Add request hooks > .onRequestQueued(request -> { ... }) > .onRequestBegin(request -> { ... }) > // Add response hooks > .onResponseBegin(response -> { ... }) > .onResponseHeaders(response -> { ... }) > .onResponseContent((response, buffer) -> { ... }) > .send(result -> { ... }); {code} > Therefore after calling {{send()}} the thread will return immediately without > any block. Then when the client received the header from other side, it will > call {{onHeaders()}} listeners. When the client received some {{byte[]}} (not > all response) from the data it will call {{onContent(buffer)}} listeners. > When everything finished it will call {{onComplete}} listeners. One main > thing that will must notice here is all listeners should finish quick, if the > listener block, all further data of that request won’t be handled until the > listener finish. > h2. 3. Solution 1: Sending requests async but spin one thread per response > Jetty HttpClient already provides several listeners, one of them is > InputStreamResponseListener. 
This is how it is get used > {code:java} > InputStreamResponseListener listener = new InputStreamResponseListener(); > client.newRequest(...).send(listener); > // Wait for the response headers to arrive > Response response = listener.get(5, TimeUnit.SECONDS); > if (response.getStatus() == 200) { > // Obtain the input stream on the response content > try (InputStream input = listener.getInputStream()) { > // Read the response content > } > } {code} > In this case, there will be 2 thread > * one thread trying to read the response content from InputStream > * one thread (this is a short-live task) feeding content to above > InputStream whenever some byte[] is available. Note that if this thread > unable to feed data into InputStream, this thread will wait. > By using this one, the model of HttpShardHandler can be written into > something like this > {code:java} > handler.sendReq(req, (is) -> { > executor.submit(() -> > try (is) { > // Read the content from InputStream > } > ) > }) {code} > The first diagram will be changed into this > !image-2020-03-23-10-09-10-221.png! > Notice that although “sending req to shard1” is wide, it won’t take long time > since sending req is a very quick operation. With this operation, handling > threads won’t be spin up until first bytes are sent back. Notice that in this > approach we still have active threads waiting for more data from InputStream > h2. 4. Solution 2: Buffering data and handle it inside jetty’s thread. > Jetty have another listener called BufferingResponseListener. This is how it > is get used > {code:java} > client.newRequest(...).send(new BufferingResponseListener() { > public void onComplete(Result result) { >
[jira] [Commented] (SOLR-14354) HttpShardHandler send requests in async
[ https://issues.apache.org/jira/browse/SOLR-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219975#comment-17219975 ] Varun Thacker commented on SOLR-14354: -- Whenever we've taken flamegraphs ( [https://github.com/jvm-profiling-tools/async-profiler] with -e wall ) of production Solr clusters, HttpShardHandler has taken a significant amount of wall-clock time !image-2020-10-23-16-45-37-628.png! I would be really curious to find out the performance implications of this change on the cluster. Perhaps in a month's timeframe I can try to apply the patch on top of 8.7 ( we'll first have to upgrade to 8.7 ) and then report back with some real numbers. > HttpShardHandler send requests in async > --- > > Key: SOLR-14354 > URL: https://issues.apache.org/jira/browse/SOLR-14354 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Blocker > Fix For: master (9.0), 8.7 > > Attachments: image-2020-03-23-10-04-08-399.png, > image-2020-03-23-10-09-10-221.png, image-2020-03-23-10-12-00-661.png, > image-2020-10-23-16-45-20-034.png, image-2020-10-23-16-45-21-789.png, > image-2020-10-23-16-45-37-628.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > h2. 1. Current approach (problem) of Solr > Below is the diagram describe the model on how currently handling a request. > !image-2020-03-23-10-04-08-399.png! > The main-thread that handles the search requests, will submit n requests (n > equals to number of shards) to an executor. So each request will correspond > to a thread, after sending a request that thread basically do nothing just > waiting for response from other side. That thread will be swapped out and CPU > will try to handle another thread (this is called context switch, CPU will > save the context of the current thread and switch to another one). When some > data (not all) come back, that thread will be called to parsing these data, > then it will wait until more data come back. So there will be lots of context > switching in CPU. That is quite inefficient on using threads.Basically we > want less threads and most of them must busy all the time, because threads > are not free as well as context switching. That is the main idea behind > everything, like executor > h2. 2. Async call of Jetty HttpClient > Jetty HttpClient offers async API like this. > {code:java} > httpClient.newRequest("http://domain.com/path";) > // Add request hooks > .onRequestQueued(request -> { ... }) > .onRequestBegin(request -> { ... }) > // Add response hooks > .onResponseBegin(response -> { ... }) > .onResponseHeaders(response -> { ... }) > .onResponseContent((response, buffer) -> { ... }) > .send(result -> { ... }); {code} > Therefore after calling {{send()}} the thread will return immediately without > any block. Then when the client received the header from other side, it will > call {{onHeaders()}} listeners. When the client received some {{byte[]}} (not > all response) from the data it will call {{onContent(buffer)}} listeners. > When everything finished it will call {{onComplete}} listeners. One main > thing that will must notice here is all listeners should finish quick, if the > listener block, all further data of that request won’t be handled until the > listener finish. > h2. 3. Solution 1: Sending requests async but spin one thread per response > Jetty HttpClient already provides several listeners, one of them is > InputStreamResponseListener.
This is how it is get used > {code:java} > InputStreamResponseListener listener = new InputStreamResponseListener(); > client.newRequest(...).send(listener); > // Wait for the response headers to arrive > Response response = listener.get(5, TimeUnit.SECONDS); > if (response.getStatus() == 200) { > // Obtain the input stream on the response content > try (InputStream input = listener.getInputStream()) { > // Read the response content > } > } {code} > In this case, there will be 2 thread > * one thread trying to read the response content from InputStream > * one thread (this is a short-live task) feeding content to above > InputStream whenever some byte[] is available. Note that if this thread > unable to feed data into InputStream, this thread will wait. > By using this one, the model of HttpShardHandler can be written into > something like this > {code:java} > handler.sendReq(req, (is) -> { > executor.submit(() -> > try (is) { > // Read the content from InputStream > } > ) > }) {code} > The first diagram will be changed into this > !image-2020-03-23-10-09-10-221.png! > Notice that although “sending req to shard1” is wide, it won’t take long time > since sending
[GitHub] [lucene-solr] muse-dev[bot] commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs
muse-dev[bot] commented on a change in pull request #2022: URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r511202225 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java ## @@ -165,42 +191,88 @@ public VectorValues getVectorValues(String field) throws IOException { return new OffHeapVectorValues(fieldEntry, bytesSlice); } + // exposed for testing + public KnnGraphValues getGraphValues(String field) throws IOException { +FieldInfo info = fieldInfos.fieldInfo(field); +if (info == null) { + throw new IllegalArgumentException("No such field '" + field + "'"); +} +FieldEntry entry = fields.get(field); +if (entry != null && entry.indexDataLength > 0) { + return getGraphValues(entry); +} else { + return KnnGraphValues.EMPTY; +} + } + + private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException { +if (isHnswStrategy(entry.searchStrategy)) { + HnswGraphFieldEntry graphEntry = (HnswGraphFieldEntry) entry; + IndexInput bytesSlice = vectorIndex.slice("graph-data", entry.indexDataOffset, entry.indexDataLength); + return new IndexedKnnGraphReader(graphEntry, bytesSlice); +} else { + return KnnGraphValues.EMPTY; +} + } + @Override public void close() throws IOException { -vectorData.close(); +IOUtils.close(vectorData, vectorIndex); } private static class FieldEntry { final int dimension; final VectorValues.SearchStrategy searchStrategy; -final int maxDoc; final long vectorDataOffset; final long vectorDataLength; +final long indexDataOffset; +final long indexDataLength; final int[] ordToDoc; -FieldEntry(int dimension, VectorValues.SearchStrategy searchStrategy, int maxDoc, - long vectorDataOffset, long vectorDataLength, int[] ordToDoc) { - this.dimension = dimension; +FieldEntry(DataInput input, VectorValues.SearchStrategy searchStrategy) throws IOException { this.searchStrategy = searchStrategy; - this.maxDoc = maxDoc; - this.vectorDataOffset = vectorDataOffset; - this.vectorDataLength = vectorDataLength; - this.ordToDoc = ordToDoc; + vectorDataOffset = input.readVLong(); + vectorDataLength = input.readVLong(); + indexDataOffset = input.readVLong(); + indexDataLength = input.readVLong(); + dimension = input.readInt(); + int size = input.readInt(); + ordToDoc = new int[size]; + for (int i = 0; i < size; i++) { +int doc = input.readVInt(); +ordToDoc[i] = doc; + } } int size() { return ordToDoc.length; } } + private static class HnswGraphFieldEntry extends FieldEntry { + +final long[] ordOffsets; + +HnswGraphFieldEntry(DataInput input, VectorValues.SearchStrategy searchStrategy) throws IOException { + super(input, searchStrategy); + ordOffsets = new long[size()]; + long offset = 0; + for (int i = 0; i < ordOffsets.length; i++) { +offset += input.readVLong(); +ordOffsets[i] = offset; + } +} + } + /** Read the vector values from the index input. This supports both iterated and random access. */ - private final static class OffHeapVectorValues extends VectorValues { + private final class OffHeapVectorValues extends VectorValues { final FieldEntry fieldEntry; final IndexInput dataIn; +final Random random = new Random(); Review comment: *PREDICTABLE_RANDOM:* This random generator (java.util.Random) is predictable [(details)](https://find-sec-bugs.github.io/bugs.htm#PREDICTABLE_RANDOM) ## File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java ## @@ -0,0 +1,186 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.hnsw; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Comparator; +import java.util.List; +import java.util.Random; + +import org.apache.lucene.index.KnnGraphValues; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.util.BytesRef; + +/** + * Builder for HNSW
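Context for the PREDICTABLE_RANDOM finding above: java.util.Random produces a fully predictable sequence from its seed, which only matters when the values must be unguessable, and the remedy such tools suggest is java.security.SecureRandom. A minimal sketch of the distinction (illustrative only; whether the flagged Lucene code is security-sensitive at all is a separate question):

{code:java}
import java.security.SecureRandom;
import java.util.Random;

public class RandomSourceSketch {
  public static void main(String[] args) {
    // Predictable: the same seed always yields the same sequence. Fine for
    // reproducible tests or randomized algorithms, wrong for secrets.
    Random predictable = new Random(42);
    System.out.println(predictable.nextInt());

    // Unpredictable: seeded from an OS entropy source. Use for anything an
    // attacker must not be able to guess (tokens, keys, salts).
    Random secure = new SecureRandom();
    System.out.println(secure.nextInt());
  }
}
{code}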
[jira] [Commented] (SOLR-14354) HttpShardHandler send requests in async
[ https://issues.apache.org/jira/browse/SOLR-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219994#comment-17219994 ] Mark Robert Miller commented on SOLR-14354: --- It's likely a decent improvement; the issue is really just throwing in the change without reasonable verification of what it means in practice (e.g. the http2 client is not very good at connection reuse right now; a large number of shards at a high request rate should be a big beneficiary, but what are the implications for a few shards at a reasonable request rate?). I'm pretty sold on the idea that it's a good move with lots of benefits, but a change of this kind requires some pretty rigorous testing. I have the benchmarks to check things in a pretty comprehensive way; eventually I'll port some of them to master and can check this - I was planning on that when I first commented here - but then I realized it needed further work on the http2 implementation and configuration at a minimum, and I was not going to do that work on master. The results could still turn out even better, but it would also push a path I wouldn't agree is ready, so not so valuable. > HttpShardHandler send requests in async > --- > > Key: SOLR-14354 > URL: https://issues.apache.org/jira/browse/SOLR-14354 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Blocker > Fix For: master (9.0), 8.7 > > Attachments: image-2020-03-23-10-04-08-399.png, > image-2020-03-23-10-09-10-221.png, image-2020-03-23-10-12-00-661.png, > image-2020-10-23-16-45-20-034.png, image-2020-10-23-16-45-21-789.png, > image-2020-10-23-16-45-37-628.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > h2. 1. Current approach (problem) of Solr > Below is the diagram describe the model on how currently handling a request. > !image-2020-03-23-10-04-08-399.png! > The main-thread that handles the search requests, will submit n requests (n > equals to number of shards) to an executor. So each request will correspond > to a thread, after sending a request that thread basically do nothing just > waiting for response from other side. That thread will be swapped out and CPU > will try to handle another thread (this is called context switch, CPU will > save the context of the current thread and switch to another one). When some > data (not all) come back, that thread will be called to parsing these data, > then it will wait until more data come back. So there will be lots of context > switching in CPU. That is quite inefficient on using threads.Basically we > want less threads and most of them must busy all the time, because threads > are not free as well as context switching. That is the main idea behind > everything, like executor > h2. 2. Async call of Jetty HttpClient > Jetty HttpClient offers async API like this. > {code:java} > httpClient.newRequest("http://domain.com/path";) > // Add request hooks > .onRequestQueued(request -> { ... }) > .onRequestBegin(request -> { ... }) > // Add response hooks > .onResponseBegin(response -> { ... }) > .onResponseHeaders(response -> { ... }) > .onResponseContent((response, buffer) -> { ... }) > .send(result -> { ... }); {code} > Therefore after calling {{send()}} the thread will return immediately without > any block. Then when the client received the header from other side, it will > call {{onHeaders()}} listeners. When the client received some {{byte[]}} (not > all response) from the data it will call {{onContent(buffer)}} listeners. > When everything finished it will call {{onComplete}} listeners.
One main thing to notice here is that all listeners must finish quickly: if a listener blocks, no further data for that request will be handled until the listener finishes. > h2. 3. Solution 1: Sending requests async but spinning one thread per response > Jetty HttpClient already provides several listeners; one of them is > InputStreamResponseListener. This is how it is used: > {code:java} > InputStreamResponseListener listener = new InputStreamResponseListener(); > client.newRequest(...).send(listener); > // Wait for the response headers to arrive > Response response = listener.get(5, TimeUnit.SECONDS); > if (response.getStatus() == 200) { > // Obtain the input stream on the response content > try (InputStream input = listener.getInputStream()) { > // Read the response content > } > } {code} > In this case there will be two threads: > * one thread reading the response content from the InputStream > * one thread (a short-lived task) feeding content into the above > InputStream whenever some byte[] is available. Note that if this thread is > unable to feed data into the InputStream, it will wait.
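For illustration only, here is a minimal sketch of the fully async fan-out that section 2 argues for, written against Jetty 9's listener API. The {{AsyncShardFanout}} class and the {{parseAndMerge}} hook are hypothetical placeholders, not code from the SOLR-14354 patch:

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Result;
import org.eclipse.jetty.client.util.BufferingResponseListener;

public class AsyncShardFanout {
  // Fans out one async request per shard; no thread is parked per request.
  public static void fanout(HttpClient httpClient, List<String> shardUrls)
      throws InterruptedException {
    CountDownLatch done = new CountDownLatch(shardUrls.size());
    for (String shardUrl : shardUrls) {
      httpClient.newRequest(shardUrl).send(new BufferingResponseListener() {
        @Override
        public void onComplete(Result result) {
          try {
            if (result.isSucceeded()) {
              byte[] body = getContent(); // whole shard response, buffered
              // parseAndMerge(body);     // hypothetical hook: must be quick,
              //                          // or hand off to a small executor
            }
            // else: record the shard failure (e.g. for shards.tolerant)
          } finally {
            done.countDown();
          }
        }
      });
    }
    done.await(); // the caller blocks once for the whole fan-out
  }
}
{code}

The point matches the description above: the calling thread blocks once on the latch for the entire fan-out instead of parking one thread per in-flight shard request, and each {{onComplete}} must stay quick (or hand off to an executor) so it does not stall the client's I/O threads.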
[jira] [Updated] (SOLR-14413) allow timeAllowed and cursorMark parameters
[ https://issues.apache.org/jira/browse/SOLR-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Gallagher updated SOLR-14413: -- Attachment: Screen Shot 2020-10-23 at 10.08.26 PM.png > allow timeAllowed and cursorMark parameters > --- > > Key: SOLR-14413 > URL: https://issues.apache.org/jira/browse/SOLR-14413 > Project: Solr > Issue Type: Improvement > Components: search >Reporter: John Gallagher >Priority: Minor > Attachments: SOLR-14413-bram.patch, SOLR-14413-jg-update1.patch, > SOLR-14413-jg-update2.patch, SOLR-14413.patch, Screen Shot 2020-10-23 at > 10.08.26 PM.png, image-2020-08-18-16-56-41-736.png, > image-2020-08-18-16-56-59-178.png, image-2020-08-21-14-18-36-229.png, > timeallowed_cursormarks_results.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ever since cursorMarks were introduced in SOLR-5463 in 2014, the cursorMark and > timeAllowed parameters have not been allowed in combination ("Can not search using > both cursorMark and timeAllowed"), from [QueryComponent.java|#L359]: > > {code:java} > > if (null != rb.getCursorMark() && 0 < timeAllowed) { > // fundamentally incompatible > throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Can not > search using both " + CursorMarkParams.CURSOR_MARK_PARAM + " and " + > CommonParams.TIME_ALLOWED); > } {code} > While theoretically impure to use them in combination, it is often desirable > to support cursorMark-style deep paging while attempting to protect Solr nodes > from runaway queries using timeAllowed, in the hope that most of the time > the query completes in the allotted time and there is no conflict. > > However, if the query takes too long, it may be preferable to end the query, > protect the Solr node, and provide the user with a somewhat inaccurate > sorted list. As noted in SOLR-6930, SOLR-5986 and others, timeAllowed is > frequently used to prevent runaway load. In fact, cursorMark and > shards.tolerant are allowed in combination, so any argument in favor of > purity would be a bit muddied, in my opinion. > > This was discussed on the mailing list once, as far as I can find: > [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201506.mbox/%3c5591740b.4080...@elyograg.org%3E] > It did not look like there was strong support for preventing the combination. > > I have tested the cursorMark and timeAllowed combination, and even when > partial results are returned because timeAllowed is exceeded, the > cursorMark response value is still valid and reasonable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14413) allow timeAllowed and cursorMark parameters
[ https://issues.apache.org/jira/browse/SOLR-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Gallagher updated SOLR-14413: -- Attachment: Screen Shot 2020-10-23 at 10.09.11 PM.png > allow timeAllowed and cursorMark parameters > --- > > Key: SOLR-14413 > URL: https://issues.apache.org/jira/browse/SOLR-14413 > Project: Solr > Issue Type: Improvement > Components: search >Reporter: John Gallagher >Priority: Minor > Attachments: SOLR-14413-bram.patch, SOLR-14413-jg-update1.patch, > SOLR-14413-jg-update2.patch, SOLR-14413.patch, Screen Shot 2020-10-23 at > 10.08.26 PM.png, Screen Shot 2020-10-23 at 10.09.11 PM.png, > image-2020-08-18-16-56-41-736.png, image-2020-08-18-16-56-59-178.png, > image-2020-08-21-14-18-36-229.png, timeallowed_cursormarks_results.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ever since cursorMarks were introduced in SOLR-5463 in 2014, the cursorMark and > timeAllowed parameters have not been allowed in combination ("Can not search using > both cursorMark and timeAllowed"), from [QueryComponent.java|#L359]: > > {code:java} > > if (null != rb.getCursorMark() && 0 < timeAllowed) { > // fundamentally incompatible > throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Can not > search using both " + CursorMarkParams.CURSOR_MARK_PARAM + " and " + > CommonParams.TIME_ALLOWED); > } {code} > While theoretically impure to use them in combination, it is often desirable > to support cursorMark-style deep paging while attempting to protect Solr nodes > from runaway queries using timeAllowed, in the hope that most of the time > the query completes in the allotted time and there is no conflict. > > However, if the query takes too long, it may be preferable to end the query, > protect the Solr node, and provide the user with a somewhat inaccurate > sorted list. As noted in SOLR-6930, SOLR-5986 and others, timeAllowed is > frequently used to prevent runaway load. In fact, cursorMark and > shards.tolerant are allowed in combination, so any argument in favor of > purity would be a bit muddied, in my opinion. > > This was discussed on the mailing list once, as far as I can find: > [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201506.mbox/%3c5591740b.4080...@elyograg.org%3E] > It did not look like there was strong support for preventing the combination. > > I have tested the cursorMark and timeAllowed combination, and even when > partial results are returned because timeAllowed is exceeded, the > cursorMark response value is still valid and reasonable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14413) allow timeAllowed and cursorMark parameters
[ https://issues.apache.org/jira/browse/SOLR-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Gallagher updated SOLR-14413: -- Attachment: SOLR-14413-jg-update3.patch > allow timeAllowed and cursorMark parameters > --- > > Key: SOLR-14413 > URL: https://issues.apache.org/jira/browse/SOLR-14413 > Project: Solr > Issue Type: Improvement > Components: search >Reporter: John Gallagher >Priority: Minor > Attachments: SOLR-14413-bram.patch, SOLR-14413-jg-update1.patch, > SOLR-14413-jg-update2.patch, SOLR-14413-jg-update3.patch, SOLR-14413.patch, > Screen Shot 2020-10-23 at 10.08.26 PM.png, Screen Shot 2020-10-23 at 10.09.11 > PM.png, image-2020-08-18-16-56-41-736.png, image-2020-08-18-16-56-59-178.png, > image-2020-08-21-14-18-36-229.png, timeallowed_cursormarks_results.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ever since cursorMarks were introduced in SOLR-5463 in 2014, the cursorMark and > timeAllowed parameters have not been allowed in combination ("Can not search using > both cursorMark and timeAllowed"), from [QueryComponent.java|#L359]: > > {code:java} > > if (null != rb.getCursorMark() && 0 < timeAllowed) { > // fundamentally incompatible > throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Can not > search using both " + CursorMarkParams.CURSOR_MARK_PARAM + " and " + > CommonParams.TIME_ALLOWED); > } {code} > While theoretically impure to use them in combination, it is often desirable > to support cursorMark-style deep paging while attempting to protect Solr nodes > from runaway queries using timeAllowed, in the hope that most of the time > the query completes in the allotted time and there is no conflict. > > However, if the query takes too long, it may be preferable to end the query, > protect the Solr node, and provide the user with a somewhat inaccurate > sorted list. As noted in SOLR-6930, SOLR-5986 and others, timeAllowed is > frequently used to prevent runaway load. In fact, cursorMark and > shards.tolerant are allowed in combination, so any argument in favor of > purity would be a bit muddied, in my opinion. > > This was discussed on the mailing list once, as far as I can find: > [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201506.mbox/%3c5591740b.4080...@elyograg.org%3E] > It did not look like there was strong support for preventing the combination. > > I have tested the cursorMark and timeAllowed combination, and even when > partial results are returned because timeAllowed is exceeded, the > cursorMark response value is still valid and reasonable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14413) allow timeAllowed and cursorMark parameters
[ https://issues.apache.org/jira/browse/SOLR-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220001#comment-17220001 ] John Gallagher commented on SOLR-14413: --- [~bvd] the issue was with an assumption the test was making: it is not always true that every document will be found when using timeAllowed and cursorMark in combination. There may be holes in the result sets, but at least the ordering between and within result sets will be correct with respect to the sort. This is what I had suspected was the case when I proposed allowing the combination, but I didn't have an example at the time. I still think it is a good idea to allow these parameters in combination (it's something you could already encounter when using shards.tolerant and cursorMark in combination, and that combination is allowed). When using timeAllowed and cursorMark in combination, and there are multiple segments in the index, it is possible that a query may terminate before visiting the matching documents in every segment. The hint for this is in the warning message's stack trace associated with the failing seed you found in the previous revision: [https://gist.github.com/slackhappy/1a48d56e10679404cea3441f87a0fecc#file-gistfile1-txt-L6] . "The request took too long to iterate over terms." occurs within a specific segment, which prevents iterating on to the next segment. I have updated my pull request: [https://github.com/apache/lucene-solr/pull/1436] I have also updated my proposed documentation changes to mention that results may be missing if partialResults is true: !Screen Shot 2020-10-23 at 10.08.26 PM.png|width=545,height=114! !Screen Shot 2020-10-23 at 10.09.11 PM.png|width=577,height=161! > allow timeAllowed and cursorMark parameters > --- > > Key: SOLR-14413 > URL: https://issues.apache.org/jira/browse/SOLR-14413 > Project: Solr > Issue Type: Improvement > Components: search >Reporter: John Gallagher >Priority: Minor > Attachments: SOLR-14413-bram.patch, SOLR-14413-jg-update1.patch, > SOLR-14413-jg-update2.patch, SOLR-14413.patch, Screen Shot 2020-10-23 at > 10.08.26 PM.png, Screen Shot 2020-10-23 at 10.09.11 PM.png, > image-2020-08-18-16-56-41-736.png, image-2020-08-18-16-56-59-178.png, > image-2020-08-21-14-18-36-229.png, timeallowed_cursormarks_results.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ever since cursorMarks were introduced in SOLR-5463 in 2014, the cursorMark and > timeAllowed parameters have not been allowed in combination ("Can not search using > both cursorMark and timeAllowed"), from [QueryComponent.java|#L359]: > > {code:java} > > if (null != rb.getCursorMark() && 0 < timeAllowed) { > // fundamentally incompatible > throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Can not > search using both " + CursorMarkParams.CURSOR_MARK_PARAM + " and " + > CommonParams.TIME_ALLOWED); > } {code} > While theoretically impure to use them in combination, it is often desirable > to support cursorMark-style deep paging while attempting to protect Solr nodes > from runaway queries using timeAllowed, in the hope that most of the time > the query completes in the allotted time and there is no conflict. > > However, if the query takes too long, it may be preferable to end the query, > protect the Solr node, and provide the user with a somewhat inaccurate > sorted list. As noted in SOLR-6930, SOLR-5986 and others, timeAllowed is > frequently used to prevent runaway load. In fact, cursorMark and > shards.tolerant are allowed in combination, so any argument in favor of > purity would be a bit muddied, in my opinion. > > This was discussed on the mailing list once, as far as I can find: > [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201506.mbox/%3c5591740b.4080...@elyograg.org%3E] > It did not look like there was strong support for preventing the combination. > > I have tested the cursorMark and timeAllowed combination, and even when > partial results are returned because timeAllowed is exceeded, the > cursorMark response value is still valid and reasonable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
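For illustration only, here is a minimal SolrJ sketch of the combination this issue enables, assuming the behavior described above (cursors stay valid even with partial results); the collection name, sort field, and 1000 ms budget are arbitrary assumptions:

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWithTimeAllowed {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(100);
        // cursorMark requires a total sort ending on the uniqueKey field
        q.setSort("id", SolrQuery.ORDER.asc);
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        q.set(CommonParams.TIME_ALLOWED, 1000); // rejected before this change

        QueryResponse rsp = client.query(q);
        if (Boolean.TRUE.equals(rsp.getResponseHeader().get("partialResults"))) {
          // Ordering is still correct, but some matching docs may be skipped.
        }
        String next = rsp.getNextCursorMark();
        if (next.equals(cursor)) {
          break; // cursor did not advance: no more results
        }
        cursor = next;
      }
    }
  }
}
{code}

If {{partialResults}} is true for a page, the ordering is still correct but some matching documents may have been skipped, which is exactly the trade-off described in the comment above.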