[jira] [Commented] (SOLR-14306) Refactor coordination code into separate module and evaluate using Curator
[ https://issues.apache.org/jira/browse/SOLR-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051873#comment-17051873 ] Jan Høydahl commented on SOLR-14306: Seems Kafka has had the same discussions (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-273+-+Kafka+to+support+using+ETCD+beside+Zookeeper) but I think they ended up with https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum instead, i.e., handling state and coordination in Kafka instead of an external system. Would be interesting to evaluate Apache Ratis (http://ratis.incubator.apache.org/) as an embedded zk replacement! > Refactor coordination code into separate module and evaluate using Curator > -- > > Key: SOLR-14306 > URL: https://issues.apache.org/jira/browse/SOLR-14306 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Major > > This Jira issue is to discuss two changes that unfortunately are difficult to > address separately: > # Separate all ZooKeeper coordination logic into its own module that can > be tested in isolation > # Evaluate using Apache Curator for coordination instead of our own logic. > I drafted a > [SIP|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148640472], > but this is very much WIP; I'd like to hear opinions before I spend too much > time on something people hate. > From the initial draft of the SIP: > {quote}The main goal of this change is to allow better testing of the > different ZooKeeper interactions related to coordination (leader election, > queues, etc.). There are already some abstractions in place for lower level > operations (set-data, get-data, etc., see DistribStateManager), so the idea is > to have a new, related abstraction named CoordinationManager, where we could > have some higher level coordination-related classes, like LeaderRunner > (Overseer), LeaderLatch (for shard leaders), etc. Curator comes into play > because, in order to refactor the existing code into these new abstractions, > we'd have to rework much of it, so we could instead consider using Curator, a > library that has been mentioned many times in the past. While I don't think this > is required, it would make this transition and our code simpler (from what I > could see; input from people with more Curator experience would be > greatly appreciated). > While it would be out of the scope of this change, if the > abstractions/interfaces are correctly designed, this could, in the future, > allow using something other than ZooKeeper for coordination, > either etcd or maybe even some in-memory replacement for tests. > {quote} > There are still many open questions, and many questions we don't yet know > we'll have, but please let me know if you have any early feedback, especially > if you've worked with Curator in the past.
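As background for the LeaderLatch idea floated in the SIP draft, below is a minimal, hypothetical sketch of Curator's LeaderLatch recipe, the kind of higher-level primitive the proposed CoordinationManager could expose for shard leaders. The connection string, election path, and participant id are placeholders, not anything from Solr's actual ZooKeeper layout.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchSketch {
  public static void main(String[] args) throws Exception {
    // Connect to ZooKeeper with a retry policy (host and paths are placeholders).
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Every participant creates a latch on the same election path;
    // Curator guarantees exactly one of them holds leadership at a time.
    try (LeaderLatch latch = new LeaderLatch(client, "/election/shard1", "node-1")) {
      latch.start();
      latch.await(); // blocks until this participant is elected
      if (latch.hasLeadership()) {
        // act as leader here; closing the latch releases leadership
      }
    }
    client.close();
  }
}
{code}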
[jira] [Commented] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051889#comment-17051889 ] Dawid Weiss commented on LUCENE-9077: - There is no word-for-word equivalent but the functionality is there (it's part of {{gradlew check}} on each project). See {{gradlew :helpDependencies}} if you need details: {code} Updating dependency checksum and licenses - The last step is to make sure the licenses, notice files and checksums are in place for any new dependencies. This command will print what's missing and where: gradlew licenses To update JAR checksums for licenses use: gradlew updateLicenses {code} > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9077-javadoc-locale-en-US.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > This task focuses on providing a gradle-based build equivalent for Lucene and > Solr (on the master branch). See notes below on why this respin is needed. > The code lives on the *gradle-master* branch. It is kept in sync with *master*. > Try running the following to see an overview of helper guides concerning > typical workflow, testing and ant-migration helpers: > gradlew :help > A list of items that need to be added or require work. If you'd like to > work on any of these, please add your name to the list. Once you have a > patch/ pull request let me (dweiss) know - I'll try to coordinate the merges. > * (/) Apply forbiddenAPIs > * (/) Generate hardware-aware gradle defaults for parallelism (count of > workers and test JVMs). > * (/) Fail the build if the --tests filter is applied and no tests execute > during the entire build (this allows for an empty set of filtered tests at > the single project level). > * (/) Port other settings and randomizations from common-build.xml > * (/) Configure security policy/ sandboxing for tests. > * (/) tests' console output on -Ptests.verbose=true > * (/) add a :helpDeps explanation of how the dependency system works > (palantir plugin, lockfile) and how to retrieve structured information about > current dependencies of a given module (in a tree-like output). > * (/) jar checksums, jar checksum computation and validation. This should be > done without intermediate folders (directly on dependency sets). > * (/) verify min. JVM version and exact gradle version on build startup to > minimize odd build side-effects > * (/) Repro-line for failed tests/ runs. > * (/) add a top-level README note about building with gradle (and the > required JVM). > * (/) add an equivalent of 'validate-source-patterns' > (check-source-patterns.groovy) to precommit. > * (/) add an equivalent of 'rat-sources' to precommit. > * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) > to precommit. > * (/) javadoc compilation > Hard-to-implement stuff already investigated: > * (/) (done) -*Printing console output of failed tests.* There doesn't seem > to be any way to do this in a reasonably efficient way. There are onOutput > listeners but they're slow to operate and solr tests emit *tons* of output so > it's overkill.- > * (!) (LUCENE-9120) *Tests working with security-debug logs or other > JVM-early log output*. Gradle's test runner works by redirecting Java's > stdout/ syserr so this just won't work. Perhaps we can spin up the ant-based > test runner for such corner-cases.
> Of lesser importance: > * Add an equivalent of 'documentation-lint' to precommit. > * (/) Do not require files to be committed before running precommit (staged > files are fine). > * (/) add rendering of javadocs (gradlew javadoc) > * Attach javadocs to maven publications. > * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid > it'll be difficult to run it sensibly because gradle doesn't offer cwd > separation for the forked test runners. > * if you diff the solr packaged distribution against the ant-created distribution > there are minor differences in library versions and some JARs are excluded/ > moved around. I didn't try to force these as everything seems to work (tests, > etc.) – perhaps these differences should be fixed in the ant build instead. > * (/) identify and port various "regenerate" tasks from ant builds (javacc, > precompiled automata, etc.) > * Fill in POM details in gradle/defaults-maven.gradle so that they reflect > the previous content better (dependencies aside). > * Add any IDE integration layers that should be added (I use IntelliJ and it > imports the project out of the box, without the need for any spe
[jira] [Comment Edited] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051889#comment-17051889 ] Dawid Weiss edited comment on LUCENE-9077 at 3/5/20, 8:27 AM: -- There is no word-for-word equivalent but the functionality is there (it's part of {{gradlew check}} on each project). See {{gradlew :helpDeps}} if you need details: {code} Updating dependency checksum and licenses - The last step is to make sure the licenses, notice files and checksums are in place for any new dependencies. This command will print what's missing and where: gradlew licenses To update JAR checksums for licenses use: gradlew updateLicenses {code} was (Author: dweiss): There is no word-for-word equivalent but the functionality is there (it's part of {{gradlew check}} on each project). See {{gradlew :helpDependencies}} if you need details: {code} Updating dependency checksum and licenses - The last step is to make sure the licenses, notice files and checksums are in place for any new dependencies. This command will print what's missing and where: gradlew licenses To update JAR checksums for licenses use: gradlew updateLicenses {code} > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9077-javadoc-locale-en-US.patch > > Time Spent: 2.5h > Remaining Estimate: 0h
[jira] [Updated] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field
[ https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michele Palmia updated LUCENE-9258: --- Affects Version/s: 7.7.2 > DocTermsIndexDocValues should not assume it's operating on a SortedDocValues > field > -- > > Key: LUCENE-9258 > URL: https://issues.apache.org/jira/browse/LUCENE-9258 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 7.7.2, 8.4 >Reporter: Michele Palmia >Priority: Minor > Attachments: LUCENE-9258.patch > > > When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from > _DocTermsIndexDocValues_, the latter instantiates a new iterator on > _SortedDocValues_ even though the underlying field can actually be of a > different type (e.g., a _SortedSetDocValues_ processed through a > _SortedSetSelector_).
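For readers unfamiliar with the selector indirection described above, here is a self-contained sketch (the field name "category" and the wrapping class are invented for illustration) of how a multi-valued SortedSetDocValues field is commonly presented as a single-valued SortedDocValues view, which is why assuming the raw field is SortedDocValues breaks:
{code}
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.SortedSetSelector;

final class SelectorViewSketch {
  // Returns a single-valued view over a multi-valued field.
  static SortedDocValues minView(LeafReader reader) throws IOException {
    SortedSetDocValues multi = DocValues.getSortedSet(reader, "category");
    // The underlying field is SortedSetDocValues; it only looks like
    // SortedDocValues through the selector's view, so code that opens a raw
    // SortedDocValues iterator on the field is making the wrong assumption.
    return SortedSetSelector.wrap(multi, SortedSetSelector.Type.MIN);
  }
}
{code}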
[GitHub] [lucene-solr] s1monw opened a new pull request #1319: LUCENE-9164: process all events before closing gracefully
s1monw opened a new pull request #1319: LUCENE-9164: process all events before closing gracefully URL: https://github.com/apache/lucene-solr/pull/1319 This is yet another, simpler approach than https://github.com/apache/lucene-solr/pull/1274 to ensure that all events are processed if we are closing the IW gracefully. It also improves the case where we are closing due to a tragic event: we don't try to be heroic, and just drop all pending events on the floor.
[GitHub] [lucene-solr] s1monw commented on issue #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying
s1monw commented on issue #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying URL: https://github.com/apache/lucene-solr/pull/1274#issuecomment-595130965 @mikemccand @dnhatn I explored one more idea that is less intrusive and more contained. I like this one much better: https://github.com/apache/lucene-solr/pull/1319
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully
dweiss commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r388181480 ## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java ## @@ -299,7 +300,70 @@ static int getActualMaxDocs() { final FieldNumbers globalFieldNumberMap; final DocumentsWriter docWriter; - private final Queue eventQueue = new ConcurrentLinkedQueue<>(); + private final CloseableQueue eventQueue = new CloseableQueue(); Review comment: Wouldn't it be nicer to make it just Closeable and pass IndexWriter in the constructor (instead of each method)?
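To make the suggestion concrete, here is a rough sketch of the shape being proposed: a queue that is itself Closeable and receives the writer once at construction instead of in every method. All names are invented for illustration; this is not Lucene's actual implementation (Object stands in for IndexWriter to keep the sketch self-contained).
{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class CloseableQueueSketch implements Closeable {
  interface Event {
    void process(Object writer) throws IOException;
  }

  private final Object writer; // passed once, as the review suggests
  private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
  private volatile boolean closed;

  CloseableQueueSketch(Object writer) {
    this.writer = writer;
  }

  boolean add(Event event) {
    // Reject new events once the queue is closed.
    return closed == false && queue.add(event);
  }

  void processEvents() throws IOException {
    Event event;
    while ((event = queue.poll()) != null) {
      event.process(writer); // no writer parameter needed per call
    }
  }

  @Override
  public void close() throws IOException {
    closed = true;
    processEvents(); // drain pending events on a graceful close
  }
}
{code}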
[GitHub] [lucene-solr] s1monw commented on issue #1319: LUCENE-9164: process all events before closing gracefully
s1monw commented on issue #1319: LUCENE-9164: process all events before closing gracefully URL: https://github.com/apache/lucene-solr/pull/1319#issuecomment-595136991 thanks for looking @dweiss
[GitHub] [lucene-solr] bruno-roustant opened a new pull request #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.
bruno-roustant opened a new pull request #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes. URL: https://github.com/apache/lucene-solr/pull/1320 This PR modifies many classes because it removes the Reader attributes, which are now unused since the FST is always loaded off-heap.
[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package
[ https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052040#comment-17052040 ] Bruno Roustant commented on LUCENE-9257: New PR #1320 removes FSTLoadMode and *also* Reader attributes. When removing FSTLoadMode I realized that Reader attributes were introduced for it and now become unused. Since Reader attributes represent a lot of code, I think it is worth removing them. If someone needs to get them back sometime in the future, they are in commit a302be381ea611e57d32d7f277206e726329fa6e. Please tell me if it is ok to remove Reader attributes, or if we should keep them (but in that case, where do we define the attribute key constant?). > FSTLoadMode should not be BlockTree specific as it is used more generally in > index package > -- > > Key: LUCENE-9257 > URL: https://issues.apache.org/jira/browse/LUCENE-9257 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > FSTLoadMode and its associated attribute key (a static String) are currently > defined in BlockTreeTermsReader, but they are actually used outside of > BlockTree in the general "index" package. > CheckIndex and ReadersAndUpdates are using this enum and attribute key to > drive the FST load mode through the SegmentReader, which is not specific to a > postings format. They have an unnecessary dependency on BlockTreeTermsReader. > We could move FSTLoadMode out of BlockTreeTermsReader to make it a public > enum of the "index" package. That way CheckIndex and ReadersAndUpdates no > longer import BlockTreeTermsReader. > This would also allow other postings formats to use the same enum (e.g., > LUCENE-9254)
[jira] [Created] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory
Yannick Welsch created LUCENE-9264: -- Summary: Remove SimpleFSDirectory in favor of NIOFsDirectory Key: LUCENE-9264 URL: https://issues.apache.org/jira/browse/LUCENE-9264 Project: Lucene - Core Issue Type: Improvement Reporter: Yannick Welsch {{SimpleFSDirectory}} appears to duplicate what's already offered by {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} uses non-positional reads on the {{FileChannel}} (i.e., reads that are stateful, changing the current position), and {{SimpleFSDirectory}} therefore has to externally synchronize access to the read method. On Windows, positional reads are not supported, which is why {{FileChannel}} already uses internal synchronization to guarantee access by only one thread at a time for positional reads (see {{read(ByteBuffer dst, long position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, which returns true on Windows), and the JDK implementation for Windows emulates positional reads by using non-positional ones, see [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139]. This means that on Windows, there should be no difference between {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it should be equally poor, as both implementations only allow one thread at a time to read). On Linux/Mac, however, {{NIOFsDirectory}} is superior to {{SimpleFSDirectory}}, as positional reads (pread) can be done concurrently. My proposal is to remove {{SimpleFSDirectory}} and replace its uses with {{NIOFsDirectory}}, given how similar these two directory implementations are ({{SimpleFSDirectory}} isn't really simpler).
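A small, self-contained illustration of the difference described above (the file name and offsets are placeholders): a positional read carries its offset per call and leaves the channel position untouched, while a non-positional read mutates the channel's single shared position and therefore needs external locking.
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PositionalReadSketch {
  public static void main(String[] args) throws IOException {
    try (FileChannel ch = FileChannel.open(Paths.get("some-file.bin"), StandardOpenOption.READ)) {
      ByteBuffer buf = ByteBuffer.allocate(1024);

      // Positional read (pread): the offset is an argument and the channel's
      // position is untouched, so threads can read concurrently on Linux/Mac.
      ch.read(buf, 4096L);

      buf.clear();

      // Non-positional read: consumes the channel's shared position, so
      // callers must serialize access (the SimpleFSDirectory situation).
      ch.position(4096L);
      ch.read(buf);
    }
  }
}
{code}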
[jira] [Updated] (LUCENE-8103) QueryValueSource should use TwoPhaseIterator
[ https://issues.apache.org/jira/browse/LUCENE-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michele Palmia updated LUCENE-8103: --- Attachment: LUCENE-8103.patch > QueryValueSource should use TwoPhaseIterator > > > Key: LUCENE-8103 > URL: https://issues.apache.org/jira/browse/LUCENE-8103 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/other >Reporter: David Smiley >Priority: Minor > Attachments: LUCENE-8103.patch > > > QueryValueSource (in "queries" module) is a ValueSource representation of a > Query; the score is the value. It ought to try to use a TwoPhaseIterator > from the query if it can be offered. This will prevent possibly expensive > advancing beyond documents that we aren't interested in.
[jira] [Commented] (LUCENE-8103) QueryValueSource should use TwoPhaseIterator
[ https://issues.apache.org/jira/browse/LUCENE-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052064#comment-17052064 ] Michele Palmia commented on LUCENE-8103: Why would a Scorer offer a fast TwoPhaseIterator but not serve it (repackaged as a DocIdSetIterator) when asked for a simple old-school iterator()? In my naivety, I would expect that if a fast TPI is implemented, it would also be served when clients call iterator(). If that's not the case, and an explicit repackaging is useful, here's my patch. [^LUCENE-8103.patch] > QueryValueSource should use TwoPhaseIterator > > > Key: LUCENE-8103 > URL: https://issues.apache.org/jira/browse/LUCENE-8103 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/other >Reporter: David Smiley >Priority: Minor > Attachments: LUCENE-8103.patch
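For readers following along, here is a sketch of the pattern the issue asks for, i.e., iterating the cheap approximation and confirming candidates with matches() instead of paying the full per-document cost up front. The wrapping class and method are invented for illustration; only the Lucene API calls are real.
{code}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TwoPhaseIterator;
import org.apache.lucene.search.Weight;

final class TwoPhaseSketch {
  static void visitMatches(Weight weight, LeafReaderContext context) throws IOException {
    Scorer scorer = weight.scorer(context);
    if (scorer == null) {
      return; // no matches in this segment
    }
    TwoPhaseIterator tpi = scorer.twoPhaseIterator(); // null if there is no cheap approximation
    if (tpi != null) {
      DocIdSetIterator approx = tpi.approximation();
      for (int doc = approx.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approx.nextDoc()) {
        if (tpi.matches()) {
          // confirmed match: scorer.score() would be the value-source value
        }
      }
    } else {
      DocIdSetIterator it = scorer.iterator();
      for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
        // every doc returned here is already a confirmed match
      }
    }
  }
}
{code}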
[jira] [Commented] (SOLR-14147) enable security manager by default
[ https://issues.apache.org/jira/browse/SOLR-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052069#comment-17052069 ] Cassandra Targett commented on SOLR-14147: -- I just noticed that this commit removed a section of the securing-solr.adoc page in the Ref Guide (added by SOLR-13984) that, for 8.x, explains how to enable the security manager. While I get the point of doing that - if it's enabled by default, users don't need to enable it - there seem to be a couple of reasons why someone might need to disable it, so instead of removing the section entirely I would suggest it should have been edited to describe how to disable it. [~marcussorealheis], any objection to me adding the section back to master, edited in this way? > enable security manager by default > -- > > Key: SOLR-14147 > URL: https://issues.apache.org/jira/browse/SOLR-14147 > Project: Solr > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Time Spent: 6h 20m > Remaining Estimate: 0h > > For 9.0, set SOLR_SECURITY_MANAGER_ENABLED=true by default. Remove the step > from the securing solr page as it will be done by default (defaults become safe). > Users can disable it if they are running hadoop or doing other crazy stuff.
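For anyone landing here looking for the opt-out once the 9.0 default flips: disabling should presumably come down to overriding the variable named in this issue in the include script, along these lines (a sketch only; check the Ref Guide section discussed above for the authoritative steps):
{code}
# In solr.in.sh (solr.in.cmd on Windows): opt out of the 9.0 default,
# e.g., when running with Hadoop or other code the sandbox breaks.
SOLR_SECURITY_MANAGER_ENABLED=false
{code}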
[jira] [Commented] (SOLR-14147) enable security manager by default
[ https://issues.apache.org/jira/browse/SOLR-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052073#comment-17052073 ] ASF subversion and git services commented on SOLR-14147: Commit 74b9ba396c670cff7b738563475a92b8051f6690 in lucene-solr's branch refs/heads/master from Cassandra Targett [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=74b9ba3 ] SOLR-14147: comment out for now link to security manager docs in upgrade notes that don't exist on master > enable security manager by default > -- > > Key: SOLR-14147 > URL: https://issues.apache.org/jira/browse/SOLR-14147 > Project: Solr > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Time Spent: 6h 20m > Remaining Estimate: 0h
[jira] [Resolved] (SOLR-13983) remove or replace process execution in SystemInfoHandler
[ https://issues.apache.org/jira/browse/SOLR-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-13983. Fix Version/s: 8.5 Resolution: Fixed > remove or replace process execution in SystemInfoHandler > > > Key: SOLR-13983 > URL: https://issues.apache.org/jira/browse/SOLR-13983 > Project: Solr > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Fix For: 8.5 > > Attachments: SOLR-13983.patch > > > SystemInfoHandler is the only place in solr code executing processes. > Since solr is a server/long running process listening to HTTP, ideally > process execution could be disabled (e.g., with the security manager). But first > this code needs to be removed or replaced, so that there is no legitimate use > of it: > {noformat} > try { > if (!Constants.WINDOWS) { > info.add( "uname", execute( "uname -a" ) ); > info.add( "uptime", execute( "uptime" ) ); > } > } catch( Exception ex ) { > log.warn("Unable to execute command line tools to get operating system > properties.", ex); > } > return info; > {noformat} > It already looks like it's getting data from the OS MXBean here, so maybe this > logic is simply outdated or not needed. It seems to be "best-effort" anyway. > Alternatively, similar information could be read from, e.g., the /proc > filesystem if needed.
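The MXBean route mentioned in the description can cover most of what the uname/uptime calls provided, without forking a process. A minimal sketch (class name invented; the mappings to uname flags are approximate):
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OsInfoSketch {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    System.out.println("name:    " + os.getName());              // roughly `uname -s`
    System.out.println("version: " + os.getVersion());           // roughly `uname -r`
    System.out.println("arch:    " + os.getArch());              // roughly `uname -m`
    System.out.println("load:    " + os.getSystemLoadAverage()); // uptime-style load; -1.0 if unavailable
  }
}
{code}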
[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory
[ https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052077#comment-17052077 ] Robert Muir commented on LUCENE-9264: - +1 > Remove SimpleFSDirectory in favor of NIOFsDirectory > --- > > Key: LUCENE-9264 > URL: https://issues.apache.org/jira/browse/LUCENE-9264 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Yannick Welsch >Priority: Minor
[GitHub] [lucene-solr] msokolov commented on a change in pull request #1313: LUCENE-8962: Split test case
msokolov commented on a change in pull request #1313: LUCENE-8962: Split test case URL: https://github.com/apache/lucene-solr/pull/1313#discussion_r388264716 ## File path: lucene/core/src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java ## @@ -298,63 +320,44 @@ public void testMergeOnCommit() throws IOException, InterruptedException { DirectoryReader firstReader = DirectoryReader.open(firstWriter); assertEquals(5, firstReader.leaves().size()); firstReader.close(); -firstWriter.close(); - -MergePolicy mergeOnCommitPolicy = new LogDocMergePolicy() { - @Override - public MergeSpecification findFullFlushMerges(MergeTrigger mergeTrigger, SegmentInfos segmentInfos, MergeContext mergeContext) { -// Optimize down to a single segment on commit -if (mergeTrigger == MergeTrigger.COMMIT && segmentInfos.size() > 1) { - List nonMergingSegments = new ArrayList<>(); - for (SegmentCommitInfo sci : segmentInfos) { -if (mergeContext.getMergingSegments().contains(sci) == false) { - nonMergingSegments.add(sci); -} - } - if (nonMergingSegments.size() > 1) { -MergeSpecification mergeSpecification = new MergeSpecification(); -mergeSpecification.add(new OneMerge(nonMergingSegments)); -return mergeSpecification; - } -} -return null; - } -}; +firstWriter.close(); // When this writer closes, it does not merge on commit. -AtomicInteger abandonedMerges = new AtomicInteger(0); IndexWriterConfig iwc = newIndexWriterConfig(new MockAnalyzer(random())) -.setMergePolicy(mergeOnCommitPolicy) -.setIndexWriterEvents(new IndexWriterEvents() { - @Override - public void beginMergeOnCommit() { - - } - - @Override - public void finishMergeOnCommit() { +.setMergePolicy(MERGE_ON_COMMIT_POLICY); - } - - @Override - public void abandonedMergesOnCommit(int abandonedCount) { -abandonedMerges.incrementAndGet(); - } -}); IndexWriter writerWithMergePolicy = new IndexWriter(dir, iwc); - -writerWithMergePolicy.commit(); +writerWithMergePolicy.commit(); // No changes. Commit doesn't trigger a merge. DirectoryReader unmergedReader = DirectoryReader.open(writerWithMergePolicy); -assertEquals(5, unmergedReader.leaves().size()); // Don't merge unless there's a change +assertEquals(5, unmergedReader.leaves().size()); unmergedReader.close(); TestIndexWriter.addDoc(writerWithMergePolicy); -writerWithMergePolicy.commit(); +writerWithMergePolicy.commit(); // Doc added, do merge on commit. +assertEquals(1, writerWithMergePolicy.getSegmentCount()); // DirectoryReader mergedReader = DirectoryReader.open(writerWithMergePolicy); -assertEquals(1, mergedReader.leaves().size()); // Now we merge on commit +assertEquals(1, mergedReader.leaves().size()); mergedReader.close(); +try (IndexReader reader = writerWithMergePolicy.getReader()) { + IndexSearcher searcher = new IndexSearcher(reader); + assertEquals(6, reader.numDocs()); + assertEquals(6, searcher.count(new MatchAllDocsQuery())); +} + +writerWithMergePolicy.close(); +dir.close(); + } + + // Test that when we have multiple indexing threads merging on commit, we never throw an exception. + @Nightly Review comment: Yes, I think given it does not assert anything -- just makes sure no exceptions occur -- we should already be well-covered.
[GitHub] [lucene-solr] juanka588 commented on a change in pull request #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.
juanka588 commented on a change in pull request #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes. URL: https://github.com/apache/lucene-solr/pull/1320#discussion_r388274441 ## File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/FieldReader.java ## @@ -82,32 +80,11 @@ // System.out.println("BTTR: seg=" + segment + " field=" + fieldInfo.name + " rootBlockCode=" + rootCode + " divisor=" + indexDivisor); // } rootBlockFP = (new ByteArrayDataInput(rootCode.bytes, rootCode.offset, rootCode.length)).readVLong() >>> BlockTreeTermsReader.OUTPUT_FLAGS_NUM_BITS; -// Initialize FST offheap if index is MMapDirectory and -// docCount != sumDocFreq implying field is not primary key +// Initialize FST always off-heap. if (indexIn != null) { - switch (fstLoadMode) { -case ON_HEAP: - isFSTOffHeap = false; - break; -case OFF_HEAP: - isFSTOffHeap = true; - break; -case OPTIMIZE_UPDATES_OFF_HEAP: - isFSTOffHeap = ((this.docCount != this.sumDocFreq) || openedFromWriter == false); - break; -case AUTO: - isFSTOffHeap = ((this.docCount != this.sumDocFreq) || openedFromWriter == false) && indexIn instanceof ByteBufferIndexInput; - break; -default: - throw new IllegalStateException("unknown enum constant: " + fstLoadMode); - } final IndexInput clone = indexIn.clone(); clone.seek(indexStartFP); - if (isFSTOffHeap) { -index = new FST<>(clone, ByteSequenceOutputs.getSingleton(), new OffHeapFSTStore()); - } else { -index = new FST<>(clone, ByteSequenceOutputs.getSingleton()); - } + index = new FST<>(clone, ByteSequenceOutputs.getSingleton(), new OffHeapFSTStore()); Review comment: nice
[GitHub] [lucene-solr] ctargett commented on a change in pull request #1292: SOLR-14284 add expressible support to list, and add example of removing a component
ctargett commented on a change in pull request #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#discussion_r388300341 ## File path: solr/solr-ref-guide/src/stream-api.adoc ## @@ -0,0 +1,210 @@ += Stream Request Handler API +:page-toclevels: 1 +:page-tocclass: right +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + +These API commands work with the `/stream` request handler. + Review comment: Since there are really only two types of actions possible - one to list all available expressions, and four to manipulate daemon streams - I think it might be helpful for users to state that somewhat limited scope upfront. Also, a link from the daemon expression to this new page would be appropriate IMO, and, less importantly, a link to the Stream handler documentation (unless this is intended to be a child of that page - it's not clear where this is intended to live).
[GitHub] [lucene-solr] ctargett commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component
ctargett commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#issuecomment-595237017 The changes here are slightly confusing because the descriptions of the Jira issue and the PR refer to documenting `add-expressible` (etc.), but there are also examples added for `delete-requesthandler`, which is tangential and sort of thrown in there. While the title of this PR mentions it, the descriptions don't, so it's hard to know what to expect. It's fine in the end; it's just a barrier to review worth noting for future PRs. I left another specific comment on the new page, and generally that content is good. However, I know it will fail the build because there is no edit to a page to include the new page as a child of another (so it does not currently fit anywhere in the page hierarchy), which means you didn't run the build first and there could be other issues that need to be resolved before this can be committed.
[jira] [Commented] (SOLR-14284) Document that you can add a new stream function via add-expressible
[ https://issues.apache.org/jira/browse/SOLR-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052157#comment-17052157 ] Cassandra Targett commented on SOLR-14284: -- I took a pass at reviewing this today. I can't totally vouch for its accuracy, but the content is good IMO. The PR is missing a couple of things before it can be committed, though - the biggest is to put the page into the page hierarchy by adding it as a child of another page. Otherwise the build will fail, since the new page technically doesn't belong anywhere. > Document that you can add a new stream function via add-expressible > --- > > Key: SOLR-14284 > URL: https://issues.apache.org/jira/browse/SOLR-14284 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Affects Versions: 8.5 >Reporter: David Eric Pugh >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > I confirmed that in Solr 8.5 you will be able to dynamically add a Stream > function (assuming the Jar is in the path) via the configset api: > curl -X POST -H 'Content-type:application/json' -d '{ > "add-expressible": { > "name": "dog", > "class": "org.apache.solr.handler.CatStream" > } > }' http://localhost:8983/solr/gettingstarted/config
[jira] [Commented] (SOLR-14284) Document that you can add a new stream function via add-expressible
[ https://issues.apache.org/jira/browse/SOLR-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052162#comment-17052162 ] Eric Pugh commented on SOLR-14284: -- Thanks! I suspect I goofed the commit. "Works on my laptop" said every developer :-) > Document that you can add a new stream function via add-expressible > --- > > Key: SOLR-14284 > URL: https://issues.apache.org/jira/browse/SOLR-14284 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Affects Versions: 8.5 >Reporter: David Eric Pugh >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h
[GitHub] [lucene-solr] ctargett commented on a change in pull request #1292: SOLR-14284 add expressible support to list, and add example of removing a component
ctargett commented on a change in pull request #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#discussion_r388311699 ## File path: solr/solr-ref-guide/src/stream-api.adoc ## Review comment: Also, it seems that the v2 API structure is not supported with this API? If that's true, we might want to state that somewhere outright.
[jira] [Updated] (SOLR-13919) RefGuide: Add example for AuditLogger to use log4j to log into separate files
[ https://issues.apache.org/jira/browse/SOLR-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cassandra Targett updated SOLR-13919: - Component/s: documentation > RefGuide: Add example for AuditLogger to use log4j to log into separate files > - > > Key: SOLR-13919 > URL: https://issues.apache.org/jira/browse/SOLR-13919 > Project: Solr > Issue Type: Improvement > Components: documentation, SolrCloud >Affects Versions: 8.3 >Reporter: Jörn Franke >Priority: Minor > > At the moment, the Solr reference guide provides an example of how to log > audit events to the standard Solr log (see > [https://lucene.apache.org/solr/guide/8_3/audit-logging.html]). > This enhancement proposes to include a simple explanation in the reference > guide of how to configure log4j to log audit events to a separate file (this > is already possible with log4j today; this issue is just about adding an > example log4j configuration file for the Solr audit logger). > The reasoning behind this is that it can reduce the load on a SIEM system > significantly, as it only needs to process the relevant audit logs. > To be discussed: should the standard log4j configuration installed with Solr > log all audit events into a separate file (maybe even in the same log > directory) by default?
[jira] [Updated] (SOLR-12865) Custom JSON parser's nested documents example does not work
[ https://issues.apache.org/jira/browse/SOLR-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cassandra Targett updated SOLR-12865: - Component/s: documentation > Custom JSON parser's nested documents example does not work > --- > > Key: SOLR-12865 > URL: https://issues.apache.org/jira/browse/SOLR-12865 > Project: Solr > Issue Type: Bug > Components: documentation >Affects Versions: 7.5 >Reporter: Alexandre Rafalovitch >Priority: Major > Labels: json > > The only example we have for indexing nested JSON using the JSON parser does > not seem to work: > [https://lucene.apache.org/solr/guide/7_5/transforming-and-indexing-custom-json.html#indexing-nested-documents] > Attempt 1, using default schemaless mode: > # bin/solr create -c json_basic > # Example command in V1 format (with the core name switched to the above) > # Indexing fails with: *"msg":"[doc=null] missing required field: id"*. My > guess is that the URP chain does not apply to inner children records > Attempt 2, using the techproducts schema configuration: > # bin/solr create -c json_tp -d sample_techproducts_configs > # Same example command with the new core > # Indexing fails with: *"msg":"Raw data can be stored only if split=/"* (due > to the presence of srcField in params.json) > Attempt 3, continuing the above example but taking out the srcField > configuration: > # Update params.json to remove srcField > # Same example command > # It indexes (but does not commit) > # curl http://localhost:8983/solr/json_tp/update/json -v -d '{commit:{}}' > # The core now contains only one document, with auto-generated "id" and > "_version_" fields (because we have mapUniqueKeyOnly in params.json) > Attempt 4, removing more keys: > # Update params.json to remove mapUniqueKeyOnly > # Same example command > # Indexing fails with: *"msg":"Document is missing mandatory uniqueKey > field: id"* > There does not seem to be a way to index nested JSON using the transformer > approach.
[jira] [Resolved] (SOLR-13120) Bad Documentation Link
[ https://issues.apache.org/jira/browse/SOLR-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cassandra Targett resolved SOLR-13120. -- Resolution: Won't Fix The page in question hadn't been part of our official docs for a long while, and since the migration to cwiki in Summer 2019 it appears to be entirely gone. > Bad Documentation Link > -- > > Key: SOLR-13120 > URL: https://issues.apache.org/jira/browse/SOLR-13120 > Project: Solr > Issue Type: Task >Reporter: Kyle Cundari >Priority: Major > > In the Solr Docs: [https://wiki.apache.org/solr/CommonQueryParameters] > > There is a bad link ("full cursorMark deep paging example") under the "Deep > paging with cursorMark" header.
[GitHub] [lucene-solr] epugh commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component
epugh commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#issuecomment-595269969 Thanks @ctargett for the review. Do you want me to pull that `delete-requesthandler` into another Jira issue? Totally happy to do that, and your feedback makes complete sense about it being a barrier. I am looking at the other comments.
[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory
[ https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052214#comment-17052214 ] Bruno Roustant commented on LUCENE-9264: +1 > Remove SimpleFSDirectory in favor of NIOFsDirectory > --- > > Key: LUCENE-9264 > URL: https://issues.apache.org/jira/browse/LUCENE-9264 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Yannick Welsch >Priority: Minor
[GitHub] [lucene-solr] ctargett commented on a change in pull request #1291: LUCENE-9016: RefGuide meta doc for how to publish website
ctargett commented on a change in pull request #1291: LUCENE-9016: RefGuide meta doc for how to publish website URL: https://github.com/apache/lucene-solr/pull/1291#discussion_r388348872 ## File path: solr/solr-ref-guide/src/meta-docs/publish.adoc ## @@ -47,61 +47,26 @@ To build the HTML: [source,bash] $ ant clean default + -This will produce pages with a DRAFT watermark across them. While these are fine for initial DRAFT publication, see the section <> for steps to produce final production-ready HTML pages. +This will produce pages with a DRAFT watermark across them. While these are fine for initial DRAFT publication, see the section <> for steps to produce final production-ready HTML pages. . The resulting Guide will be in `solr/build/solr-ref-guide`. The HTML files themselves will be in `solr/build/solr-ref-guide/html-site`. Review comment: Since you changed the heading this refers to back to its original, this reference should be changed too. This is why precommit failed.
[GitHub] [lucene-solr] ctargett commented on issue #1291: LUCENE-9016: RefGuide meta doc for how to publish website
ctargett commented on issue #1291: LUCENE-9016: RefGuide meta doc for how to publish website URL: https://github.com/apache/lucene-solr/pull/1291#issuecomment-595274181 Sorry, I approved this and then looked at why precommit failed, and there is still 1 page reference that is incorrect (since you changed the section title back to what it was originally).
[GitHub] [lucene-solr] ctargett commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component
ctargett commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#issuecomment-595275870 It's fine to keep it here, it just was a disconnect at first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] epugh commented on a change in pull request #1292: SOLR-14284 add expressible support to list, and add example of removing a component
epugh commented on a change in pull request #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#discussion_r388354635 ## File path: solr/solr-ref-guide/src/stream-api.adoc ## @@ -0,0 +1,210 @@ += Stream Request Handler API +:page-toclevels: 1 +:page-tocclass: right +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + +These API commands work with the `/stream` request handler. + Review comment: I reworked the intro, and used the `NOTE` to call out the lack of following v2 API structure. Also, there is a link from the Daemon expression detail to this page and vice versa, and I modified the `streaming-expressions.adoc` to have `stream-api.adoc` be a child page. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] epugh commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component
epugh commented on issue #1292: SOLR-14284 add expressible support to list, and add example of removing a component URL: https://github.com/apache/lucene-solr/pull/1292#issuecomment-595290097 I think I have responded and pushed up all changes. Thank you for reviewing, and let me know if any other changes are needed! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"
[ https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052294#comment-17052294 ] Tomoko Uchida commented on SOLR-11746: -- [~ctargett] I've used asciidoctor 1.5.6.2, and after updating its version {{ant build-site}} started to work for me again. Thank you! And yes, with the Gradle build we shouldn't have such problems. :) (I didn't know about jruby-gradle-plugin, but it seems to work much like [Bundler|https://bundler.io/], the de facto dependency management tool for Ruby.) > numeric fields need better error handling for prefix/wildcard syntax -- > consider uniform support for "foo:* == foo:[* TO *]" > > > Key: SOLR-11746 > URL: https://issues.apache.org/jira/browse/SOLR-11746 > Project: Solr > Issue Type: Bug >Affects Versions: 7.0 >Reporter: Chris M. Hostetter >Assignee: Houston Putman >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, > SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, > SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch > > > On the solr-user mailing list, Torsten Krah pointed out that with Trie > numeric fields, query syntax such as {{foo_d:\*}} has been functionally > equivalent to {{foo_d:\[\* TO \*]}} and asked why this was not also supported > for Point based numeric fields. > The fact that this type of syntax works (for {{indexed="true"}} Trie fields) > appears to have been an (untested, undocumented) fluke of Trie fields given > that they use indexed terms for the (encoded) numeric terms and inherit the > default implementation of {{FieldType.getPrefixQuery}} which produces a > prefix query against the {{""}} (empty string) term. > (Note that this syntax has apparently _*never*_ worked for Trie fields with > {{indexed="false" docValues="true"}} ) > In general, we should assess the behavior when users attempt a prefix/wildcard > syntax query against numeric fields, as currently the behavior is largely > nonsensical: prefix/wildcard syntax frequently matches no docs w/o any sort > of error, and the aforementioned {{numeric_field:*}} behaves inconsistently > between points/trie fields and between indexed/docValued trie fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-12325) introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet
[ https://issues.apache.org/jira/browse/SOLR-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052324#comment-17052324 ] Munendra S N commented on SOLR-12325: - +1 to adding an additional test A few nitpicks: * Remove any usage of System.out.* in the patch * matchPart has a bug: it returns on the first iteration whether or not the keys match, so only the first key is ever compared. Also, instead of returning just {{err}}, I think it would be better if we include the key for which the comparison failed
{code:java}
for (String key : keys) {
  if ((((Map) inputObj2).get(key)).equals(((Map) inputObj1).get(key))) {
    return null; // the culprit
  } else {
    return "err";
  }
}
{code}
* Also, do we need logging in {{matchTwoJSONs}} when we are already throwing an exception? If the log serves some purpose we can keep it; otherwise, we can avoid it. The method would be cleaner without the failed flags [~mkhl] should we close this as branch_8_5 is cut from master? > introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet > -- > > Key: SOLR-12325 > URL: https://issues.apache.org/jira/browse/SOLR-12325 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Mikhail Khludnev >Assignee: Mikhail Khludnev >Priority: Major > Fix For: 8.5 > > Attachments: SOLR-12325.patch, SOLR-12325.patch, SOLR-12325.patch, > SOLR-12325.patch, SOLR-12325.patch, > SOLR-12325_Random_test_for_uniqueBlockQuery (1).patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > It might be a faster twin for {{uniqueBlock(\_root_)}}. Please utilise the built-in > query parsing method, don't invent your own. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
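A minimal sketch of the corrected comparison loop (the method shape and names follow the snippet above; this is an illustrative fix, not the committed patch):
{code:java}
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class JsonCompareSketch {
  // Compare the given keys across two maps: return null when every key
  // matches, otherwise name the first key whose values differ.
  static String matchPart(Map<String, Object> inputObj1,
                          Map<String, Object> inputObj2,
                          Set<String> keys) {
    for (String key : keys) {
      if (!Objects.equals(inputObj1.get(key), inputObj2.get(key))) {
        return "mismatch for key: " + key; // include the failing key, not just "err"
      }
    }
    return null; // return only after all keys have been checked
  }
}
{code}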
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052328#comment-17052328 ] Michael Gibney commented on SOLR-13807: --- Regarding TermFacetCacheRegenerator, my understanding of CacheHelper.getKey() is that the returned keys should work the same way at the segment level as they do at the top level; notably, that the types of modifications you mention (deletes, in-place DV updates, etc.) should result in the creation of a new cache key. Is that not true? {{countCacheDf}} is defined wrt the main domain DocSet.size(), and only affects whether the {{termFacetCache}} is consulted for a given domain-request combination. It should _not_ affect the cached values themselves, if that's your concern. As far as the temporarily tabled concerns about concurrent mutation, this was something I considered, and (I think) addressed [here|https://github.com/apache/lucene-solr/pull/751/files#diff-1b16fc96c8dde547ddde619e54a45c26R1158-R1161]:
{code:java}
if (segmentCache == null) {
  // no cache presence; initialize.
  cacheState = CacheState.NOT_CACHED;
  newSegmentCache = new HashMap<>(fcontext.searcher.getIndexReader().leaves().size() + 1);
} else if (segmentCache.containsKey(topLevelKey)) {
  topLevelEntry = segmentCache.get(topLevelKey);
  CachedCountSlotAcc acc = new CachedCountSlotAcc(fcontext, topLevelEntry.topLevelCounts);
  return new SweepCountAccStruct(qKey, docs, CacheState.CACHED, null, isBase, acc,
      new ReadOnlyCountSlotAccWrapper(fcontext, acc), acc);
} else {
  // defensive copy, since cache entries are shared across threads
  cacheState = CacheState.PARTIALLY_CACHED;
  newSegmentCache = new HashMap<>(fcontext.searcher.getIndexReader().leaves().size() + 1);
  newSegmentCache.putAll(segmentCache);
}
{code}
In that last {{else}} block, each domain-request combination that finds a partial cache entry (with some segments populated) creates and populates an entirely new, request-private top-level cache entry (initially sharing the immutable segment-level entries from the extant top-level entry). On completion of processing, this new top-level entry is placed atomically into the termFacetCache. I believe this should be robust; and if indeed robust, at worst you'd end up with concurrent requests each doing the work of creating equivalent top-level cache entries, the last of which would remain in the cache ... which should be no worse than the status quo, where each request always does all the work of recalculating facet counts. > Caching for term facet counts > - > > Key: SOLR-13807 > URL: https://issues.apache.org/jira/browse/SOLR-13807 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Affects Versions: master (9.0), 8.2 >Reporter: Michael Gibney >Priority: Minor > Attachments: SOLR-13807__SOLR-13132_test_stub.patch > > > Solr does not have a facet count cache; so for _every_ request, term facets > are recalculated for _every_ (facet) field, by iterating over _every_ field > value for _every_ doc in the result domain, and incrementing the associated > count. > As a result, subsequent requests end up redoing a lot of the same work, > including all associated object allocation, GC, etc. This situation could > benefit from integrated caching. > Because of the domain-based, serial/iterative nature of term facet > calculation, latency is proportional to the size of the result domain. 
> Consequently, one common/clear manifestation of this issue is high latency > for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be > observed on a top-level landing page that exposes facets. This type of > "static" case is often mitigated by external (to Solr) caching, either with a > caching layer between Solr and a front-end application, or within a front-end > application, or even with a caching layer between the end user and a > front-end application. > But in addition to the overhead of handling this caching elsewhere in the > stack (or, for a new user, even being aware of this as a potential issue to > mitigate), any external caching mitigation is really only appropriate for > relatively static cases like the "landing page" example described above. A > Solr-internal facet count cache (analogous to the {{filterCache}}) would > provide the following additional benefits: > # ease of use/out-of-the-box configuration to address a common performance > concern > # compact (specifically caching count arrays, without the extra baggage that > accompanies a naive external caching approach) > # NRT-friendly (could be implemented to be segment-aware) > # modular, capable of reusing the same cached values in conjunction with > variant requests over the same result domain (this would support common use > cases like paging, but also potentially more interesting direct uses of > facets). > # could be used for distributed refinement (i.e., if facet counts over a > given domain are cached, a refinement request could simply look up the > ordinal value for each enumerated term and directly grab the count out of the > count array that was cached during the first phase of facet calculation) > # composable (e.g., in aggregate functions that calculate values based on > facet counts across different domains, like SKG/relatedness – see SOLR-13132) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
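A toy sketch of the defensive-copy-then-atomic-replace pattern described in the comment above (the map-based cache and value types here are simplified stand-ins for the PR's {{termFacetCache}} classes, not Solr code):
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FacetCacheSketch {
  // top-level cache shared across concurrent requests
  private final ConcurrentHashMap<String, Map<Integer, long[]>> cache = new ConcurrentHashMap<>();

  long[] countsFor(String topLevelKey, int segmentOrd) {
    Map<Integer, long[]> existing = cache.get(topLevelKey);
    if (existing != null && existing.containsKey(segmentOrd)) {
      return existing.get(segmentOrd); // fully cached: reuse the immutable counts
    }
    // Partial or missing entry: build a request-private copy so the shared
    // entry is never mutated in place by concurrent requests.
    Map<Integer, long[]> fresh = new HashMap<>();
    if (existing != null) {
      fresh.putAll(existing); // share the already-computed immutable segment arrays
    }
    long[] counts = computeCounts(segmentOrd);
    fresh.put(segmentOrd, counts);
    cache.put(topLevelKey, fresh); // atomic replacement of the top-level entry
    return counts;
  }

  private long[] computeCounts(int segmentOrd) {
    return new long[16]; // placeholder for real per-segment facet counting
  }
}
{code}
At worst, concurrent requests each build equivalent entries and the last insert wins, matching the behavior described above.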
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052350#comment-17052350 ] Chris M. Hostetter commented on SOLR-13807: --- bq. my understanding of CacheHelper.getKey() is that the returned keys ... that the types of modifications you mention (deletes, in-place DV updates, etc.) should result in the creation of a new cache key. Is that not true? I don't know ... it's not something i've looked into in depth, if so then false alarm (but we should double check, and ideally prove it w/a defensive white box test of the regenerator after doing some deletes/in-place updates) bq. countCacheDf is defined wrt the main domain DocSet.size(), and only affects whether the termFacetCache is consulted for a given domain-request combination ... Oh, oh OH ! ... ok that explains so much about what i was seeing in cache stats after various requests. For some reason I thought it controlled whether individual term counts were being cached -- which reminds me: we need ref-guide updates in the PR : ) bq. ...As far as the temporarily tabled concerns about concurrent mutation... Those concerns were largely related to my mistaken impression that different requests w/different {{countCacheDf}} params were causing the original segment level cache values to be mutated in place (w/o doing a new "insert" back into the cache) because that's what i convinced myself was happening to explain the cache stats i was seeing and my vague (misguided) assumptions about how/why {{CacheState.PARTIALLY_CACHED}} existed from skimming the code. Your point about doing a defensive copy of the segment level counts & atomic re-insert of the top level entry after updating the counts for the new segments makes perfect sense. > Caching for term facet counts > - > > Key: SOLR-13807 > URL: https://issues.apache.org/jira/browse/SOLR-13807 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Affects Versions: master (9.0), 8.2 >Reporter: Michael Gibney >Priority: Minor > Attachments: SOLR-13807__SOLR-13132_test_stub.patch > > > Solr does not have a facet count cache; so for _every_ request, term facets > are recalculated for _every_ (facet) field, by iterating over _every_ field > value for _every_ doc in the result domain, and incrementing the associated > count. > As a result, subsequent requests end up redoing a lot of the same work, > including all associated object allocation, GC, etc. This situation could > benefit from integrated caching. > Because of the domain-based, serial/iterative nature of term facet > calculation, latency is proportional to the size of the result domain. > Consequently, one common/clear manifestation of this issue is high latency > for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be > observed on a top-level landing page that exposes facets. This type of > "static" case is often mitigated by external (to Solr) caching, either with a > caching layer between Solr and a front-end application, or within a front-end > application, or even with a caching layer between the end user and a > front-end application. > But in addition to the overhead of handling this caching elsewhere in the > stack (or, for a new user, even being aware of this as a potential issue to > mitigate), any external caching mitigation is really only appropriate for > relatively static cases like the "landing page" example described above. 
A > Solr-internal facet count cache (analogous to the {{filterCache}}) would > provide the following additional benefits: > # ease of use/out-of-the-box configuration to address a common performance > concern > # compact (specifically caching count arrays, without the extra baggage that > accompanies a naive external caching approach) > # NRT-friendly (could be implemented to be segment-aware) > # modular, capable of reusing the same cached values in conjunction with > variant requests over the same result domain (this would support common use > cases like paging, but also potentially more interesting direct uses of > facets). > # could be used for distributed refinement (i.e., if facet counts over a > given domain are cached, a refinement request could simply look up the > ordinal value for each enumerated term and directly grab the count out of the > count array that was cached during the first phase of facet calculation) > # composable (e.g., in aggregate functions that calculate values based on > facet counts across different domains, like SKG/relatedness – see SOLR-13132) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9016) Document how to update web site
[ https://issues.apache.org/jira/browse/LUCENE-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052352#comment-17052352 ] ASF subversion and git services commented on LUCENE-9016: - Commit ceb90ce0e8e8996a524c314397b7a8e38f4a4796 in lucene-solr's branch refs/heads/master from Jan Høydahl [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ceb90ce ] LUCENE-9016: RefGuide meta doc for how to publish website (#1291) > Document how to update web site > --- > > Key: LUCENE-9016 > URL: https://issues.apache.org/jira/browse/LUCENE-9016 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Find all documentation across Wiki, RefGuide, scripts and website itself that > talks about how to update or publish the web site, and update accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] janhoy merged pull request #1291: LUCENE-9016: RefGuide meta doc for how to publish website
janhoy merged pull request #1291: LUCENE-9016: RefGuide meta doc for how to publish website URL: https://github.com/apache/lucene-solr/pull/1291 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats. URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054 **Benchmarks** sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.425
LuceneCluster(n_probes=5)     0.749     574.186
LuceneCluster(n_probes=10)    0.874     308.455
LuceneCluster(n_probes=20)    0.951     116.871
LuceneCluster(n_probes=50)    0.993     67.354
LuceneCluster(n_probes=100)   0.999     34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.722
LuceneCluster(n_probes=5)     0.680     618.438
LuceneCluster(n_probes=10)    0.766     335.956
LuceneCluster(n_probes=20)    0.835     173.782
LuceneCluster(n_probes=50)    0.905     72.747
LuceneCluster(n_probes=100)   0.948     37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats. URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054 **Benchmarks** sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.425
LuceneCluster(n_probes=2)     0.536     1138.926
LuceneCluster(n_probes=5)     0.749     574.186
LuceneCluster(n_probes=10)    0.874     308.455
LuceneCluster(n_probes=20)    0.951     116.871
LuceneCluster(n_probes=50)    0.993     67.354
LuceneCluster(n_probes=100)   0.999     34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.722
LuceneCluster(n_probes=5)     0.680     618.438
LuceneCluster(n_probes=10)    0.766     335.956
LuceneCluster(n_probes=20)    0.835     173.782
LuceneCluster(n_probes=50)    0.905     72.747
LuceneCluster(n_probes=100)   0.948     37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9016) Document how to update web site
[ https://issues.apache.org/jira/browse/LUCENE-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl closed LUCENE-9016. --- > Document how to update web site > --- > > Key: LUCENE-9016 > URL: https://issues.apache.org/jira/browse/LUCENE-9016 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Find all documentation across Wiki, RefGuide, scripts and website itself that > talks about how to update or publish the web site, and update accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9016) Document how to update web site
[ https://issues.apache.org/jira/browse/LUCENE-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl resolved LUCENE-9016. - Resolution: Fixed > Document how to update web site > --- > > Key: LUCENE-9016 > URL: https://issues.apache.org/jira/browse/LUCENE-9016 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Find all documentation across Wiki, RefGuide, scripts and website itself that > talks about how to update or publish the web site, and update accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9016) Document how to update web site
[ https://issues.apache.org/jira/browse/LUCENE-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052354#comment-17052354 ] ASF subversion and git services commented on LUCENE-9016: - Commit ebe35df13a12ad912d7edc03020e6273371c1acf in lucene-solr's branch refs/heads/branch_8x from Jan Høydahl [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ebe35df ] LUCENE-9016: RefGuide meta doc for how to publish website (#1291) (cherry picked from commit ceb90ce0e8e8996a524c314397b7a8e38f4a4796) > Document how to update web site > --- > > Key: LUCENE-9016 > URL: https://issues.apache.org/jira/browse/LUCENE-9016 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Find all documentation across Wiki, RefGuide, scripts and website itself that > talks about how to update or publish the web site, and update accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats. URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054 **Benchmarks** In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.425
LuceneCluster(n_probes=5)     0.749     574.186
LuceneCluster(n_probes=10)    0.874     308.455
LuceneCluster(n_probes=20)    0.951     116.871
LuceneCluster(n_probes=50)    0.993     67.354
LuceneCluster(n_probes=100)   0.999     34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.722
LuceneCluster(n_probes=5)     0.680     618.438
LuceneCluster(n_probes=10)    0.766     335.956
LuceneCluster(n_probes=20)    0.835     173.782
LuceneCluster(n_probes=50)    0.905     72.747
LuceneCluster(n_probes=100)   0.948     37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats. URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054 **Benchmarks** In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.425
LuceneCluster(n_probes=5)     0.749     574.186
LuceneCluster(n_probes=10)    0.874     308.455
LuceneCluster(n_probes=20)    0.951     116.871
LuceneCluster(n_probes=50)    0.993     67.354
LuceneCluster(n_probes=100)   0.999     34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH                      RECALL    QPS
LuceneExact()                 1.000     6.722
LuceneCluster(n_probes=5)     0.680     618.438
LuceneCluster(n_probes=10)    0.766     335.956
LuceneCluster(n_probes=20)    0.835     173.782
LuceneCluster(n_probes=50)    0.905     72.747
LuceneCluster(n_probes=100)   0.948     37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
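A small sketch of the overhead correction described above (all numbers are illustrative, not taken from the benchmark results):
```java
public class OverheadCorrection {
  public static void main(String[] args) {
    double measuredMsPerQuery = 6.25; // end-to-end latency through py4j
    double py4jOverheadMs = 3.0;      // bridge overhead, measured separately per search
    double correctedMs = measuredMsPerQuery - py4jOverheadMs;
    // report queries per second based on the corrected per-query latency
    System.out.printf("corrected QPS: %.2f%n", 1000.0 / correctedMs);
  }
}
```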
[jira] [Commented] (SOLR-14306) Refactor coordination code into separate module and evaluate using Curator
[ https://issues.apache.org/jira/browse/SOLR-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052383#comment-17052383 ] Tomas Eduardo Fernandez Lobbe commented on SOLR-14306: -- Thanks Jan. I think with the right interfaces, we should be able to replace the underlying implementation we use for coordination (either one of those you suggested or maybe others we haven't thought of). While making them pluggable is out of the scope of this particular SIP, I think it's a step in that direction. Even if we decide not to make it pluggable, and never to replace ZooKeeper, this is still important for improving testing IMO. > Refactor coordination code into separate module and evaluate using Curator > -- > > Key: SOLR-14306 > URL: https://issues.apache.org/jira/browse/SOLR-14306 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Major > > This Jira issue is to discuss two changes that unfortunately are difficult to > address separately > # Separate all ZooKeeper coordination logic into it’s own module, that can > be tested in isolation > # Evaluate using Apache Curator for coordination instead of our own logic. > I drafted a > [SIP|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148640472], > but this is very much WIP, I’d like to hear opinions before I spend too much > time on something people hates. > From the initial draft of the SIP: > {quote}The main goal of this change is to allow better testing of the > different ZooKeeper interactions related to coordination (leader election, > queues, etc). There are already some abstractions in place for lower level > operations (set-data, get-data, etc, see DistribStateManager), so the idea is > to have a new, related abstraction named CoordinationManager, where we could > have some higher level coordination-related classes, like LeaderRunner > (Overseer), LeaderLatch (for shard leaders), etc. Curator comes into place > because, in order to refactor the existing code into these new abstractions, > we’d have to rework much of it, so we could instead consider using Curator, a > library that was mentioned in the past many times. While I don’t think this > is required, It would make this transition and our code simpler (from what I > could see, however, input from people with more Curator experience would be > greatly appreciated). > While it would be out of the scope of this change, If the > abstractions/interfaces are correctly designed, this could lead to, in the > future, be able to use something other than ZooKeeper for coordination, > either etcd or maybe even some in-memory replacement for tests. > {quote} > There are still many open questions, and many questions I still don’t know > we’ll have, but please, let me know if you have any early feedback, specially > if you’ve worked with Curator in the past. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14308) Multi-threaded facet.query
Gregory Koldirkaev created SOLR-14308: - Summary: Multi-threaded facet.query Key: SOLR-14308 URL: https://issues.apache.org/jira/browse/SOLR-14308 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: faceting, search Reporter: Gregory Koldirkaev Add multi-threading support for facet.query. The facet.threads parameter can be used for this purpose, just as it is for facet.field. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mikemccand commented on issue #1319: LUCENE-9164: process all events before closing gracefully
mikemccand commented on issue #1319: LUCENE-9164: process all events before closing gracefully URL: https://github.com/apache/lucene-solr/pull/1319#issuecomment-595367820 I'll try to review this one soon! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory
[ https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052390#comment-17052390 ] Michael McCandless commented on LUCENE-9264: +1 > Remove SimpleFSDirectory in favor of NIOFsDirectory > --- > > Key: LUCENE-9264 > URL: https://issues.apache.org/jira/browse/LUCENE-9264 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Yannick Welsch >Priority: Minor > > {{SimpleFSDirectory}} looks to duplicate what's already offered by > {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} is > using non-positional reads on the {{FileChannel}} (i.e., reads that are > stateful, changing the current position), and {{SimpleFSDirectory}} therefore > has to externally synchronize access to the read method. > On Windows, positional reads are not supported, which is why {{FileChannel}} > is already internally using synchronization to guarantee only access by one > thread at a time for positional reads (see {{read(ByteBuffer dst, long > position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, > which returns true on Windows) and the JDK implementation for Windows is > emulating positional reads by using non-positional ones, see > [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139]. > This means that on Windows, there should be no difference between > {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it > should be equally poor as both implementations only allow one thread at a > time to read). On Linux/Mac, {{NIOFsDirectory}} is superior to > {{SimpleFSDirectory}}, however, as positional reads (pread) can be done > concurrently. > My proposal is to remove {{SimpleFSDirectory}} and replace its uses with > {{NIOFsDirectory}}, given how similar these two directory implementations are > ({{SimpleFSDirectory}} isn't really simpler). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory
[ https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052409#comment-17052409 ] Adrien Grand commented on LUCENE-9264: -- +1 [~ywelsch] would you like to open a pull request? > Remove SimpleFSDirectory in favor of NIOFsDirectory > --- > > Key: LUCENE-9264 > URL: https://issues.apache.org/jira/browse/LUCENE-9264 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Yannick Welsch >Priority: Minor > > {{SimpleFSDirectory}} looks to duplicate what's already offered by > {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} is > using non-positional reads on the {{FileChannel}} (i.e., reads that are > stateful, changing the current position), and {{SimpleFSDirectory}} therefore > has to externally synchronize access to the read method. > On Windows, positional reads are not supported, which is why {{FileChannel}} > is already internally using synchronization to guarantee only access by one > thread at a time for positional reads (see {{read(ByteBuffer dst, long > position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, > which returns true on Windows) and the JDK implementation for Windows is > emulating positional reads by using non-positional ones, see > [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139]. > This means that on Windows, there should be no difference between > {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it > should be equally poor as both implementations only allow one thread at a > time to read). On Linux/Mac, {{NIOFsDirectory}} is superior to > {{SimpleFSDirectory}}, however, as positional reads (pread) can be done > concurrently. > My proposal is to remove {{SimpleFSDirectory}} and replace its uses with > {{NIOFsDirectory}}, given how similar these two directory implementations are > ({{SimpleFSDirectory}} isn't really simpler). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on issue #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.
jpountz commented on issue #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes. URL: https://github.com/apache/lucene-solr/pull/1320#issuecomment-595382926 In the spirit of @dsmiley 's recent email, let's add a CHANGES entry? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14306) Refactor coordination code into separate module and evaluate using Curator
[ https://issues.apache.org/jira/browse/SOLR-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052420#comment-17052420 ] Mike Drob commented on SOLR-14306: -- I wonder if the double change of coordination module + curator migration is going to cause us to miss something due to too many moving parts, or make it harder to review and understand the changes and prevent regressions. I also am concerned that if we do both changes at the same time we end up with a bad abstraction that looks ok but is actually very Curator specific. Why do you believe that these issues are difficult to address separately? I really like the idea of having higher level abstractions in place - are the overseer and shard leader election code paths using common tools right now, or is each implemented separately? I haven't been in that part of Solr recently, so I don't know what the current state looks like. I know that [~marcussorealheis] has looked at efforts to swap zookeeper for etcd in the past, so he probably has thoughts here too. > Refactor coordination code into separate module and evaluate using Curator > -- > > Key: SOLR-14306 > URL: https://issues.apache.org/jira/browse/SOLR-14306 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Major > > This Jira issue is to discuss two changes that unfortunately are difficult to > address separately > # Separate all ZooKeeper coordination logic into it’s own module, that can > be tested in isolation > # Evaluate using Apache Curator for coordination instead of our own logic. > I drafted a > [SIP|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148640472], > but this is very much WIP, I’d like to hear opinions before I spend too much > time on something people hates. > From the initial draft of the SIP: > {quote}The main goal of this change is to allow better testing of the > different ZooKeeper interactions related to coordination (leader election, > queues, etc). There are already some abstractions in place for lower level > operations (set-data, get-data, etc, see DistribStateManager), so the idea is > to have a new, related abstraction named CoordinationManager, where we could > have some higher level coordination-related classes, like LeaderRunner > (Overseer), LeaderLatch (for shard leaders), etc. Curator comes into place > because, in order to refactor the existing code into these new abstractions, > we’d have to rework much of it, so we could instead consider using Curator, a > library that was mentioned in the past many times. While I don’t think this > is required, It would make this transition and our code simpler (from what I > could see, however, input from people with more Curator experience would be > greatly appreciated). > While it would be out of the scope of this change, If the > abstractions/interfaces are correctly designed, this could lead to, in the > future, be able to use something other than ZooKeeper for coordination, > either etcd or maybe even some in-memory replacement for tests. > {quote} > There are still many open questions, and many questions I still don’t know > we’ll have, but please, let me know if you have any early feedback, specially > if you’ve worked with Curator in the past. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14274) Multiple CoreContainers will register the same JVM Metrics
[ https://issues.apache.org/jira/browse/SOLR-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052427#comment-17052427 ] Mike Drob commented on SOLR-14274: -- [~ab] - gentle ping on this, would be interested to know what you think of this PR. > Multiple CoreContainers will register the same JVM Metrics > -- > > Key: SOLR-14274 > URL: https://issues.apache.org/jira/browse/SOLR-14274 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mike Drob >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When running multiple CoreContainer in the same JVM, either because we called > {{SolrCloudTestCase.configureCluster(int n)}} with {{n > 1}} or because we > have multiple tests running in the same JVM in succession, we will have > contention on the shared JVM {{metricsRegistry}} as they each replace the > existing metrics with their own. Further, with multiple nodes at the same > time, some of these metrics will be incorrect anyway, since they will only > reflect a single core container. Others will be fine since I think they are > reading system-level information so it doesn't matter where it comes from. > I think this is a test-only issue, since the circumstances where somebody is > running multiple core containers in a single JVM in production should be > rare, but maybe there are edge cases affected with EmbeddedSolrServer and > MapReduce or Spark, or other unusual deployment patterns. > Removing the metrics registration entirely can speed up > {{configureCluster(100).build()}} on my machine from 2 minutes to 30 seconds, > so I'm optimistic that there can be gains here without sacrificing the > feature entirely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1294#discussion_r388498188 ## File path: lucene/core/src/java/org/apache/lucene/search/SliceExecutionControlPlane.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.concurrent.CompletableFuture; +import java.util.concurrent.Executor; +import java.util.concurrent.Future; +import java.util.concurrent.FutureTask; +import java.util.concurrent.RejectedExecutionException; + +/** + * Execution control plane which is responsible + * for execution of slices based on the current status + * of the system and current system load + */ +class SliceExecutionControlPlane { Review comment: nit: I'd prefer a simpler name, e.g. `SliceExecutor` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1294#discussion_r388497511 ## File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java ## @@ -662,34 +676,19 @@ public TopFieldDocs reduce(Collection collectors) throws IOEx } query = rewrite(query); final Weight weight = createWeight(query, scoreMode, 1); - final List> topDocsFutures = new ArrayList<>(leafSlices.length); - for (int i = 0; i < leafSlices.length - 1; ++i) { + final List listTasks = new ArrayList<>(); Review comment: Let's avoid introducing warnings about generics, FutureTask needs to be parameterized? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
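A minimal illustration of the parameterization being requested: giving `FutureTask` its type argument removes the raw-type warning (the `Callable` body is a stand-in, not the searcher code):
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.FutureTask;

class ParameterizedTasks {
  static List<FutureTask<Integer>> makeTasks(int n) {
    // FutureTask<V> carries its result type, so callers get typed get() results
    List<FutureTask<Integer>> tasks = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      final int slice = i;
      tasks.add(new FutureTask<>(() -> slice * 2)); // inferred as Callable<Integer>
    }
    return tasks;
  }
}
```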
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1294#discussion_r388493819 ## File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java ## @@ -211,6 +213,18 @@ public IndexSearcher(IndexReaderContext context, Executor executor) { assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader(); reader = context.reader(); this.executor = executor; +this.sliceExecutionControlPlane = executor == null ? null : getSliceExecutionControlPlane(executor); +this.readerContext = context; +leafContexts = context.leaves(); +this.leafSlices = executor == null ? null : slices(leafContexts); + } + + // Package private for testing + IndexSearcher(IndexReaderContext context, Executor executor, SliceExecutionControlPlane sliceExecutionControlPlane) { +assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader(); +reader = context.reader(); +this.executor = executor; +this.sliceExecutionControlPlane = executor == null ? null : sliceExecutionControlPlane; Review comment: it feels wrong to not take the one from the constructor? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1294#discussion_r388497042 ## File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java ## @@ -211,6 +215,18 @@ public IndexSearcher(IndexReaderContext context, Executor executor) { assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader(); reader = context.reader(); this.executor = executor; +this.sliceExecutionControlPlane = executor == null ? null : getSliceExecutionControlPlane(executor); +this.readerContext = context; +leafContexts = context.leaves(); +this.leafSlices = executor == null ? null : slices(leafContexts); + } + + // Package private for testing + IndexSearcher(IndexReaderContext context, Executor executor, SliceExecutionControlPlane sliceExecutionControlPlane) { Review comment: Is there anything we need to do with the executor that we couldn't do with the sliceExecutionControlPlane? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14306) Refactor coordination code into separate module and evaluate using Curator
[ https://issues.apache.org/jira/browse/SOLR-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052458#comment-17052458 ] Tomas Eduardo Fernandez Lobbe commented on SOLR-14306: -- Yes, I don't like merging the two, but I felt moving the Solr part alone could have been more difficult and, at the same time, maybe not the best long term if we are talking about moving to Curator eventually. bq. I also am concerned that if we do both changes at the same time we end up with a bad abstraction that looks ok but is actually very Curator specific That's a very good point. I did a POC and it's easy to fall into this. It may or may not be a problem; if we like the interfaces to be curator-oriented, we'd have to make whatever replacement we have look like it later. bq. are the overseer and shard leader election code paths using common tools right now They are in part, yes. One thing I noticed also while looking at Curator is that those two could actually fall into different "recipes". Overseer is essentially "do some work while you are the leader. Stop doing it when you are no longer the leader" (LeaderSelector in Curator), while shard leader is "act differently while you are the leader" (LeaderLatch in Curator). Of course they can both use the same implementation if we want (i.e. we can keep asking in the Overseer "amILeader" and then have listeners to interrupt), but I like that differentiation that Curator makes. > Refactor coordination code into separate module and evaluate using Curator > -- > > Key: SOLR-14306 > URL: https://issues.apache.org/jira/browse/SOLR-14306 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Major > > This Jira issue is to discuss two changes that unfortunately are difficult to > address separately > # Separate all ZooKeeper coordination logic into it’s own module, that can > be tested in isolation > # Evaluate using Apache Curator for coordination instead of our own logic. > I drafted a > [SIP|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148640472], > but this is very much WIP, I’d like to hear opinions before I spend too much > time on something people hates. > From the initial draft of the SIP: > {quote}The main goal of this change is to allow better testing of the > different ZooKeeper interactions related to coordination (leader election, > queues, etc). There are already some abstractions in place for lower level > operations (set-data, get-data, etc, see DistribStateManager), so the idea is > to have a new, related abstraction named CoordinationManager, where we could > have some higher level coordination-related classes, like LeaderRunner > (Overseer), LeaderLatch (for shard leaders), etc. Curator comes into place > because, in order to refactor the existing code into these new abstractions, > we’d have to rework much of it, so we could instead consider using Curator, a > library that was mentioned in the past many times. While I don’t think this > is required, It would make this transition and our code simpler (from what I > could see, however, input from people with more Curator experience would be > greatly appreciated). 
> While it would be out of the scope of this change, If the > abstractions/interfaces are correctly designed, this could lead to, in the > future, be able to use something other than ZooKeeper for coordination, > either etcd or maybe even some in-memory replacement for tests. > {quote} > There are still many open questions, and many questions I still don’t know > we’ll have, but please, let me know if you have any early feedback, specially > if you’ve worked with Curator in the past. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
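For reference, a minimal sketch of the two Curator recipes contrasted above (the connect string and ZooKeeper paths are hypothetical; this illustrates the shape of the recipes, not proposed Solr code):
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorRecipesSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // LeaderSelector: "do some work while you are the leader" --
    // takeLeadership() runs only on the elected instance, and leadership
    // is relinquished when the method returns.
    LeaderSelector selector = new LeaderSelector(client, "/demo/overseer",
        new LeaderSelectorListenerAdapter() {
          @Override
          public void takeLeadership(CuratorFramework c) throws Exception {
            // e.g., process the work queue until done or interrupted
          }
        });
    selector.autoRequeue(); // re-enter the election after losing leadership
    selector.start();

    // LeaderLatch: "act differently while you are the leader" --
    // any participant can ask hasLeadership() at any point.
    LeaderLatch latch = new LeaderLatch(client, "/demo/shard1/leader");
    latch.start();
    if (latch.hasLeadership()) {
      // e.g., take the leader-only code path
    }
  }
}
{code}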
[GitHub] [lucene-solr] andyvuong commented on issue #1293: SOLR-14044: Delete collection bug fix by changing sharedShardName to use the same blob delimiter
andyvuong commented on issue #1293: SOLR-14044: Delete collection bug fix by changing sharedShardName to use the same blob delimiter URL: https://github.com/apache/lucene-solr/pull/1293#issuecomment-595431605 cc @yonik can you merge? Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] msokolov commented on issue #1313: LUCENE-8962: Split test case
msokolov commented on issue #1313: LUCENE-8962: Split test case URL: https://github.com/apache/lucene-solr/pull/1313#issuecomment-595436417 I verified this fixes the `TestIndexWriterExceptions2.testBasics` failure reported by @jpountz, and also beasted that test 1000x just in case. I think we need to get ahead of this given all the failure emails from these tests and the upcoming 8.5 release, so I'll push today. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] msokolov merged pull request #1313: LUCENE-8962: Split test case
msokolov merged pull request #1313: LUCENE-8962: Split test case URL: https://github.com/apache/lucene-solr/pull/1313 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052508#comment-17052508 ] ASF subversion and git services commented on LUCENE-8962: - Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate and write many small segments during {{refresh}}, and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter}}'s > refresh to optionally kick off the merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
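The second commit message above is worth restating as code: finishing without aborting is not the same as having been committed. A simplified illustration of the invariant (not the actual IndexWriter code; the names are stand-ins):
{code:java}
abstract class OneMergeSketch {
  private volatile boolean committed = false;

  // Called by the writer only after the merged segment has actually been
  // applied to the pending-commit SegmentInfos.
  void setCommitted() {
    committed = true;
  }

  void mergeFinished() {
    if (committed) {
      // Safe: the merge result is part of the commit point, so the pending
      // SegmentInfos may be rewritten to reference the merged segment.
      updatePendingCommit();
    }
    // Not committed: do nothing. Before the fix, "finished and not aborted"
    // was wrongly treated as proof the merge made it into the commit.
  }

  abstract void updatePendingCommit();
}
{code}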
[jira] [Commented] (SOLR-14306) Refactor coordination code into separate module and evaluate using Curator
[ https://issues.apache.org/jira/browse/SOLR-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052518#comment-17052518 ] Marcus Eagan commented on SOLR-14306: - I have been thinking about this approach and following the Kafka discussion that Jan posted. It seems that refactoring coordination code into a separate module is a great first step for whichever direction we go in the future. > Refactor coordination code into separate module and evaluate using Curator > -- > > Key: SOLR-14306 > URL: https://issues.apache.org/jira/browse/SOLR-14306 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14306) Refactor coordination code into separate module and evaluate using Curator
[ https://issues.apache.org/jira/browse/SOLR-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052526#comment-17052526 ] Jan Høydahl commented on SOLR-14306: {quote}It seems that refactoring coordination code into a separate module is a great first step for whichever direction we go in the future. {quote} +1. The single biggest obstacle I sense when helping customers with SolrCloud is ZooKeeper. How do we install it, how many nodes, how to secure it, can ZK run on the same nodes as Solr, can we use embedded ZK in our test environment, etc. And I think ZK will be an even bigger topic when more people start deploying in k8s. So if we manage to isolate coordination and cluster state on a higher level, then offering etcd or Ratis plugins in the future will be within reach. > Refactor coordination code into separate module and evaluate using Curator > -- > > Key: SOLR-14306 > URL: https://issues.apache.org/jira/browse/SOLR-14306 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats. URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054 **Benchmarks** In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size). sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH                       RECALL    QPS
LuceneExact()                  1.000     6.425
LuceneCluster(n_probes=5)      0.749     574.186
LuceneCluster(n_probes=10)     0.874     308.455
LuceneCluster(n_probes=20)     0.951     116.871
LuceneCluster(n_probes=50)     0.993     67.354
LuceneCluster(n_probes=100)    0.999     34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH                       RECALL    QPS
LuceneExact()                  1.000     6.722
LuceneCluster(n_probes=5)      0.680     618.438
LuceneCluster(n_probes=10)     0.766     335.956
LuceneCluster(n_probes=20)     0.835     173.782
LuceneCluster(n_probes=50)     0.905     72.747
LuceneCluster(n_probes=100)    0.948     37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
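For context, the recall figures above are presumably the standard top-k overlap used by ann-benchmarks: the fraction of the approximate top-k that appears in the exact top-k. A minimal sketch of that computation (illustrative; not taken from the linked harness):
```java
import java.util.HashSet;
import java.util.Set;

class RecallSketch {
  /** recall@k = |approx ∩ exact| / |exact| for two top-k docID lists. */
  static double recallAtK(int[] approxTopK, int[] exactTopK) {
    Set<Integer> truth = new HashSet<>();
    for (int id : exactTopK) {
      truth.add(id);
    }
    int hits = 0;
    for (int id : approxTopK) {
      if (truth.contains(id)) {
        hits++;
      }
    }
    return (double) hits / exactTopK.length;
  }
}
```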
[jira] [Commented] (SOLR-11359) An autoscaling/suggestions endpoint to recommend operations
[ https://issues.apache.org/jira/browse/SOLR-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052544#comment-17052544 ] Megan Carey commented on SOLR-11359: Would it be possible to explicitly return the URL to hit for applying the suggestion? i.e. rather than returning an HTTP method, operation type, etc., just return the constructed URL for executing the action? Also, are you considering writing a cron to periodically execute these suggestions? > An autoscaling/suggestions endpoint to recommend operations > --- > > Key: SOLR-11359 > URL: https://issues.apache.org/jira/browse/SOLR-11359 > Project: Solr > Issue Type: New Feature > Components: AutoScaling >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Attachments: SOLR-11359.patch > > > Autoscaling can make suggestions to users on what operations they can perform > to improve the health of the cluster > The suggestions will have the following information > * HTTP endpoint > * HTTP method (POST, DELETE) > * command payload -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
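One wrinkle with returning only a URL: a client applying a suggestion still needs the HTTP method and the payload, so all three parts of a suggestion end up being used together anyway. A hypothetical sketch of a client executing a suggestion from its parts (the parameter names are illustrative, not Solr's actual response schema):
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SuggestionRunnerSketch {
  // url = constructed endpoint, method = "POST"/"DELETE", jsonPayload = command body
  static String apply(String url, String method, String jsonPayload) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .method(method, HttpRequest.BodyPublishers.ofString(jsonPayload))
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    return response.body();
  }
}
{code}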
[jira] [Commented] (SOLR-14044) Support shard/collection deletion in shared storage
[ https://issues.apache.org/jira/browse/SOLR-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052545#comment-17052545 ] ASF subversion and git services commented on SOLR-14044: Commit c8c216514af29d94d3f269d01f57e1c0f2421b69 in lucene-solr's branch refs/heads/jira/SOLR-13101 from Yonik Seeley [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c8c2165 ] SOLR-14044: Delete collection bug fix by changing sharedShardName to use the same blob delimiter (#1293) * Change sharedShardName to use blob delimiter and fix test * use assign in test > Support shard/collection deletion in shared storage > --- > > Key: SOLR-14044 > URL: https://issues.apache.org/jira/browse/SOLR-14044 > Project: Solr > Issue Type: Sub-task > Components: SolrCloud >Reporter: Andy Vuong >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > The Solr Cloud deletion APIs for collections and shards are not currently > supported by shared storage but are essential functionality required by the > shared storage design. Deletion of objects from shared storage currently > only happens in the indexing path (on pushes) and after the index file > listings between the local Solr process and external store have been resolved. > > This task is to track supporting the delete shard/collection API commands, and > its scope does not include cleaning up so-called "orphaned" index files from > blob (i.e. files that are no longer referenced by any core.metadata file on > the external store). This will be designed/covered in another subtask. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
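The class of bug this commit fixes is easy to illustrate: the shard name built on the delete path must use the same delimiter as the name written on the push path, or the blob keys never match. A minimal sketch (names assumed; not the actual solr-core code):
{code:java}
class SharedShardNameSketch {
  // Assumed single source of truth for the delimiter used in blob store keys.
  static final String BLOB_DELIMITER = "/";

  // Both the push path and the delete path should build names through this
  // one method; mixing "/" with another separator breaks deletes silently.
  static String sharedShardName(String collectionName, String shardName) {
    return collectionName + BLOB_DELIMITER + shardName;
  }
}
{code}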
[GitHub] [lucene-solr] yonik merged pull request #1293: SOLR-14044: Delete collection bug fix by changing sharedShardName to use the same blob delimiter
yonik merged pull request #1293: SOLR-14044: Delete collection bug fix by changing sharedShardName to use the same blob delimiter URL: https://github.com/apache/lucene-solr/pull/1293 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] bruno-roustant commented on issue #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.
bruno-roustant commented on issue #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes. URL: https://github.com/apache/lucene-solr/pull/1320#issuecomment-595474374 Good point. I added it and classified as 'Other'. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dnhatn commented on issue #1313: LUCENE-8962: Split test case
dnhatn commented on issue #1313: LUCENE-8962: Split test case URL: https://github.com/apache/lucene-solr/pull/1313#issuecomment-595503029 @msfroh @msokolov Thank you for working on the fix. Unfortunately, this is still an issue. Many Elasticsearch tests are [failing](https://github.com/elastic/elasticsearch/issues/53195) even with this change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052607#comment-17052607 ] ASF subversion and git services commented on LUCENE-8962: - Commit e5be034df2fc22f1b88e4d271b25c8fae1c3093f in lucene-solr's branch refs/heads/master from Michael McCandless [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e5be034 ] LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h 10m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052608#comment-17052608 ] ASF subversion and git services commented on LUCENE-8962: - Commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8 in lucene-solr's branch refs/heads/branch_8x from Michael McCandless [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3dbfd10 ] LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h 10m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14307) "user caches" don't support "enable"
[ https://issues.apache.org/jira/browse/SOLR-14307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-14307: -- Attachment: SOLR-14307.patch Status: Open (was: Open) patch with fix and tests > "user caches" don't support "enable" > > > Key: SOLR-14307 > URL: https://issues.apache.org/jira/browse/SOLR-14307 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14307.patch > > > While trying to help write some test cases for SOLR-13807 I discovered that > the code path used for building the {{List<CacheConfig>}} of _user_ caches > (ie: {{<cache ... />}}) doesn't respect the idea of an "enabled" > attribute ... that is only checked for in the code path used for building > singular CacheConfig options from explicit xpaths (ie: {{<filterCache ... />}} etc...) > We should fix this, if for no other reason than so it's easy for tests to use > system properties to enable/disable all caches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
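A hedged solrconfig.xml illustration of the inconsistency (values are examples, not taken from the patch): named caches already honor "enabled", including system-property substitution, while {{<cache>}}-declared user caches ignored it before this fix.
{code:xml}
<query>
  <!-- Named cache: the "enabled" attribute was already respected here. -->
  <filterCache size="512" enabled="${solr.filterCache.enabled:true}"/>
  <!-- User cache: "enabled" was silently ignored before SOLR-14307. -->
  <cache name="myUserCache" size="128" enabled="${solr.userCache.enabled:true}"/>
</query>
{code}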
[jira] [Updated] (SOLR-14307) "user caches" don't support "enable"
[ https://issues.apache.org/jira/browse/SOLR-14307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-14307: -- Status: Patch Available (was: Open) > "user caches" don't support "enable" > > > Key: SOLR-14307 > URL: https://issues.apache.org/jira/browse/SOLR-14307 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14307.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14307) "user caches" don't support "enabled" attribute
[ https://issues.apache.org/jira/browse/SOLR-14307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-14307: -- Summary: "user caches" don't support "enabled" attribute (was: "user caches" don't support "enable") > "user caches" don't support "enabled" attribute > --- > > Key: SOLR-14307 > URL: https://issues.apache.org/jira/browse/SOLR-14307 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14307.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052614#comment-17052614 ] Julie Tibshirani commented on LUCENE-9136: -- Hello [~tomoko]! My explanation before was way too brief, I'm still getting used to the joint JIRA/ GitHub set-up :) I'll give more context on the suggested direction. The draft adds a new format VectorsFormat, which simply delegates to DocValuesFormat and PostingsFormat under the hood: * The original vectors are stored as BinaryDocValues. * The vectors are also clustered through k-means clustering, and the cluster information is stored in postings format. In particular, each cluster centroid is encoded to a BytesRef to represent a term. Each document belonging to the centroid is added to the postings list for that term. Given a query vector, we first iterate through all the centroid terms to find a small number of closest centroids. We then take the disjunction of all those postings enums to obtain a DocIdSetIterator of candidate nearest neighbors. To produce the score for each candidate, we load its vector from BinaryDocValues and compute the distance to the query vector. I liked that this approach didn't introduce major new data structures and could re-use the existing formats. To respond to your point, one difference between this approach and HNSW is that it’s able to re-use the formats without modifications to their APIs or implementations. In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration. So to keep the code as simple as possible, I stuck with BinaryDocValues and didn’t create a new way to store the vector values. However, the PR does introduce a new top-level VectorsFormat as I thought this gave nice flexibility while prototyping. There are two main hacks in the draft that would need addressing: * It's fairly fragile to re-use formats explicitly since we write to the same files as normal doc values and postings – I think there would be a conflict if there were both a vector field and a doc values field with the same name. * To write the postings list, we compute the map from centroid to documents in memory. We then expose it through a hacky Fields implementation called ClusterBackedFields and pass it to the postings writer. It would be better to avoid this hack and not to compute cluster information using a map. Even apart from code-level concerns, I don't think the draft PR would be ready to integrate immediately. There are some areas where I think further work is needed to determine if coarse quantization (IVFFlat) is the right approach: * It would be good to run tests to understand how it scales to larger sets of documents, say in the 5M - 100M range. We would probably want to scale the number of centroids with the number of documents – a common heuristic is to set num centroids = sqrt(dataset size). Looking at the [FAISS experiments|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors], it can be helpful to use an even higher number of centroids. ** Do we still obtain good recall and QPS for these larger dataset sizes? ** Can we still afford to run k-means at index time, given a larger number of centroids? With 10,000 centroids for example, each time we index a document we’ll be computing the distance between the document and 10,000 other vectors. This is a big concern and I think we would need strategies to address it. 
* It’s great that coarse quantization is relatively simple and could be implemented with existing data structures. But would we expect a much bigger speed-up and better scaling with a graph approach like HNSW? I think this still requires more analysis. * More thinking is required as to how to handle deleted documents (as discussed in LUCENE-9004). > Introduce IVFFlat to Lucene for ANN similarity search > - > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Xin-Chun Zhang >Priority: Major > Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, > image-2020-02-16-15-05-02-451.png > > Time Spent: 50m > Remaining Estimate: 0h
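A rough sketch of the scheme Julie describes, expressed with public Lucene APIs rather than a custom format. The field names, the k-means assignment step, and the float encoding are assumptions for illustration, not the PR's actual code:
{code:java}
import java.nio.ByteBuffer;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

class IvfFlatSketch {
  /** Index time: store the raw vector plus a term naming its nearest centroid. */
  static Document asDocument(float[] vector, int nearestCentroidId) {
    Document doc = new Document();
    doc.add(new StringField("centroid", "c" + nearestCentroidId, Field.Store.NO));
    doc.add(new BinaryDocValuesField("vector", new BytesRef(encode(vector))));
    return doc;
  }

  /** Query time: union the postings of the nProbe centroids closest to the query. */
  static Query candidateQuery(int[] closestCentroidIds) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (int id : closestCentroidIds) {
      builder.add(new TermQuery(new Term("centroid", "c" + id)), BooleanClause.Occur.SHOULD);
    }
    // Matches are only candidates; exact distances come from loading each
    // candidate's vector from doc values and comparing it to the query vector.
    return builder.build();
  }

  static byte[] encode(float[] v) {
    ByteBuffer buf = ByteBuffer.allocate(v.length * Float.BYTES);
    for (float f : v) {
      buf.putFloat(f);
    }
    return buf.array();
  }
}
{code}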
[jira] [Commented] (LUCENE-9241) fix most memory-hungry tests
[ https://issues.apache.org/jira/browse/LUCENE-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052668#comment-17052668 ] Robert Muir commented on LUCENE-9241: - [~dweiss] I saw a recent URLClassLoader Windows leak thread on the JDK list and it reminded me of this issue. I'll remove the use of getResource (*please keep in mind there are many of these elsewhere in the codebase if you are actually concerned about this*). Instead, if the user screws up here in their test, they'll get a NullPointerException and they can follow the stack trace. Soon the default NPE from the JDK will actually be more helpful than custom messages like this anyway. > fix most memory-hungry tests > > > Key: LUCENE-9241 > URL: https://issues.apache.org/jira/browse/LUCENE-9241 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9241.patch > > > Currently each test JVM has an Xmx of 512M. With a modern MacBook Pro (one test JVM per core) this adds up to 4GB, which is pretty crazy. > On the other hand, if we fix a few edge cases, tests can work with lower heaps such as 128M. This can save many gigabytes (it also finds interesting memory waste/issues). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
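A small sketch of the failure mode being accepted here, assuming a test that loads a classpath resource: with the explicit null check removed, a missing resource surfaces as a NullPointerException at the first use, and recent JDKs (JEP 358 "helpful NPEs") will name the null expression in the message.
{code:java}
import java.io.IOException;
import java.io.InputStream;

class ResourceLoadSketch {
  byte[] loadTestResource(String name) throws IOException {
    // getResourceAsStream returns null when the resource is absent ...
    InputStream in = getClass().getResourceAsStream(name);
    // ... so a typo in "name" throws NPE right here, with a usable stack trace.
    return in.readAllBytes();
  }
}
{code}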
[jira] [Commented] (LUCENE-9241) fix most memory-hungry tests
[ https://issues.apache.org/jira/browse/LUCENE-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052697#comment-17052697 ] ASF subversion and git services commented on LUCENE-9241: - Commit 9cfdf17b2895866877668002d443277a46cd04e8 in lucene-solr's branch refs/heads/master from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9cfdf17 ] LUCENE-9241: fix tests to pass with -Xmx128m > fix most memory-hungry tests > > > Key: LUCENE-9241 > URL: https://issues.apache.org/jira/browse/LUCENE-9241 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9241.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052727#comment-17052727 ] Xin-Chun Zhang commented on LUCENE-9136: Hi [~jtibshirani], thanks for your excellent work! ??I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations.?? Yes, the code could be simpler by reusing these formats. But I agree with [~tomoko] that ANN search is a pretty new feature to Lucene; it's better to use a dedicated format for maintainability reasons. Moreover, if we are going to use a dedicated vector format for HNSW, it should also be applied to IVFFlat, because IVFFlat and HNSW are used for the same purpose of ANN search. It may be strange to users if IVFFlat and HNSW behave completely differently. ??In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration.?? Actually, we need random access to the vector values! For a typical search engine, we are going to retrieve the best-matched documents after obtaining the top-K docIDs. Retrieving vectors via these docIDs requires random access to the vector values. > Introduce IVFFlat to Lucene for ANN similarity search > - > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Xin-Chun Zhang >Priority: Major > Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, > image-2020-02-16-15-05-02-451.png > > Time Spent: 50m > Remaining Estimate: 0h > > Representation learning (RL) has been an established discipline in the > machine learning space for decades but it draws tremendous attention lately > with the emergence of deep learning. The central problem of RL is to > determine an optimal representation of the input data. By embedding the data > into a high dimensional vector, the vector retrieval (VR) method is then > applied to search the relevant items. > With the rapid development of RL over the past few years, the technique has > been used extensively in industry from online advertising to computer vision > and speech recognition. There exist many open source implementations of VR > algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various > choices for potential users. However, the aforementioned implementations are > all written in C++, and no plan for supporting Java interface, making it hard > to be integrated in Java projects or those who are not familiar with C/C++ > [[https://github.com/facebookresearch/faiss/issues/105]]. > The algorithms for vector retrieval can be roughly classified into four > categories, > # Tree-based algorithms, such as KD-tree; > # Hashing methods, such as LSH (Local Sensitive Hashing); > # Product quantization based algorithms, such as IVFFlat; > # Graph-based algorithms, such as HNSW, SSG, NSG; > where IVFFlat and HNSW are the most popular ones among all the VR algorithms. > IVFFlat is better for high-precision applications such as face recognition, > while HNSW performs better in general scenarios including recommendation and > personalized advertisement. *The recall ratio of IVFFlat could be gradually > increased by adjusting the query parameter (nprobe), while it's hard for HNSW > to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. > Recently, the implementation of HNSW (Hierarchical Navigable Small World, > LUCENE-9004) for Lucene, has made great progress.
The issue draws attention > of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. > As an alternative for solving ANN similarity search problems, IVFFlat is also > very popular with many users and supporters. Compared with HNSW, IVFFlat has > smaller index size but requires k-means clustering, while HNSW is faster in > query (no training required) but requires extra storage for saving graphs > [indexing 1M > vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. > Another advantage is that IVFFlat can be faster and more accurate when > enables GPU parallel computing (current not support in Java). Both algorithms > have their merits and demerits. Since HNSW is now under development, it may > be better to provide both implementations (HNSW && IVFFlat) for potential > users who are faced with very different scenarios and want to more choices. > The latest branch is > [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat] -- This message was sent by Atlassian Jira (v8.3.4#803005)
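A minimal sketch of the random-access point above: BinaryDocValues is a forward-only iterator, so fetching vectors for arbitrary top-K docIDs means visiting them in increasing docID order (or re-pulling the iterator per segment). The "vector" field name and float encoding are assumptions carried over from the sketch earlier in this thread.
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.util.Arrays;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

class VectorLookupSketch {
  static float[][] fetchVectors(LeafReader reader, int[] topKDocIds, int dim) throws IOException {
    int[] sorted = topKDocIds.clone();
    Arrays.sort(sorted); // advanceExact() can only move forward
    BinaryDocValues values = DocValues.getBinary(reader, "vector");
    float[][] vectors = new float[sorted.length][];
    for (int i = 0; i < sorted.length; i++) {
      if (values.advanceExact(sorted[i])) {
        BytesRef bytes = values.binaryValue();
        FloatBuffer floats =
            ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).asFloatBuffer();
        float[] v = new float[dim];
        floats.get(v);
        vectors[i] = v;
      }
    }
    return vectors;
  }
}
{code}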
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052727#comment-17052727 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 3/6/20, 3:34 AM: - Hi [~jtibshirani], thanks for your excellent work! ??I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations.?? Yes, the code could be simpler by reusing these formats. But I agree with [~tomoko] that ANN search is a pretty new feature to Lucene; it's better to use a dedicated format for maintainability reasons. Moreover, if we are going to use a dedicated vector format for HNSW, this format should also be applied to IVFFlat, because IVFFlat and HNSW are used for the same purpose of ANN search. It may be strange to users if IVFFlat and HNSW behave completely differently. ??In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration.?? Actually, we need random access to the vector values! For a typical search engine, we are going to retrieve the best-matched documents after obtaining the top-K docIDs. Retrieving vectors via these docIDs requires random access to the vector values. was (Author: irvingzhang): Hi, [~jtibshirani], thanks for you excellent work! ??I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations.?? Yes, the codes could be simple by reusing these formats. But I agree with [~tomoko] that ANN search is a pretty new feature to Lucene, it's better to use a dedicated format for maintaining reasons. Moreover, If we are going to use a dedicated vector format for HNSW, this could also applied to IVFFlat because IVFFlat and HNSW are used for the same purpose of ANN search. It may be strange to users if IVFFlat and HNSW perform completely different. ??In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration.?? Actually, we need random access to the vector values! For a typical search engine, we are going to retrieving the best matched documents after obtaining the TopK docIDs. Retrieving vectors via these docIDs requires random access to the vector values. > Introduce IVFFlat to Lucene for ANN similarity search > - > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Xin-Chun Zhang >Priority: Major > Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, > image-2020-02-16-15-05-02-451.png > > Time Spent: 50m > Remaining Estimate: 0h
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388704872
## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3147,6 +3149,42 @@ public final boolean flushNextBuffer() throws IOException {
     }
   }

+  private MergePolicy.OneMerge updateSegmentInfosOnMergeFinish(MergePolicy.OneMerge merge, final SegmentInfos toCommit,
+                                                               AtomicReference<CountDownLatch> mergeLatchRef) {
+    return new MergePolicy.OneMerge(merge.segments) {
+      public void mergeFinished() throws IOException {
+        super.mergeFinished();
+        CountDownLatch mergeAwaitLatch = mergeLatchRef.get();
+        if (mergeAwaitLatch == null) {
+          // Commit thread timed out waiting for this merge and moved on. No need to manipulate toCommit.
+          return;
+        }
+        if (isAborted() == false) {
+          deleter.incRef(this.info.files());
+          // Resolve "live" SegmentInfos segments to their toCommit cloned equivalents, based on segment name.
+          Set<String> mergedSegmentNames = new HashSet<>();
+          for (SegmentCommitInfo sci : this.segments) {
+            deleter.decRef(sci.files());
+            mergedSegmentNames.add(sci.info.name);
+          }
+          List<SegmentCommitInfo> toCommitMergedAwaySegments = new ArrayList<>();
+          for (SegmentCommitInfo sci : toCommit) {
+            if (mergedSegmentNames.contains(sci.info.name)) {
+              toCommitMergedAwaySegments.add(sci);
+            }
+          }
+          // Construct a OneMerge that applies to toCommit
+          MergePolicy.OneMerge applicableMerge = new MergePolicy.OneMerge(toCommitMergedAwaySegments);
+          applicableMerge.info = this.info.clone();
+          long segmentCounter = Long.parseLong(this.info.info.name.substring(1), Character.MAX_RADIX);
+          toCommit.counter = Math.max(toCommit.counter, segmentCounter + 1);
+          toCommit.applyMergeChanges(applicableMerge, false);
Review comment: We should modify `toCommit` under the `IndexWriter.this` lock (or a private synchronization between this method and `commitInternal`). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388705514
## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3147,6 +3149,42 @@ public final boolean flushNextBuffer() throws IOException {
     }
   }

+  private MergePolicy.OneMerge updateSegmentInfosOnMergeFinish(MergePolicy.OneMerge merge, final SegmentInfos toCommit,
+                                                               AtomicReference<CountDownLatch> mergeLatchRef) {
+    return new MergePolicy.OneMerge(merge.segments) {
+      public void mergeFinished() throws IOException {
+        super.mergeFinished();
+        CountDownLatch mergeAwaitLatch = mergeLatchRef.get();
+        if (mergeAwaitLatch == null) {
+          // Commit thread timed out waiting for this merge and moved on. No need to manipulate toCommit.
Review comment: We need a stronger synchronization to make sure that we won't modify `toCommit` if `commitInternal` has stopped waiting for these merges. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388705156
## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3252,6 +3315,53 @@ private long prepareCommitInternal() throws IOException {
     } finally {
       maybeCloseOnTragicEvent();
     }
+
+    if (mergeAwaitLatchRef != null) {
+      CountDownLatch mergeAwaitLatch = mergeAwaitLatchRef.get();
+      // If we found and registered any merges above, within the flushLock, then we want to ensure that they
+      // complete execution. Note that since we released the lock, other merges may have been scheduled. We will
+      // block until the merges that we registered complete. As they complete, they will update toCommit to
+      // replace merged segments with the result of each merge.
+      config.getIndexWriterEvents().beginMergeOnCommit();
+      mergeScheduler.merge(this, MergeTrigger.COMMIT, true);
+      long mergeWaitStart = System.nanoTime();
+      int abandonedCount = 0;
+      long waitTimeMillis = (long) (config.getMaxCommitMergeWaitSeconds() * 1000.0);
+      try {
+        if (mergeAwaitLatch.await(waitTimeMillis, TimeUnit.MILLISECONDS) == false) {
+          synchronized (this) {
+            // Need to do this in a synchronized block, to make sure none of our commit merges are currently
+            // executing mergeFinished (since mergeFinished itself is called from within the IndexWriter lock).
+            // After we clear the value from mergeAwaitLatchRef, the merges we schedule will still execute as
+            // usual, but when they finish, they won't attempt to update toCommit or modify segment reference
+            // counts.
+            mergeAwaitLatchRef.set(null);
Review comment: I think we should set `mergeAwaitLatchRef` in the `else` branch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
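A self-contained sketch of the stronger synchronization being requested (not the PR's code; the types are simplified stand-ins): the merge thread checks the latch reference and mutates toCommit under the same monitor the commit thread takes when it abandons the wait, so a timed-out commit can never race with a late merge mutation.
```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

class MergeCommitSyncSketch {
  private final Object writerLock = new Object(); // stand-in for IndexWriter.this
  private final AtomicReference<CountDownLatch> mergeAwaitLatchRef = new AtomicReference<>();

  // Merge thread: apply changes to toCommit only while the commit still waits.
  void onMergeFinished(Runnable applyMergeChangesToCommit) {
    synchronized (writerLock) {
      if (mergeAwaitLatchRef.get() != null) {
        applyMergeChangesToCommit.run(); // safe: commit thread cannot proceed concurrently
      }
      // else: commitInternal() already moved on; leave toCommit untouched.
    }
  }

  // Commit thread: abandoning the wait and disabling merge updates is atomic.
  void onCommitWaitTimeout() {
    synchronized (writerLock) {
      mergeAwaitLatchRef.set(null); // merges finishing later become no-ops
    }
  }
}
```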
[jira] [Commented] (SOLR-14307) "user caches" don't support "enabled" attribute
[ https://issues.apache.org/jira/browse/SOLR-14307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052757#comment-17052757 ]

Lucene/Solr QA commented on SOLR-14307:
---------------------------------------

| (/) *{color:green}+1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 76m 40s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 84m 14s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-14307 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12995795/SOLR-14307.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 9cfdf17 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
| Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/698/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/698/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

> "user caches" don't support "enabled" attribute
> ------------------------------------------------
>
>                 Key: SOLR-14307
>                 URL: https://issues.apache.org/jira/browse/SOLR-14307
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-14307.patch
>
> while trying to help write some test cases for SOLR-13807 i discovered that
> the code path used for building the {{List<CacheConfig>}} of _user_ caches
> (ie: {{<cache ... />}}) doesn't respect the idea of an "enabled" attribute ...
> that is only checked for in the code path used for building singular
> CacheConfig options from explicit xpaths (ie: {{<filterCache ... />}} etc...)
> We should fix this, if for no other reason than so it's easy for tests to use
> system properties to enable/disable all caches.
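To make the desired behavior concrete, here is a hedged sketch in plain w3c DOM (not the actual Solr config-parsing code; the file name and element handling are hypothetical stand-ins) of honoring an "enabled" attribute when collecting user {{<cache>}} nodes, the way the singular-cache path already does:
{code}
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only: skip user <cache> nodes whose enabled
// attribute resolves to false, instead of unconditionally building configs.
public class EnabledCacheFilter {
  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse("solrconfig.xml"); // hypothetical local file
    NodeList caches = doc.getElementsByTagName("cache");
    for (int i = 0; i < caches.getLength(); i++) {
      Element cache = (Element) caches.item(i);
      String enabled = cache.getAttribute("enabled"); // "" when absent
      if (!enabled.isEmpty() && !Boolean.parseBoolean(enabled)) {
        continue; // disabled: don't build a CacheConfig for this node
      }
      System.out.println("would build user cache: " + cache.getAttribute("name"));
    }
  }
}
{code}
With that in place, a test could bind the attribute to a system property via Solr's property substitution (e.g. {{enabled="${solr.userCachesEnabled:true}"}}) to toggle all user caches at once, which is the use case described above.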
[GitHub] [lucene-solr] dnhatn commented on issue #1313: LUCENE-8962: Split test case
dnhatn commented on issue #1313: LUCENE-8962: Split test case
URL: https://github.com/apache/lucene-solr/pull/1313#issuecomment-595597021

I've left some comments in https://github.com/apache/lucene-solr/pull/1155.
[GitHub] [lucene-solr] dnhatn commented on issue #1319: LUCENE-9164: process all events before closing gracefully
dnhatn commented on issue #1319: LUCENE-9164: process all events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#issuecomment-595598568

Thanks, Simon. I will take a look at this tomorrow.
[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053036#comment-17053036 ]

David Smiley commented on SOLR-14040:
--------------------------------------

> But it affected a small subset of users. Now, we have implemented it for
> cloud, it can potentially affect a vast majority of users (if they use it).

Because it's opt-in and it has still been something of a secret feature... sorry, I just don't see the severity that you see. Anyway, how exactly would you propose dealing with this in the immediate term -- for 8.5? I don't think you mean to revert the change in this commit, because the feature remains for standalone -- and hence I think we're having the discussion on the wrong issue; it should be the linked SOLR-14232.

> solr.xml shareSchema does not work in SolrCloud
> ------------------------------------------------
>
>                 Key: SOLR-14040
>                 URL: https://issues.apache.org/jira/browse/SOLR-14040
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Blocker
>             Fix For: 8.5
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the default
> of false to true in order to share IndexSchema objects within the Solr node.
> This is silently ignored in SolrCloud mode. The pertinent code is
> {{org.apache.solr.core.ConfigSetService#createConfigSetService}}, which creates
> a CloudConfigSetService that is not related to the SchemaCaching class. This
> may not be a big deal in SolrCloud, which tends not to deal well with many
> cores per node, but I'm working on changing that.
[jira] [Commented] (SOLR-14232) Add shareSchema leak protections
[ https://issues.apache.org/jira/browse/SOLR-14232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053037#comment-17053037 ]

Noble Paul commented on SOLR-14232:
------------------------------------

Let's say we have classes shared between your solrconfig and schema:

Core1 is created with SRL1: solrconfig uses SRL1, schema uses SRL1 (all good).
Core2 is created with SRL2: solrconfig uses SRL2, schema uses SRL1.

If schema/solrconfig share an object of, say, ClassX, this can lead to ClassCastException. It's avoidable if schema & solrconfig have no shared classes, or, even if you do share them, if they don't get passed around. If you use it internally in your org, it can be avoided if you are careful. We cannot have a public feature that can lead to such a bug.

> Add shareSchema leak protections
> ---------------------------------
>
>                 Key: SOLR-14232
>                 URL: https://issues.apache.org/jira/browse/SOLR-14232
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Schema and Analysis
>            Reporter: David Smiley
>            Priority: Major
>
> The shareSchema option in solr.xml allows cores to share a common IndexSchema,
> assuming the underlying schema is literally the same (from the same configSet).
> However this sharing has no protections to prevent an IndexSchema from
> accidentally referencing the SolrCore and its settings. The effect might be
> nondeterministic behavior depending on which core loaded the schema first, or
> the effect might be a memory leak preventing a closed SolrCore from GC'ing, or
> maybe an error. Example:
> * IndexSchema could theoretically do property expansion using the core's
> props, such as solr.core.name, silly as that may be.
> * IndexSchema uses the same SolrResourceLoader as the core, which in turn
> tracks infoMBeans and other things that can refer to the core. It should
> probably have its own SolrResourceLoader, but that's not trivial; there are
> complications with the life-cycle of ResourceLoaderAware tracking, etc.
> * If anything in IndexSchema is SolrCoreAware, this isn't going to work!
> ** SchemaSimilarityFactory is SolrCoreAware, though I think it could be
> reduced to being SchemaAware and work.
> ** ExternalFileField is currently SchemaAware; it grabs the SolrResourceLoader
> to call getDataDir, which is bad. FYI, in a separate PR I'm removing
> getDataDir from SRL.
> ** Should probably fail if anything is detected.
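To illustrate why two resource loaders make this dangerous, here is a toy, self-contained demo (the jar path and class name are hypothetical stand-ins; SRL1/SRL2 are modeled as plain isolated URLClassLoaders). Because a class's runtime identity is the pair (name, defining loader), the same bytes loaded by two loaders yield incompatible types, which is the ClassCastException described above.
{code}
import java.net.URL;
import java.net.URLClassLoader;

public class TwoLoaderDemo {
  public static void main(String[] args) throws Exception {
    URL jar = new URL("file:///path/to/shared-plugin.jar"); // hypothetical
    // Two loaders (standing in for SRL1 and SRL2) with no common parent
    // that could define ClassX once.
    try (URLClassLoader srl1 = new URLClassLoader(new URL[] {jar}, null);
         URLClassLoader srl2 = new URLClassLoader(new URL[] {jar}, null)) {
      Class<?> viaSrl1 = srl1.loadClass("com.example.ClassX");
      Class<?> viaSrl2 = srl2.loadClass("com.example.ClassX");

      // Same bytecode, but distinct runtime types:
      System.out.println(viaSrl1 == viaSrl2); // false

      // Assumes ClassX has a public no-arg constructor.
      Object fromSrl1 = viaSrl1.getDeclaredConstructor().newInstance();
      // An SRL1-loaded instance is not an instance of the SRL2-loaded type,
      // so handing it across cores fails exactly as described:
      viaSrl2.cast(fromSrl1); // throws ClassCastException
    }
  }
}
{code}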
[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053038#comment-17053038 ]

Noble Paul commented on SOLR-14040:
------------------------------------

Please document this in the ref guide and we can unblock this. Our users end up using undocumented features.

> solr.xml shareSchema does not work in SolrCloud
> ------------------------------------------------
>
>                 Key: SOLR-14040
>                 URL: https://issues.apache.org/jira/browse/SOLR-14040
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Blocker
>             Fix For: 8.5
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the default
> of false to true in order to share IndexSchema objects within the Solr node.
> This is silently ignored in SolrCloud mode. The pertinent code is
> {{org.apache.solr.core.ConfigSetService#createConfigSetService}}, which creates
> a CloudConfigSetService that is not related to the SchemaCaching class. This
> may not be a big deal in SolrCloud, which tends not to deal well with many
> cores per node, but I'm working on changing that.
[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )
[ https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053039#comment-17053039 ]

David Smiley commented on SOLR-13749:
--------------------------------------

[~romseygeek] (8.5 RM) in this issue I'm proposing we expose the committed feature differently, but I don't have time to do it, so I'm proposing we temporarily un-document it until we expose the feature in a sustainable way, as opposed to having a back-compat concern. If need be I'll do this un-document commit.

> Implement support for joining across collections with multiple shards ( XCJF )
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-13749
>                 URL: https://issues.apache.org/jira/browse/SOLR-13749
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Kevin Watters
>            Assignee: Gus Heck
>            Priority: Major
>             Fix For: 8.5
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first one is the "Cross-collection join filter" (XCJF) query parser. It
> can do a call out to a remote collection to get a set of join keys to be used
> as a filter against the local collection.
> The second one is the Hash Range query parser, with which you can specify a
> field name and a hash range; the result is that only the documents that would
> have hashed to that range will be returned.
> The XCJF query parser will do an intersection based on join keys between 2
> collections. The local collection is the collection that you are searching
> against. The remote collection is the collection that contains the join keys
> that you want to use as a filter.
> Each shard participating in the distributed request will execute a query
> against the remote collection. If the local collection is set up with the
> compositeId router to be routed on the join key field, a hash range query is
> applied to the remote collection query to only match the documents that
> contain a potential match for the documents that are in the local shard/core.
>
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to resolve the join keys.|
> |XCJFQuery|The lucene query that executes a search to get back a set of join keys from a remote collection|
> |HashRangeQuery|The lucene query that matches only the documents whose hash code on a field falls within a specified range.|
>
> ||Param||Required||Description||
> |collection|Required|The name of the external Solr collection to be queried to retrieve the set of join key values ( required )|
> |zkHost|Optional|The connection string to be used to connect to Zookeeper. zkHost and solrUrl are both optional parameters, and at most one of them should be specified. If neither of zkHost or solrUrl are specified, the local Zookeeper cluster will be used. ( optional )|
> |solrUrl|Optional|The URL of the external Solr node to be queried ( optional )|
> |from|Required|The join key field name in the external collection ( required )|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to retrieve the set of join key values. Note: The original query can be passed at the end of the string or as the "v" parameter.
> It's recommended to use query parameter substitution with the "v" parameter
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false. If true, the XCJF query will use each shard's hash
> range to determine the set of join keys to retrieve for that shard. This
> parameter improves the performance of the cross-collection join, but it
> depends on the local collection being routed by the toField. If this
> parameter is not specified, the XCJF query will try to determine the correct
> value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered
> valid, in seconds. Defaults to 3600 (one hour). The XCJF query will not be
> aware of changes to the remote collection, so if the remote collection is
> updated, cached XCJF queries may give inaccurate results. After the ttl
> period has expired, the XCJF query will re-execute the join against the
> remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local
> param.|
>
> Example solrconfig.xml changes:
> {code}
> <cache name="hash_vin"
>        class="solr.LRUCache"
>        size="128"
>        initialSize="0"
>        regenerator="solr.No
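For a sense of how the parameters above combine in practice, here is a hedged SolrJ sketch (collection names, field names, and the URL are invented for illustration, and it assumes the parser is registered under the name {{xcjf}} as the issue's title suggests; this is not code from the patch):
{code}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XcjfExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Filter the local collection to documents whose join key matches
      // documents in the remote collection that satisfy the inner query.
      SolrQuery query = new SolrQuery("*:*");
      query.addFilterQuery("{!xcjf collection=remoteProducts from=product_id to=product_id v=\"inStock:true\"}");
      QueryResponse rsp = client.query("localOrders", query);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}
{code}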
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388705156

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3252,6 +3315,53 @@ private long prepareCommitInternal() throws IOException {
       } finally {
         maybeCloseOnTragicEvent();
       }
+
+      if (mergeAwaitLatchRef != null) {
+        CountDownLatch mergeAwaitLatch = mergeAwaitLatchRef.get();
+        // If we found and registered any merges above, within the flushLock, then we want to ensure that they
+        // complete execution. Note that since we released the lock, other merges may have been scheduled. We will
+        // block until the merges that we registered complete. As they complete, they will update toCommit to
+        // replace merged segments with the result of each merge.
+        config.getIndexWriterEvents().beginMergeOnCommit();
+        mergeScheduler.merge(this, MergeTrigger.COMMIT, true);
+        long mergeWaitStart = System.nanoTime();
+        int abandonedCount = 0;
+        long waitTimeMillis = (long) (config.getMaxCommitMergeWaitSeconds() * 1000.0);
+        try {
+          if (mergeAwaitLatch.await(waitTimeMillis, TimeUnit.MILLISECONDS) == false) {
+            synchronized (this) {
+              // Need to do this in a synchronized block, to make sure none of our commit merges are currently
+              // executing mergeFinished (since mergeFinished itself is called from within the IndexWriter lock).
+              // After we clear the value from mergeAwaitLatchRef, the merges we schedule will still execute as
+              // usual, but when they finish, they won't attempt to update toCommit or modify segment reference
+              // counts.
+              mergeAwaitLatchRef.set(null);

Review comment: ~I think we should set `mergeAwaitLatchRef` in the `else` branch.~
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388717615

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3252,6 +3315,53 @@ private long prepareCommitInternal() throws IOException {
       } finally {
         maybeCloseOnTragicEvent();
       }
+
+      if (mergeAwaitLatchRef != null) {
+        CountDownLatch mergeAwaitLatch = mergeAwaitLatchRef.get();
+        // If we found and registered any merges above, within the flushLock, then we want to ensure that they
+        // complete execution. Note that since we released the lock, other merges may have been scheduled. We will
+        // block until the merges that we registered complete. As they complete, they will update toCommit to
+        // replace merged segments with the result of each merge.
+        config.getIndexWriterEvents().beginMergeOnCommit();
+        mergeScheduler.merge(this, MergeTrigger.COMMIT, true);
+        long mergeWaitStart = System.nanoTime();
+        int abandonedCount = 0;
+        long waitTimeMillis = (long) (config.getMaxCommitMergeWaitSeconds() * 1000.0);
+        try {
+          if (mergeAwaitLatch.await(waitTimeMillis, TimeUnit.MILLISECONDS) == false) {
+            synchronized (this) {
+              // Need to do this in a synchronized block, to make sure none of our commit merges are currently
+              // executing mergeFinished (since mergeFinished itself is called from within the IndexWriter lock).
+              // After we clear the value from mergeAwaitLatchRef, the merges we schedule will still execute as
+              // usual, but when they finish, they won't attempt to update toCommit or modify segment reference
+              // counts.
+              mergeAwaitLatchRef.set(null);

Review comment: Sorry I misread this.
[GitHub] [lucene-solr] dnhatn commented on issue #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on issue #1155: LUCENE-8962: Add ability to selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#issuecomment-595607002

Hmm, I missed the fact that `mergeFinished` is executed under the IndexWriter lock. I will dig into this again. Please ignore my previous comments.
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388705514

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3147,6 +3149,42 @@ public final boolean flushNextBuffer() throws IOException {
     }
   }

+  private MergePolicy.OneMerge updateSegmentInfosOnMergeFinish(MergePolicy.OneMerge merge, final SegmentInfos toCommit,
+                                                               AtomicReference<CountDownLatch> mergeLatchRef) {
+    return new MergePolicy.OneMerge(merge.segments) {
+      public void mergeFinished() throws IOException {
+        super.mergeFinished();
+        CountDownLatch mergeAwaitLatch = mergeLatchRef.get();
+        if (mergeAwaitLatch == null) {
+          // Commit thread timed out waiting for this merge and moved on. No need to manipulate toCommit.

Review comment: ~We need stronger synchronization to make sure that we won't modify `toCommit` if `commitInternal` has stopped waiting for these merges.~
[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
dnhatn commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388704872

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3147,6 +3149,42 @@ public final boolean flushNextBuffer() throws IOException {
     }
   }

+  private MergePolicy.OneMerge updateSegmentInfosOnMergeFinish(MergePolicy.OneMerge merge, final SegmentInfos toCommit,
+                                                               AtomicReference<CountDownLatch> mergeLatchRef) {
+    return new MergePolicy.OneMerge(merge.segments) {
+      public void mergeFinished() throws IOException {
+        super.mergeFinished();
+        CountDownLatch mergeAwaitLatch = mergeLatchRef.get();
+        if (mergeAwaitLatch == null) {
+          // Commit thread timed out waiting for this merge and moved on. No need to manipulate toCommit.
+          return;
+        }
+        if (isAborted() == false) {
+          deleter.incRef(this.info.files());
+          // Resolve "live" SegmentInfos segments to their toCommit cloned equivalents, based on segment name.
+          Set<String> mergedSegmentNames = new HashSet<>();
+          for (SegmentCommitInfo sci : this.segments) {
+            deleter.decRef(sci.files());
+            mergedSegmentNames.add(sci.info.name);
+          }
+          List<SegmentCommitInfo> toCommitMergedAwaySegments = new ArrayList<>();
+          for (SegmentCommitInfo sci : toCommit) {
+            if (mergedSegmentNames.contains(sci.info.name)) {
+              toCommitMergedAwaySegments.add(sci);
+            }
+          }
+          // Construct a OneMerge that applies to toCommit
+          MergePolicy.OneMerge applicableMerge = new MergePolicy.OneMerge(toCommitMergedAwaySegments);
+          applicableMerge.info = this.info.clone();
+          long segmentCounter = Long.parseLong(this.info.info.name.substring(1), Character.MAX_RADIX);
+          toCommit.counter = Math.max(toCommit.counter, segmentCounter + 1);
+          toCommit.applyMergeChanges(applicableMerge, false);

Review comment: ~We should modify `toCommit` under the `IndexWriter.this` lock (or a private synchronization between this method and `commitInternal`).~
[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data
[ https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053047#comment-17053047 ]

David Smiley commented on SOLR-13942:
--------------------------------------

FWIW I really like Shalin's input, and his option #3, which I'll copy-paste here:

bq. Deprecate /admin/zookeeper, introduce a clean API, migrate UI to this new endpoint or a better alternative and remove /admin/zookeeper in 9.0

> /api/cluster/zk/* to fetch raw ZK data
> ---------------------------------------
>
>                 Key: SOLR-13942
>                 URL: https://issues.apache.org/jira/browse/SOLR-13942
>             Project: Solr
>          Issue Type: New Feature
>          Components: v2 API
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>            Priority: Minor
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> example: download the {{state.json}} of:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes
> and their metadata