[GitHub] [lucene] msokolov commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections
msokolov commented on code in PR #11860: URL: https://github.com/apache/lucene/pull/11860#discussion_r1027282652 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/ExpandingVectorValues.java: ## @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.backward_codecs.lucene94; + +import java.io.IOException; +import org.apache.lucene.index.FilterVectorValues; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.util.BytesRef; + +/** reads from byte-encoded data */ +public class ExpandingVectorValues extends FilterVectorValues { + + private final float[] value; + + /** + * Constructs ExpandingVectorValues with passed byte encoded VectorValues iterator + * + * @param in the wrapped values + */ + protected ExpandingVectorValues(VectorValues in) { Review Comment: Makes sense to me. The caller can always do the conversion if they like. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
kotman12 commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321155887 Interesting .. this synchronization might not actually be doing anything after all. But I still think there are some thread-unsafe calls here regarding the integration as a whole which make me a bit paranoid. For instance, there seems to be some code paths that which interact with the singleton factory of the `POSModel`. The `POSModel` itself is a singleton that is stored in this static `posTaggerModels` map. If you follow ``` OpenNLPPOSFilterFactory.create -> NLPPOSTaggerOp.getPOSTagger -> NLPPOSTaggerOp::new -> POSTaggerME::new -> POSTaggerFactory.getPOSContextGenerator -> ConfigurablePOSContextGenerator::new -> POSTaggerFactory.createFeatureGenerators ``` you will see it end up here https://github.com/apache/opennlp/blob/6c9e604e74c941935a3c11f9ba91a7c0f8e1ec4c/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerFactory.java#L149 which at first glance appears like a thread-unsafe lazy initialization. I also remember there being other examples. Having said all that, I can't immediately say whether or not the methods you have unsynchronized actually needed to be that way and am scratching my head why they were synched in the first place. But some of these kinds of issues are tricky to pin down even with tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
kotman12 commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321159282 > See background on `java-user@`: https://lists.apache.org/thread/okm7c49r7rv53z7v392v2v4pbv6m1pnw > > The `BaseTokenStreamTestCase.checkRandomData` has proven itself effective at finding thread-safety bugs (e.g. deep in ICU and commons-codec and other 3rd party libraries). I beasted the tests a few times with nightly and multiplier and didn't have issues. Does this library also check for race conditions that can arise between and ResourceLoaderAware::inform vs TokenStream creation and processing? I know it may be out of the scope of this change but I would be curious to know.. Open NLP has these heavy-weight global resources and I am worried as to how they get initialized. There is an intractable amount code paths to analyze once you take into account all of the configuration possibilities of OpenNLP which seems to be partially based on the binary model file you provide. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321166306 I think there is no contention today due to the way the factories work. There is only one thread/tokenstream for *Op. But I am happy to add the random tests and remove the code smell? It may also confuse profilers which is what is really needed here to see where the performance issues are. Some things in opennlp are definitely hurting performance, such as only supporting String in their apis. This will create gazillions of objects. But I doubt it's the biggest issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321167684 > Does this library also check for race conditions that can arise between ResourceLoaderAware::inform vs TokenStream creation and processing? I know it may be out of the scope of this change but I would be curious to know.. Open NLP has these heavy-weight global resources and I am worried as to how they get initialized. There is an intractable amount code paths to analyze once you take into account all of the configuration possibilities of OpenNLP which seems to be partially based on the binary model file you provide. You can see the code here to the current checker: https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java#L949-L951 Basically, we analyze content single-threaded and then spin up threads with a "starting gun" and ensure they create the same result as the single-thread to discover races. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321170185 > Does this library also check for race conditions that can arise between ResourceLoaderAware::inform vs TokenStream creation and processing? I know it may be out of the scope of this change but I would be curious to know.. Specific to this, I think one potential plan: we could refactor the tests more to check for it. Existing tests are using `BaseTokenStreamTestCase` but we could also test factories with https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamFactoryTestCase.java And maybe we could add evil stuff to this `BaseTokenStreamFactoryTestCase` to root out any factory-specific thread hazards across all of our factories (including opennlp). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321171664 and looking more, I think the answer is "no there isn't any race". Calling inform() is basically like part of the initialization of the thing. The factory isn't "done" being created until after this happens. It is silly that it is a separate method but that's just legacy cruft. You can look at CustomAnalyzer source code to see what I mean. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321172760 As far as better stress-testing factories in its base class, we already stress-test using multiple tokenstreams (threads) across a single factory here. We could add stuff to the BaseTokenStreamFactoryTestCase to create N factories of the same type, just for more paranoia, in case there was some race with shared resources across multiple instances of the same factory. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
kotman12 commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321174402 I will take a look at the test locally, thanks. > There is only one thread/tokenstream for *Op. I think we are talking about different things. The mutable members within *Op are not my concern, although I could see how you would think that because at first glance that is what those synch methods may have been guarding. I am more worried about the lazy initialization of members at the Factory level which appear to be singletons and are called from Filter::create, which presumably happens on multiple threads at once (and even today is unsynchronized). Please correct me if my thinking is incorrect but when I saw that I started to question the open-nlp library at a deeper level. I'll reiterate that I see no immediate problem with this change since I can't prove that these synchronized blocks were even doing anything. But I am questioning the thread safety of the integration as it currently stands, irrespective of this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
kotman12 commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321176867 > I will take a look at the test locally, thanks. > > > There is only one thread/tokenstream for *Op. > > I think we are talking about different things. The mutable members within *Op are not my concern, although I could see how you would think that because at first glance that is what those synch methods may have been guarding. I am more worried about the lazy initialization of members at the **NLP** Factory level which appear to be singletons and are indirectly called from lucene's `FilterFactory::create`, which presumably happens on multiple threads at once (and even today is unsynchronized). Please c`rrect me if my thinking is incorrect but when I saw that I started to question the open-nlp library at a deeper level. > > I'll reiterate that I see no immediate problem with this change since I can't prove that these synchronized blocks were even doing anything. But I am questioning the thread safety of the integration as it currently stands, irrespective of this change. I might raise a separate issue to not confuse this thread anymore... what I am talking about is really only tangentially related to this current PR but in my mind it puts into question the thread safety of the integration as it stands now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
kotman12 commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321181537 > As far as better stress-testing factories in its base class, we already stress-test using multiple tokenstreams (threads) across a single factory here. > > We could add stuff to the BaseTokenStreamFactoryTestCase to create N factories of the same type, just for more paranoia, in case there was some race with shared resources across multiple instances of the same factory. Yes this is exactly my paranoia 😃 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321189244 ok, good, I think we are on the same page. sorry, i dont know anything about opennlp and I'm not familiar with their codebase at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)
rmuir commented on PR #11955: URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321193321 and i don't mind doing the work here on the testing honestly. or we can do it in another PR. basically I think we should engage the BaseAnalysisFactoryTestCase with simple basic tests like the other factories have. that's a start. today the factories are not really tested directly. Usually its best to at least check different arguments/parameters and exceptional cases, e.g. https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TestLimitTokenOffsetFilterFactory.java then we can look at if base test class needs any beefing up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss closed issue #10709: gradle precommit sometimes fails with "IOException: stream closed" from javadoc in nightly benchmarks [LUCENE-9670]
dweiss closed issue #10709: gradle precommit sometimes fails with "IOException: stream closed" from javadoc in nightly benchmarks [LUCENE-9670] URL: https://github.com/apache/lucene/issues/10709 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on issue #10236: look into adding -XX:ActiveProcessorCount=1 to parallel build jvms [LUCENE-9196]
dweiss commented on issue #10236: URL: https://github.com/apache/lucene/issues/10236#issuecomment-1321204164 Implemented by #11927 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss closed issue #10236: look into adding -XX:ActiveProcessorCount=1 to parallel build jvms [LUCENE-9196]
dweiss closed issue #10236: look into adding -XX:ActiveProcessorCount=1 to parallel build jvms [LUCENE-9196] URL: https://github.com/apache/lucene/issues/10236 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir opened a new issue, #11957: false warning around 'gitStatus' on newish macos X
rmuir opened a new issue, #11957: URL: https://github.com/apache/lucene/issues/11957 ### Description When I run builds on mac, I see: ``` > Task :gitStatus Cannot run program "/usr/bin/git" (in directory "/usr/bin"): error=2, No such file or directory ``` But the error is actually harmless. `checkWorkingCopy` etc work as you expect. It just confuses me... The jgit does pretty crazy stuff on macos hosts, and I think the issues are there? On my mac `/usr/bin/git` is indeed git and not some xcode wrapper. See logic in https://git.eclipse.org/c/jgit/jgit.git/tree/org.eclipse.jgit/src/org/eclipse/jgit/util/FS_POSIX.java#n145 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on issue #11957: false warning around 'gitStatus' on newish macos X
dweiss commented on issue #11957: URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321223132 Perhaps we should upgrade jgit first? scriptDepVersions.gradle, the latest one is 6.3.0.202209071007, https://repo1.maven.org/maven2/org/eclipse/jgit/org.eclipse.jgit/6.3.0.202209071007-r/ I'm surprised it searches for native git... wasn't it supposed to be pure java? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on issue #11957: false warning around 'gitStatus' on newish macos X
dweiss commented on issue #11957: URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321226220 I looked at the source and the only reason it tries to locate system git is to find out where its config files are. It's actually configurable via: ``` /** * The environment variable that blocks use of the system config file * * @since 3.3 */ public static final String GIT_CONFIG_NOSYSTEM_KEY = "GIT_CONFIG_NOSYSTEM"; ``` So maybe we can just switch it off this way? I don't think it'll make a difference for us. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on issue #11957: false warning around 'gitStatus' on newish macos X
dweiss commented on issue #11957: URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321229903 Yeah... it's absolutely terrible that jgit blindly forks jgit off the path. I just made the Lucene build open a file editor by renaming it to git.exe... I think we should just set the GIT_CONFIG_NOSYSTEM env. variable in gradlew launch scripts. Dumb. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11957: false warning around 'gitStatus' on newish macos X
rmuir commented on issue #11957: URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321248831 good find, thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow opened a new pull request, #11958: GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes
mdmarshmallow opened a new pull request, #11958: URL: https://github.com/apache/lucene/pull/11958 ### Description Added a new `FilterIndexInput` and `FilterIndexOutput` class to handle delegate `IndexOutput` and `IndexInput` classes. I also changed some of the current subclasses that use these delegate patterns to extend `FilterIndexInput` and `FilterIndexOutput`. Here is a list of the subclasses I changed/didn't change: * IndexOutput changed: * EndiannessReverserIndexOutput * RateLimitedIndexOutput * SlowIndexOutput * CorruptingIndexOutput * MockIndexOutputWrapper * ThrottledIndexOutput * IndexOutput unchanged: * ByteBuffersIndexOutput * OutputStreamIndexOutput * DirectIOIndexOutput * IndexInput changed: * EndiannessReverserIndexInput * SeekCountingStream * CountingStream * SlowIndexInput * ReadBytesIndexInputWrapper * MockIndexWrapper * IndexInput unchanged: * BufferedIndexInput * ByteBufferIndexInput * ByteBuffersIndexInput * ChecksumIndexInput * MemorySegmentIndexInput * InterceptingIndexInput * DirectIOIndexInput -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11957: false warning around 'gitStatus' on newish macos X
rmuir commented on issue #11957: URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321256630 i tried it out, `GIT_CONFIG_NOSYSTEM=1 ./gradlew gitStatus` no longer prints the confusing error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stefanvodita commented on pull request #11815: Support deletions in rearrange (#11814)
stefanvodita commented on PR #11815: URL: https://github.com/apache/lucene/pull/11815#issuecomment-1321267832 Thank you for the ideas @zhaih , I’ve incorporated all of them. Your suggestion of using `ByteRef` in the keyset led to me finding a bug in the tests. The index creation methods didn’t delete any docs because I had stored the "delete" marker in a `BinaryDocValuesField`. That’s fixed now and the tests are more thorough with their checks. @mikemccand , I don’t know how doc-values updates work. Why would they not be covered by this solution? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #11815: Support deletions in rearrange (#11814)
zhaih commented on code in PR #11815: URL: https://github.com/apache/lucene/pull/11815#discussion_r1027612271 ## lucene/misc/src/java/org/apache/lucene/misc/index/BinaryDocValueSelector.java: ## @@ -30,52 +31,94 @@ import org.apache.lucene.store.Directory; import org.apache.lucene.util.BitSet; import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.FixedBitSet; -/** Select documents using binary doc values */ +/** + * Use this selector to rearrange an index where documents can be uniquely identified based on + * {@link BinaryDocValues} + */ public class BinaryDocValueSelector implements IndexRearranger.DocumentSelector, Serializable { private final String field; - private final HashSet keySet; + private final Set keySet; - public BinaryDocValueSelector(String field, HashSet keySet) { + public BinaryDocValueSelector(String field, Set keySet) { this.field = field; this.keySet = keySet; } @Override - public BitSet getFilteredLiveDocs(CodecReader reader) throws IOException { + public BitSet getFilteredDocs(CodecReader reader) throws IOException { BinaryDocValues binaryDocValues = reader.getBinaryDocValues(field); -Bits oldLiveDocs = reader.getLiveDocs(); FixedBitSet bits = new FixedBitSet(reader.maxDoc()); -for (int i = 0; i < reader.maxDoc(); i++) { - if (oldLiveDocs != null && oldLiveDocs.get(i) == false) { -continue; - } - if (binaryDocValues.advanceExact(i) - && keySet.contains(binaryDocValues.binaryValue().utf8ToString())) { -bits.set(i); +for (int docid = 0; docid < reader.maxDoc(); docid++) { + if (binaryDocValues.advanceExact(docid) && keySet.contains(binaryDocValues.binaryValue())) { +bits.set(docid); } } return bits; } - public static List createFromExistingIndex( + /** + * Create a selector for the deletes in an index, which can then be applied to a rearranged index + * + * @param field tells which {@link BinaryDocValues} are the unique key + * @param directory where the original index is present + * @return a deletes selector to be passed to {@link IndexRearranger} + */ + public static IndexRearranger.DocumentSelector createDeleteSelectorFromIndex( String field, Directory directory) throws IOException { -List selectors = new ArrayList<>(); + +Set keySet = new HashSet<>(); + try (IndexReader reader = DirectoryReader.open(directory)) { for (LeafReaderContext context : reader.leaves()) { -HashSet keySet = new HashSet<>(); Bits liveDocs = context.reader().getLiveDocs(); +if (liveDocs == null) { + continue; +} + BinaryDocValues binaryDocValues = context.reader().getBinaryDocValues(field); -for (int i = 0; i < context.reader().maxDoc(); i++) { - if (liveDocs != null && liveDocs.get(i) == false) { +for (int docid = 0; docid < context.reader().maxDoc(); docid++) { + if (liveDocs.get(docid) == true) { continue; } - if (binaryDocValues.advanceExact(i)) { -keySet.add(binaryDocValues.binaryValue().utf8ToString()); + + if (binaryDocValues.advanceExact(docid)) { +keySet.add(new BytesRef(binaryDocValues.binaryValue().bytes.clone())); Review Comment: See comment below :) ## lucene/misc/src/java/org/apache/lucene/misc/index/BinaryDocValueSelector.java: ## @@ -30,52 +31,94 @@ import org.apache.lucene.store.Directory; import org.apache.lucene.util.BitSet; import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.FixedBitSet; -/** Select documents using binary doc values */ +/** + * Use this selector to rearrange an index where documents can be uniquely identified based on + * {@link BinaryDocValues} + */ public class BinaryDocValueSelector implements IndexRearranger.DocumentSelector, Serializable { private final String field; - private final HashSet keySet; + private final Set keySet; - public BinaryDocValueSelector(String field, HashSet keySet) { + public BinaryDocValueSelector(String field, Set keySet) { this.field = field; this.keySet = keySet; } @Override - public BitSet getFilteredLiveDocs(CodecReader reader) throws IOException { + public BitSet getFilteredDocs(CodecReader reader) throws IOException { BinaryDocValues binaryDocValues = reader.getBinaryDocValues(field); -Bits oldLiveDocs = reader.getLiveDocs(); FixedBitSet bits = new FixedBitSet(reader.maxDoc()); -for (int i = 0; i < reader.maxDoc(); i++) { - if (oldLiveDocs != null && oldLiveDocs.get(i) == false) { -continue; - } - if (binaryDocValues.advanceExact(i) - && keySet.contains(binaryDocValues.binaryValue().utf8ToString())) { -bits.set(i); +for (int docid = 0; docid < reader.maxDoc();