[GitHub] [lucene] msokolov commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-20 Thread GitBox


msokolov commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1027282652


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/ExpandingVectorValues.java:
##
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.backward_codecs.lucene94;
+
+import java.io.IOException;
+import org.apache.lucene.index.FilterVectorValues;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BytesRef;
+
+/** reads from byte-encoded data */
+public class ExpandingVectorValues extends FilterVectorValues {
+
+  private final float[] value;
+
+  /**
+   * Constructs ExpandingVectorValues with passed byte encoded VectorValues 
iterator
+   *
+   * @param in the wrapped values
+   */
+  protected ExpandingVectorValues(VectorValues in) {

Review Comment:
   Makes sense to me. The caller can always do the conversion if they like.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


kotman12 commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321155887

   Interesting .. this synchronization might not actually be doing anything 
after all. But I still think there are some thread-unsafe calls here regarding 
the integration as a whole which make me a bit paranoid. For instance, there 
seems to be some code paths that which interact with the singleton factory of 
the `POSModel`. The `POSModel` itself is a singleton that is stored in this 
static `posTaggerModels` map. 
   
   If you follow 
   
   ```
   OpenNLPPOSFilterFactory.create -> 
   NLPPOSTaggerOp.getPOSTagger -> 
   NLPPOSTaggerOp::new -> 
   POSTaggerME::new -> 
   POSTaggerFactory.getPOSContextGenerator -> 
   ConfigurablePOSContextGenerator::new -> 
   POSTaggerFactory.createFeatureGenerators 
   ```
   you will see it end up here 
https://github.com/apache/opennlp/blob/6c9e604e74c941935a3c11f9ba91a7c0f8e1ec4c/opennlp-tools/src/main/java/opennlp/tools/postag/POSTaggerFactory.java#L149
 which at first glance appears like a thread-unsafe lazy initialization. I also 
remember there being other examples. 
   
   Having said all that, I can't immediately say whether or not the methods you 
have unsynchronized actually needed to be that way and am scratching my head 
why they were synched in the first place. But some of these kinds of issues are 
tricky to pin down even with tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


kotman12 commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321159282

   > See background on `java-user@`: 
https://lists.apache.org/thread/okm7c49r7rv53z7v392v2v4pbv6m1pnw
   > 
   > The `BaseTokenStreamTestCase.checkRandomData` has proven itself effective 
at finding thread-safety bugs (e.g. deep in ICU and commons-codec and other 3rd 
party libraries). I beasted the tests a few times with nightly and multiplier 
and didn't have issues.
   
   Does this library also check for race conditions that can arise between and 
ResourceLoaderAware::inform vs TokenStream creation and processing? I know it 
may be out of the scope of this change but I would be curious to know.. Open 
NLP has these heavy-weight global resources and I am worried as to how they get 
initialized. There is an intractable amount code paths to analyze once you take 
into account all of the configuration possibilities of OpenNLP which seems to 
be partially based on the binary model file you provide.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321166306

   I think there is no contention today due to the way the factories work. 
There is only one thread/tokenstream for *Op. But I am happy to add the random 
tests and remove the code smell? It may also confuse profilers which is what is 
really needed here to see where the performance issues are.
   
   Some things in opennlp are definitely hurting performance, such as only 
supporting String in their apis. This will create gazillions of objects. But I 
doubt it's the biggest issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321167684

   > Does this library also check for race conditions that can arise between 
ResourceLoaderAware::inform vs TokenStream creation and processing? I know it 
may be out of the scope of this change but I would be curious to know.. Open 
NLP has these heavy-weight global resources and I am worried as to how they get 
initialized. There is an intractable amount code paths to analyze once you take 
into account all of the configuration possibilities of OpenNLP which seems to 
be partially based on the binary model file you provide.
   
   You can see the code here to the current checker:
   
   
https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java#L949-L951
   
   Basically, we analyze content single-threaded and then spin up threads with 
a "starting gun" and ensure they create the same result as the single-thread to 
discover races.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321170185

   > Does this library also check for race conditions that can arise between 
ResourceLoaderAware::inform vs TokenStream creation and processing? I know it 
may be out of the scope of this change but I would be curious to know..
   
   Specific to this, I think one potential plan: we could refactor the tests 
more to check for it. Existing tests are using `BaseTokenStreamTestCase` but we 
could also test factories with 
https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamFactoryTestCase.java
   
   And maybe we could add evil stuff to this `BaseTokenStreamFactoryTestCase` 
to root out any factory-specific thread hazards across all of our factories 
(including opennlp).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321171664

   and looking more, I think the answer is "no there isn't any race". Calling 
inform() is basically like part of the initialization of the thing. The factory 
isn't "done" being created until after this happens. It is silly that it is a 
separate method but that's just legacy cruft. You can look at CustomAnalyzer 
source code to see what I mean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321172760

   As far as better stress-testing factories in its base class, we already 
stress-test using multiple tokenstreams (threads) across a single factory here. 
   
   We could add stuff to the BaseTokenStreamFactoryTestCase to create N 
factories of the same type, just for more paranoia, in case there was some race 
with shared resources across multiple instances of the same factory.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


kotman12 commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321174402

   I will take a look at the test locally, thanks.
   
   >  There is only one thread/tokenstream for *Op.
   
   I think we are talking about different things. The mutable members within 
*Op are not my concern, although I could see how you would think that because 
at first glance that is what those synch methods may have been guarding. I am 
more worried about the lazy initialization of members at the Factory level 
which appear to be singletons and are called from Filter::create, which 
presumably happens on multiple threads at once (and even today is 
unsynchronized). Please correct me if my thinking is incorrect but when I saw 
that I started to question the open-nlp library at a deeper level.
   
   I'll reiterate that I see no immediate problem with this change since I 
can't prove that these synchronized blocks were even doing anything. But I am 
questioning the thread safety of the integration as it currently stands, 
irrespective of this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


kotman12 commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321176867

   > I will take a look at the test locally, thanks.
   > 
   > > There is only one thread/tokenstream for *Op.
   > 
   > I think we are talking about different things. The mutable members within 
*Op are not my concern, although I could see how you would think that because 
at first glance that is what those synch methods may have been guarding. I am 
more worried about the lazy initialization of members at the **NLP** Factory 
level which appear to be singletons and are indirectly called from lucene's 
`FilterFactory::create`, which presumably happens on multiple threads at once 
(and even today is unsynchronized). Please c`rrect me if my thinking is 
incorrect but when I saw that I started to question the open-nlp library at a 
deeper level.
   > 
   > I'll reiterate that I see no immediate problem with this change since I 
can't prove that these synchronized blocks were even doing anything. But I am 
questioning the thread safety of the integration as it currently stands, 
irrespective of this change.
   
   I might raise a separate issue to not confuse this thread anymore... what I 
am talking about is really only tangentially related to this current PR but in 
my mind it puts into question the thread safety of the integration as it stands 
now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kotman12 commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


kotman12 commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321181537

   > As far as better stress-testing factories in its base class, we already 
stress-test using multiple tokenstreams (threads) across a single factory here.
   > 
   > We could add stuff to the BaseTokenStreamFactoryTestCase to create N 
factories of the same type, just for more paranoia, in case there was some race 
with shared resources across multiple instances of the same factory.
   
   Yes this is exactly my paranoia 😃 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321189244

   ok, good, I think we are on the same page. sorry, i dont know anything about 
opennlp and I'm not familiar with their codebase at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11955: Remove synchronization from OpenNLP integration and add thread-safety tests(checkRandomData)

2022-11-20 Thread GitBox


rmuir commented on PR #11955:
URL: https://github.com/apache/lucene/pull/11955#issuecomment-1321193321

   and i don't mind doing the work here on the testing honestly. or we can do 
it in another PR.
   basically I think we should engage the BaseAnalysisFactoryTestCase with 
simple basic tests like the other factories have. that's a start. today the 
factories are not really tested directly. Usually its best to at least check 
different arguments/parameters and exceptional cases, e.g. 
https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TestLimitTokenOffsetFilterFactory.java
   
   then we can look at if base test class needs any beefing up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss closed issue #10709: gradle precommit sometimes fails with "IOException: stream closed" from javadoc in nightly benchmarks [LUCENE-9670]

2022-11-20 Thread GitBox


dweiss closed issue #10709: gradle precommit sometimes fails with "IOException: 
stream closed" from javadoc in nightly benchmarks [LUCENE-9670]
URL: https://github.com/apache/lucene/issues/10709


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on issue #10236: look into adding -XX:ActiveProcessorCount=1 to parallel build jvms [LUCENE-9196]

2022-11-20 Thread GitBox


dweiss commented on issue #10236:
URL: https://github.com/apache/lucene/issues/10236#issuecomment-1321204164

   Implemented by #11927 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss closed issue #10236: look into adding -XX:ActiveProcessorCount=1 to parallel build jvms [LUCENE-9196]

2022-11-20 Thread GitBox


dweiss closed issue #10236: look into adding -XX:ActiveProcessorCount=1 to 
parallel build jvms [LUCENE-9196]
URL: https://github.com/apache/lucene/issues/10236


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir opened a new issue, #11957: false warning around 'gitStatus' on newish macos X

2022-11-20 Thread GitBox


rmuir opened a new issue, #11957:
URL: https://github.com/apache/lucene/issues/11957

   ### Description
   
   When I run builds on mac, I see:
   ```
   > Task :gitStatus
   Cannot run program "/usr/bin/git" (in directory "/usr/bin"): error=2, No 
such file or directory
   ```
   
   But the error is actually harmless. `checkWorkingCopy` etc work as you 
expect. It just confuses me...
   
   The jgit does pretty crazy stuff on macos hosts, and I think the issues are 
there? On my mac `/usr/bin/git` is indeed git and not some xcode wrapper.
   
   See logic in 
https://git.eclipse.org/c/jgit/jgit.git/tree/org.eclipse.jgit/src/org/eclipse/jgit/util/FS_POSIX.java#n145
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on issue #11957: false warning around 'gitStatus' on newish macos X

2022-11-20 Thread GitBox


dweiss commented on issue #11957:
URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321223132

   Perhaps we should upgrade jgit first? scriptDepVersions.gradle, the latest 
one is 6.3.0.202209071007, 
https://repo1.maven.org/maven2/org/eclipse/jgit/org.eclipse.jgit/6.3.0.202209071007-r/
   
   I'm surprised it searches for native git... wasn't it supposed to be pure 
java?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on issue #11957: false warning around 'gitStatus' on newish macos X

2022-11-20 Thread GitBox


dweiss commented on issue #11957:
URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321226220

   I looked at the source and the only reason it tries to locate system git is 
to find out where its config files are. It's actually configurable via:
   ```
/**
 * The environment variable that blocks use of the system config file
 *
 * @since 3.3
 */
public static final String GIT_CONFIG_NOSYSTEM_KEY = 
"GIT_CONFIG_NOSYSTEM";
```
   
   So maybe we can just switch it off this way? I don't think it'll make a 
difference for us.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on issue #11957: false warning around 'gitStatus' on newish macos X

2022-11-20 Thread GitBox


dweiss commented on issue #11957:
URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321229903

   Yeah... it's absolutely terrible that jgit blindly forks jgit off the path. 
I just made the Lucene build open a file editor by renaming it to git.exe...
   
   I think we should just set the GIT_CONFIG_NOSYSTEM env. variable in gradlew 
launch scripts. Dumb.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #11957: false warning around 'gitStatus' on newish macos X

2022-11-20 Thread GitBox


rmuir commented on issue #11957:
URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321248831

   good find, thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow opened a new pull request, #11958: GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes

2022-11-20 Thread GitBox


mdmarshmallow opened a new pull request, #11958:
URL: https://github.com/apache/lucene/pull/11958

   ### Description
   
   Added a new `FilterIndexInput` and `FilterIndexOutput` class to handle 
delegate `IndexOutput` and `IndexInput` classes.
   
   I also changed some of the current subclasses that use these delegate 
patterns to extend `FilterIndexInput` and `FilterIndexOutput`. Here is a list 
of the subclasses I changed/didn't change:
   
   * IndexOutput changed:
   * EndiannessReverserIndexOutput
   * RateLimitedIndexOutput
   * SlowIndexOutput
   * CorruptingIndexOutput
   * MockIndexOutputWrapper
   * ThrottledIndexOutput
   
   * IndexOutput unchanged:
   * ByteBuffersIndexOutput
   * OutputStreamIndexOutput
   * DirectIOIndexOutput
   
   * IndexInput changed:
   * EndiannessReverserIndexInput
   * SeekCountingStream
   * CountingStream
   * SlowIndexInput
   * ReadBytesIndexInputWrapper
   * MockIndexWrapper
   
   * IndexInput unchanged:
   * BufferedIndexInput
   * ByteBufferIndexInput
   * ByteBuffersIndexInput
   * ChecksumIndexInput
   * MemorySegmentIndexInput
   * InterceptingIndexInput
   * DirectIOIndexInput


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #11957: false warning around 'gitStatus' on newish macos X

2022-11-20 Thread GitBox


rmuir commented on issue #11957:
URL: https://github.com/apache/lucene/issues/11957#issuecomment-1321256630

   i tried it out, `GIT_CONFIG_NOSYSTEM=1 ./gradlew gitStatus` no longer prints 
the confusing error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on pull request #11815: Support deletions in rearrange (#11814)

2022-11-20 Thread GitBox


stefanvodita commented on PR #11815:
URL: https://github.com/apache/lucene/pull/11815#issuecomment-1321267832

   Thank you for the ideas @zhaih , I’ve incorporated all of them. Your 
suggestion of using `ByteRef` in the keyset led to me finding a bug in the 
tests. The index creation methods didn’t delete any docs because I had stored 
the "delete" marker in a `BinaryDocValuesField`. That’s fixed now and the tests 
are more thorough with their checks.
   
   @mikemccand , I don’t know how doc-values updates work. Why would they not 
be covered by this solution?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #11815: Support deletions in rearrange (#11814)

2022-11-20 Thread GitBox


zhaih commented on code in PR #11815:
URL: https://github.com/apache/lucene/pull/11815#discussion_r1027612271


##
lucene/misc/src/java/org/apache/lucene/misc/index/BinaryDocValueSelector.java:
##
@@ -30,52 +31,94 @@
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.util.BitSet;
 import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.FixedBitSet;
 
-/** Select documents using binary doc values */
+/**
+ * Use this selector to rearrange an index where documents can be uniquely 
identified based on
+ * {@link BinaryDocValues}
+ */
 public class BinaryDocValueSelector implements 
IndexRearranger.DocumentSelector, Serializable {
 
   private final String field;
-  private final HashSet keySet;
+  private final Set keySet;
 
-  public BinaryDocValueSelector(String field, HashSet keySet) {
+  public BinaryDocValueSelector(String field, Set keySet) {
 this.field = field;
 this.keySet = keySet;
   }
 
   @Override
-  public BitSet getFilteredLiveDocs(CodecReader reader) throws IOException {
+  public BitSet getFilteredDocs(CodecReader reader) throws IOException {
 BinaryDocValues binaryDocValues = reader.getBinaryDocValues(field);
-Bits oldLiveDocs = reader.getLiveDocs();
 FixedBitSet bits = new FixedBitSet(reader.maxDoc());
-for (int i = 0; i < reader.maxDoc(); i++) {
-  if (oldLiveDocs != null && oldLiveDocs.get(i) == false) {
-continue;
-  }
-  if (binaryDocValues.advanceExact(i)
-  && keySet.contains(binaryDocValues.binaryValue().utf8ToString())) {
-bits.set(i);
+for (int docid = 0; docid < reader.maxDoc(); docid++) {
+  if (binaryDocValues.advanceExact(docid) && 
keySet.contains(binaryDocValues.binaryValue())) {
+bits.set(docid);
   }
 }
 return bits;
   }
 
-  public static List createFromExistingIndex(
+  /**
+   * Create a selector for the deletes in an index, which can then be applied 
to a rearranged index
+   *
+   * @param field tells which {@link BinaryDocValues} are the unique key
+   * @param directory where the original index is present
+   * @return a deletes selector to be passed to {@link IndexRearranger}
+   */
+  public static IndexRearranger.DocumentSelector createDeleteSelectorFromIndex(
   String field, Directory directory) throws IOException {
-List selectors = new ArrayList<>();
+
+Set keySet = new HashSet<>();
+
 try (IndexReader reader = DirectoryReader.open(directory)) {
   for (LeafReaderContext context : reader.leaves()) {
-HashSet keySet = new HashSet<>();
 Bits liveDocs = context.reader().getLiveDocs();
+if (liveDocs == null) {
+  continue;
+}
+
 BinaryDocValues binaryDocValues = 
context.reader().getBinaryDocValues(field);
-for (int i = 0; i < context.reader().maxDoc(); i++) {
-  if (liveDocs != null && liveDocs.get(i) == false) {
+for (int docid = 0; docid < context.reader().maxDoc(); docid++) {
+  if (liveDocs.get(docid) == true) {
 continue;
   }
-  if (binaryDocValues.advanceExact(i)) {
-keySet.add(binaryDocValues.binaryValue().utf8ToString());
+
+  if (binaryDocValues.advanceExact(docid)) {
+keySet.add(new 
BytesRef(binaryDocValues.binaryValue().bytes.clone()));

Review Comment:
   See comment below :)



##
lucene/misc/src/java/org/apache/lucene/misc/index/BinaryDocValueSelector.java:
##
@@ -30,52 +31,94 @@
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.util.BitSet;
 import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.FixedBitSet;
 
-/** Select documents using binary doc values */
+/**
+ * Use this selector to rearrange an index where documents can be uniquely 
identified based on
+ * {@link BinaryDocValues}
+ */
 public class BinaryDocValueSelector implements 
IndexRearranger.DocumentSelector, Serializable {
 
   private final String field;
-  private final HashSet keySet;
+  private final Set keySet;
 
-  public BinaryDocValueSelector(String field, HashSet keySet) {
+  public BinaryDocValueSelector(String field, Set keySet) {
 this.field = field;
 this.keySet = keySet;
   }
 
   @Override
-  public BitSet getFilteredLiveDocs(CodecReader reader) throws IOException {
+  public BitSet getFilteredDocs(CodecReader reader) throws IOException {
 BinaryDocValues binaryDocValues = reader.getBinaryDocValues(field);
-Bits oldLiveDocs = reader.getLiveDocs();
 FixedBitSet bits = new FixedBitSet(reader.maxDoc());
-for (int i = 0; i < reader.maxDoc(); i++) {
-  if (oldLiveDocs != null && oldLiveDocs.get(i) == false) {
-continue;
-  }
-  if (binaryDocValues.advanceExact(i)
-  && keySet.contains(binaryDocValues.binaryValue().utf8ToString())) {
-bits.set(i);
+for (int docid = 0; docid < reader.maxDoc();