[GitHub] [lucene] romseygeek commented on a diff in pull request #11990: Don't let merged passages push out lower-scoring ones

2022-12-01 Thread GitBox


romseygeek commented on code in PR #11990:
URL: https://github.com/apache/lucene/pull/11990#discussion_r1036896059


##
lucene/highlighter/src/java/org/apache/lucene/search/matchhighlight/PassageSelector.java:
##
@@ -89,8 +89,9 @@ public List pickBest(
 }
 
 // Best passages so far.
+int pqSize = Math.max(markers.size(), maxPassages);

Review Comment:
   > min(markers.size(), maxPassages * 3)
   
   Another option might be to just have a minimum size of 16, given that this 
is only really a problem when you're asking for two or three passages.  Once 
you get above 10 passages, having one or two fewer becomes less of an issue.
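
A minimal sketch of the sizing option described above, assuming a floor of 16 entries for the queue of candidate passages. The class, method, and constant names are hypothetical and are not taken from PassageSelector; only `maxPassages` and the value 16 come from the comment.

```java
// Hypothetical helper for illustration; not the actual PassageSelector code.
final class PassageQueueSizeSketch {
  // Assumed floor, taken from the review comment ("a minimum size of 16").
  static final int MIN_QUEUE_SIZE = 16;

  // The queue is never smaller than the floor, so a request for two or three
  // passages still keeps spare candidates available.
  static int queueSize(int maxPassages) {
    return Math.max(MIN_QUEUE_SIZE, maxPassages);
  }

  public static void main(String[] args) {
    System.out.println(queueSize(3));
    System.out.println(queueSize(40));
  }
}
```

With this shape, small requests get the fixed floor, while larger requests simply use `maxPassages` as the queue size.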



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] renthus opened a new issue, #11991: Export Luke File

2022-12-01 Thread GitBox


renthus opened a new issue, #11991:
URL: https://github.com/apache/lucene/issues/11991

   ### Description
   
   Luke project: Lucene Toolbox Project v9.4.1
   
   I'm trying to export files from Regex:BR_CAR_PLATE, for example, but it 
generates an error because of the `:` character, and we can't change that 
because the system generates the file in the folder with this name. 
   
![print](https://user-images.githubusercontent.com/49447595/205037089-52840d8d-2577-4c91-96af-41afc641ebc2.PNG)
   
   
   ### Version and environment details
   
   _No response_





[GitHub] [lucene] dweiss commented on a diff in pull request #11990: Don't let merged passages push out lower-scoring ones

2022-12-01 Thread GitBox


dweiss commented on code in PR #11990:
URL: https://github.com/apache/lucene/pull/11990#discussion_r1036988933


##
lucene/highlighter/src/java/org/apache/lucene/search/matchhighlight/PassageSelector.java:
##
@@ -89,8 +89,9 @@ public List pickBest(
 }
 
 // Best passages so far.
+int pqSize = Math.max(markers.size(), maxPassages);

Review Comment:
   That sounds good to me as well.






[GitHub] [lucene] romseygeek merged pull request #11990: Don't let merged passages push out lower-scoring ones

2022-12-01 Thread GitBox


romseygeek merged PR #11990:
URL: https://github.com/apache/lucene/pull/11990





[GitHub] [lucene] luyuncheng commented on pull request #11987: Make Decompressor release memory buffer

2022-12-01 Thread GitBox


luyuncheng commented on PR #11987:
URL: https://github.com/apache/lucene/pull/11987#issuecomment-1333730146

   > We could run the StoredFieldsBenchmark before and after the change with 
-Dorg.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.enableBulkMerge=false
 to force the slow merge path.
   > In all cases when running the benchmark, we may want to explicitly supply 
smaller heap (-Xmx),
   
   @rmuir I just modified 
https://github.com/mikemccand/luceneutil/blob/master/src/python/runStoredFieldsBenchmark.py#L43
 with 
   `command = f'{localconstants.JAVA_EXE} -Xmx256m 
-Dorg.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.enableBulkMerge=false
 -cp {lucene_core_jar}:build perf.StoredFieldsBenchmark {geonames_csv_in} 
{localconstants.INDEX_DIR_BASE}/geonames-stored-fields {mode} {doc_limit}`
   
   I ran runStoredFieldsBenchmark four times; the following tables show the 
results, which indicate only small performance regressions:
   
   runStoredFieldsBenchmark.py __enableBulkMerge=false__
   |                     | Baseline  | Candidate |
   | :------------------ | :-------: | --------: |
   | indexing_time_msec  |           |           |
   | BEST_SPEED          | 365665.00 | 372287.00 |
   | BEST_COMPRESSION    | 849157.00 | 848813.00 |
   | retrieved_time_msec |           |           |
   | BEST_SPEED          |    246.62 |    269.32 |
   | BEST_COMPRESSION    |   2606.98 |   2634.53 |
   
   runStoredFieldsBenchmark.py __enableBulkMerge=false -Xmx1g__
   |                     | Baseline  | Candidate |
   | :------------------ | :-------: | --------: |
   | indexing_time_msec  |           |           |
   | BEST_SPEED          | 372457.00 | 366094.00 |
   | BEST_COMPRESSION    | 850273.00 | 852397.00 |
   | retrieved_time_msec |           |           |
   | BEST_SPEED          |    247.70 |    279.11 |
   | BEST_COMPRESSION    |   2585.59 |   2633.83 |
   
   runStoredFieldsBenchmark.py __enableBulkMerge=false -Xmx512m__
   |                     | Baseline  | Candidate |
   | :------------------ | :-------: | --------: |
   | indexing_time_msec  |           |           |
   | BEST_SPEED          | 368389.00 | 370878.00 |
   | BEST_COMPRESSION    | 851277.00 | 850121.00 |
   | retrieved_time_msec |           |           |
   | BEST_SPEED          |    256.80 |    280.52 |
   | BEST_COMPRESSION    |   2576.36 |   2645.32 |
   
   runStoredFieldsBenchmark.py __enableBulkMerge=false -Xmx256m__
   |                     | Baseline  | Candidate |
   | :------------------ | :-------: | --------: |
   | indexing_time_msec  |           |           |
   | BEST_SPEED          | 366735.00 | 368407.00 |
   | BEST_COMPRESSION    | 849980.00 | 852214.00 |
   | retrieved_time_msec |           |           |
   | BEST_SPEED          |    256.10 |    278.06 |
   | BEST_COMPRESSION    |   2584.96 |   2632.69 |





[GitHub] [lucene] thecoop opened a new pull request, #11992: Create shared method for creating view buffers in ByteBufferIndexInput

2022-12-01 Thread GitBox


thecoop opened a new pull request, #11992:
URL: https://github.com/apache/lucene/pull/11992

   Add some tests around reading values in ByteBuffersDataInput
   
   The refactorings are from #11982





[GitHub] [lucene] thecoop commented on pull request #11982: Change ByteBuffersDataInput and ByteBuffersIndexInput to use absolute addressing

2022-12-01 Thread GitBox


thecoop commented on PR #11982:
URL: https://github.com/apache/lucene/pull/11982#issuecomment-1333760940

   I've separated out the refactoring into #11992, so that can be reviewed as-is





[GitHub] [lucene] thecoop closed pull request #11982: Change ByteBuffersDataInput and ByteBuffersIndexInput to use absolute addressing

2022-12-01 Thread GitBox


thecoop closed pull request #11982: Change ByteBuffersDataInput and 
ByteBuffersIndexInput to use absolute addressing
URL: https://github.com/apache/lucene/pull/11982





[GitHub] [lucene] rmuir commented on pull request #11987: Make Decompressor release memory buffer

2022-12-01 Thread GitBox


rmuir commented on PR #11987:
URL: https://github.com/apache/lucene/pull/11987#issuecomment-1333788113

   thanks for running. somehow i think bulk merge didnt get disabled. without 
bulk merge optimization, indexing time should be significantly higher, the 
benchmark should be very very slow.





[GitHub] [lucene] rmuir commented on pull request #11971: Disable useless error-prone checks (libraries/frameworks we do not use)

2022-12-01 Thread GitBox


rmuir commented on PR #11971:
URL: https://github.com/apache/lucene/pull/11971#issuecomment-1333792349

   no feedback for a few days, i'm going to merge this as it disables stuff 
that's really not controversial, as it is providing no value. as far as which 
of the other checks are enabled, we can debate in other issues.





[GitHub] [lucene] rmuir merged pull request #11971: Disable useless error-prone checks (libraries/frameworks we do not use)

2022-12-01 Thread GitBox


rmuir merged PR #11971:
URL: https://github.com/apache/lucene/pull/11971





[GitHub] [lucene] rmuir commented on pull request #11974: fix wrong serialization by ShapeDocValues

2022-12-01 Thread GitBox


rmuir commented on PR #11974:
URL: https://github.com/apache/lucene/pull/11974#issuecomment-1333827718

   @nknize any thoughts? sorry i haven't had time to dig into the guts of this, 
but would be good to get this check enabled





[GitHub] [lucene] luyuncheng commented on pull request #11987: Make Decompressor release memory buffer

2022-12-01 Thread GitBox


luyuncheng commented on PR #11987:
URL: https://github.com/apache/lucene/pull/11987#issuecomment-1333932432

   > thanks for running. somehow i think bulk merge didnt get disabled. without 
bulk merge optimization, indexing time should be significantly higher, the 
benchmark should be very very slow.
   
   @rmuir I am also curious about it. 
   But I used __arthas__ vmtools to confirm that BULK_MERGE_ENABLED is __false__:
   
![image](https://user-images.githubusercontent.com/12760367/205089684-cd61c66f-0918-4ab8-9fe7-e63be6d70fd7.png)
   
   I also manually set 
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#BULK_MERGE_ENABLED
 to __false__.
   
   With enableBulkMerge=false -Xmx256m, the results are almost the same:
   
   |                     | Baseline  | Candidate |
   | :------------------ | :-------: | --------: |
   | indexing_time_msec  |           |           |
   | BEST_SPEED          | 366735.00 | 368407.00 |
   | BEST_COMPRESSION    | 849980.00 | 852214.00 |
   | retrieved_time_msec |           |           |
   | BEST_SPEED          |    256.10 |    278.06 |
   | BEST_COMPRESSION    |   2584.96 |   2632.69 |
   
   Also, my runStoredFieldsBenchmark `geonames.txt` dataset is about 1.5GB, 
with 12489745 lines of allCountries data.





[GitHub] [lucene] benwtrent commented on issue #11963: Improve vector quantization API

2022-12-01 Thread GitBox


benwtrent commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1334127223

   @jpountz 
   
   > which I don't like much for only two types that need to be supported?
   
   I do think we will add more types in the future. Specifically `binary` which 
can use an optimized hamming distance. This is useful for image search.
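
For readers unfamiliar with the idea, here is a minimal sketch of a Hamming distance between two bit-packed vectors. It is illustrative only: the class and method names are hypothetical and not part of any Lucene API, and it assumes each vector is stored as a `byte[]` of equal length.

```java
// Illustrative only; not a Lucene API.
final class HammingDistanceSketch {
  // Counts the bit positions at which the two bit-packed vectors differ.
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // XOR leaves a 1 wherever the bits differ; the mask drops sign extension.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  public static void main(String[] args) {
    byte[] x = {0x0F, 0x00};
    byte[] y = {0x00, (byte) 0xFF};
    System.out.println(hammingDistance(x, y));
  }
}
```

An optimized version would likely process several bytes at a time (for example with `Long.bitCount`), but the per-byte loop keeps the idea visible.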
   
   Splitting it all up like this is a big effort, but I suppose it's how it 
should have been from the beginning.
   
   Frustrating to block https://github.com/apache/lucene/pull/11860 for this. 
   
   I will see about iterating on this refactor and see what the fallout is.
   
   @rmuir if you have any further opinions on how you think this should look, 
let me know.





[GitHub] [lucene] nknize commented on pull request #11974: fix wrong serialization by ShapeDocValues

2022-12-01 Thread GitBox


nknize commented on PR #11974:
URL: https://github.com/apache/lucene/pull/11974#issuecomment-1334422558

   Yeah, nice catch @rmuir!!  Thanks for enabling the logical assignment 
error-prone check. I had no idea that was disabled! 
   
   > I feel like some sort of testcase should have been failing here all along?
   
   💯!  
   
   No test cases failed here because the edge membership boolean is only used 
in `CONTAINS` queries, which [isn't yet 
supported](https://github.com/apache/lucene/blob/0a9bb6e2ace2e07512d67dc346ca5d19d473c538/lucene/core/src/java/org/apache/lucene/document/BaseShapeDocValuesQuery.java#L48).
 Either way, I agree w/ you that we should have an explicit check that the 
serialization matches what's expected from the original shape. I'll write up a 
simple one to ensure coverage here. Thanks for catching this!





[GitHub] [lucene] rmuir commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-12-01 Thread GitBox


rmuir commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1037699596


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/ExpandingVectorValues.java:
##
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.backward_codecs.lucene94;
+
+import java.io.IOException;
+import org.apache.lucene.index.FilterVectorValues;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BytesRef;
+
+/** reads from byte-encoded data */
+public class ExpandingVectorValues extends FilterVectorValues {
+
+  private final float[] value;
+
+  /**
+   * Constructs ExpandingVectorValues with passed byte encoded VectorValues 
iterator
+   *
+   * @param in the wrapped values
+   */
+  protected ExpandingVectorValues(VectorValues in) {

Review Comment:
   @msokolov it's easy to make me look like the insane madman here and poke fun at 
me. but understand my frustration, I raised this on the original PR and it was 
just totally ignored :( 
https://github.com/apache/lucene/pull/947#issuecomment-1185833705
   
   I'm not even sure whether it was intentional. I'm not assuming any malice 
here, just communicating my frustration. for some reason the PRs kept getting 
blown away and completely recreated. yet I still tried to continue to help (and 
commented on all 3 PRs), but this made life even more difficult for me.
   
   please, unblock this issue and do whatever you like. i'm out of your way.






[GitHub] [lucene] rmuir commented on issue #11963: Improve vector quantization API

2022-12-01 Thread GitBox


rmuir commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1334640898

   @benwtrent go do your other issue first if you prefer. sorry for the trouble.





[GitHub] [lucene] rmuir commented on pull request #11974: fix wrong serialization by ShapeDocValues

2022-12-01 Thread GitBox


rmuir commented on PR #11974:
URL: https://github.com/apache/lucene/pull/11974#issuecomment-1334646881

   @nknize thank you for looking. 
   
   one other related suggestion i have for this code in the future would be to 
change code such as this:
   ```
   header |= 0x01;
   ```
   
   to something like this:
   ```
   // named constants grouped together which kinda documents the "format" so it
   // is a bit easier
   // these names are just examples and might not be the best ones :)
   static final int HAS_LEFT_SUBTREE = 1 << 0;
   static final int HAS_RIGHT_SUBTREE = 1 << 1;
   ... 
   header |= HAS_LEFT_SUBTREE;
   ```
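
A self-contained version of the same idea, as a sketch: the two constant names are the example names from the comment above, while the class and the `encode`/`isSet` helpers are hypothetical and not taken from the ShapeDocValues code.

```java
// Hypothetical illustration of named bit flags; not the ShapeDocValues format.
final class HeaderFlagsSketch {
  // Each flag occupies one bit of the header.
  static final int HAS_LEFT_SUBTREE = 1 << 0;
  static final int HAS_RIGHT_SUBTREE = 1 << 1;

  // Builds a header by OR-ing in the flags that apply.
  static int encode(boolean hasLeft, boolean hasRight) {
    int header = 0;
    if (hasLeft) {
      header |= HAS_LEFT_SUBTREE;
    }
    if (hasRight) {
      header |= HAS_RIGHT_SUBTREE;
    }
    return header;
  }

  // Reads a flag back using the same named constant used to write it.
  static boolean isSet(int header, int flag) {
    return (header & flag) != 0;
  }

  public static void main(String[] args) {
    int header = encode(true, false);
    System.out.println(isSet(header, HAS_LEFT_SUBTREE));
    System.out.println(isSet(header, HAS_RIGHT_SUBTREE));
  }
}
```

Using the same constants on the write and read sides keeps the two in sync, which is the readability benefit the comment is after.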
   





[GitHub] [lucene] rmuir closed issue #11973: ShapeDocValues wrong serialization

2022-12-01 Thread GitBox


rmuir closed issue #11973: ShapeDocValues wrong serialization
URL: https://github.com/apache/lucene/issues/11973





[GitHub] [lucene] rmuir merged pull request #11974: fix wrong serialization by ShapeDocValues

2022-12-01 Thread GitBox


rmuir merged PR #11974:
URL: https://github.com/apache/lucene/pull/11974





[GitHub] [lucene] luyuncheng commented on pull request #11987: Make Decompressor release memory buffer

2022-12-01 Thread GitBox


luyuncheng commented on PR #11987:
URL: https://github.com/apache/lucene/pull/11987#issuecomment-1334866892

   > Yes, before we work around all that stuff here, I'd also suggest to remove 
those ThreadLocals.
   
   @uschindler I think this issue is only about the GC path of ThreadLocals. But, 
for instance in ES, on a node with 1000 shards and typically 40 segments per 
shard, each open segment allocates one buffer with a retained heap of 100KB, 
so this would use about 4G of resident heap memory, even if some segments are 
rarely used. 
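
The estimate above can be checked with a short calculation. This is a sketch under stated assumptions: the class and method names are hypothetical, the figures (1000 shards, 40 segments per shard, 100KB per buffer) are the ones quoted in the comment, and 1KB is taken to be 1024 bytes.

```java
// Hypothetical back-of-the-envelope helper; the inputs come from the comment.
final class DecompressorBufferEstimateSketch {
  // Total retained bytes if every open segment keeps one buffer alive.
  static long retainedBytes(long shards, long segmentsPerShard, long bytesPerBuffer) {
    return shards * segmentsPerShard * bytesPerBuffer;
  }

  public static void main(String[] args) {
    long bytes = retainedBytes(1000, 40, 100L * 1024); // 100KB, assuming 1KB = 1024 bytes
    System.out.println(bytes + " bytes");
    System.out.printf("%.2f GiB%n", bytes / (1024.0 * 1024.0 * 1024.0));
  }
}
```

That is 40,000 buffers in total, or 4,096,000,000 bytes, which matches the roughly 4G figure quoted.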

