[GitHub] [lucene] romseygeek commented on a diff in pull request #11990: Don't let merged passages push out lower-scoring ones
romseygeek commented on code in PR #11990: URL: https://github.com/apache/lucene/pull/11990#discussion_r1036896059 ## lucene/highlighter/src/java/org/apache/lucene/search/matchhighlight/PassageSelector.java: ## @@ -89,8 +89,9 @@ public List pickBest( } // Best passages so far. +int pqSize = Math.max(markers.size(), maxPassages); Review Comment: > min(markers.size(), maxPassages * 3) Another option might be to just have a minimum size of 16, given that this is only really a problem when you're asking for two or three passages. Once you get above 10 passages then one or two less becomes less of an issue -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
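The "minimum size of 16" idea from the review comment above can be sketched as follows. This is only an illustration of the suggestion, not the code that was merged; the class name, method name, and the `MIN_QUEUE_SIZE` constant are hypothetical.

```java
/** Hypothetical helper illustrating the review suggestion; not taken from PassageSelector. */
final class PassageQueueSizing {
  /** Assumed floor, taken from the comment "a minimum size of 16". */
  static final int MIN_QUEUE_SIZE = 16;

  /**
   * Returns a queue size that never drops below the floor, so that a request for two or
   * three passages still keeps spare candidates around after adjacent passages are merged.
   */
  static int queueSize(int maxPassages) {
    return Math.max(MIN_QUEUE_SIZE, maxPassages);
  }

  public static void main(String[] args) {
    System.out.println(queueSize(3));  // small request: the floor applies
    System.out.println(queueSize(40)); // large request: maxPassages wins
  }
}
```

With a floor like this, small requests get headroom while large requests are unaffected, which matches the observation that losing one or two passages matters less above 10.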
[GitHub] [lucene] renthus opened a new issue, #11991: Export Luke File
renthus opened a new issue, #11991: URL: https://github.com/apache/lucene/issues/11991 ### Description Luke project: Lucene Toolbox Project v9.4.1. I'm trying to export files from Regex:BR_CAR_PLATE, for example, but it generates an error due to the `:` character, and we can't change that because the system generates the file in the folder with this name. ### Version and environment details _No response_
[GitHub] [lucene] dweiss commented on a diff in pull request #11990: Don't let merged passages push out lower-scoring ones
dweiss commented on code in PR #11990: URL: https://github.com/apache/lucene/pull/11990#discussion_r1036988933 ## lucene/highlighter/src/java/org/apache/lucene/search/matchhighlight/PassageSelector.java: ## @@ -89,8 +89,9 @@ public List pickBest( } // Best passages so far. +int pqSize = Math.max(markers.size(), maxPassages); Review Comment: That sounds good to me as well.
[GitHub] [lucene] romseygeek merged pull request #11990: Don't let merged passages push out lower-scoring ones
romseygeek merged PR #11990: URL: https://github.com/apache/lucene/pull/11990
[GitHub] [lucene] luyuncheng commented on pull request #11987: Make Decompressor release memory buffer
luyuncheng commented on PR #11987: URL: https://github.com/apache/lucene/pull/11987#issuecomment-1333730146

> We could run the StoredFieldsBenchmark before and after the change with -Dorg.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.enableBulkMerge=false to force the slow merge path.

> In all cases when running the benchmark, we may want to explicitly supply smaller heap (-Xmx),

@rmuir I just modified https://github.com/mikemccand/luceneutil/blob/master/src/python/runStoredFieldsBenchmark.py#L43 with

`command = f'{localconstants.JAVA_EXE} -Xmx256m -Dorg.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.enableBulkMerge=false -cp {lucene_core_jar}:build perf.StoredFieldsBenchmark {geonames_csv_in} {localconstants.INDEX_DIR_BASE}/geonames-stored-fields {mode} {doc_limit}'`

I ran runStoredFieldsBenchmark in four different configurations. The following tables show only a small performance regression:

runStoredFieldsBenchmark.py __enableBulkMerge=false__

| | Baseline | Candidate |
| :--- | :---: | ---: |
| indexing_time_msec | | |
| BEST_SPEED | 365665.00 | 372287.00 |
| BEST_COMPRESSION | 849157.00 | 848813.00 |
| retrieved_time_msec | | |
| BEST_SPEED | 246.62 | 269.32 |
| BEST_COMPRESSION | 2606.98 | 2634.53 |

runStoredFieldsBenchmark.py __enableBulkMerge=false -Xmx1g__

| | Baseline | Candidate |
| :--- | :---: | ---: |
| indexing_time_msec | | |
| BEST_SPEED | 372457.00 | 366094.00 |
| BEST_COMPRESSION | 850273.00 | 852397.00 |
| retrieved_time_msec | | |
| BEST_SPEED | 247.70 | 279.11 |
| BEST_COMPRESSION | 2585.59 | 2633.83 |

runStoredFieldsBenchmark.py __enableBulkMerge=false -Xmx512m__

| | Baseline | Candidate |
| :--- | :---: | ---: |
| indexing_time_msec | | |
| BEST_SPEED | 368389.00 | 370878.00 |
| BEST_COMPRESSION | 851277.00 | 850121.00 |
| retrieved_time_msec | | |
| BEST_SPEED | 256.80 | 280.52 |
| BEST_COMPRESSION | 2576.36 | 2645.32 |

runStoredFieldsBenchmark.py __enableBulkMerge=false -Xmx256m__

| | Baseline | Candidate |
| :--- | :---: | ---: |
| indexing_time_msec | | |
| BEST_SPEED | 366735.00 | 368407.00 |
| BEST_COMPRESSION | 849980.00 | 852214.00 |
| retrieved_time_msec | | |
| BEST_SPEED | 256.10 | 278.06 |
| BEST_COMPRESSION | 2584.96 | 2632.69 |
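To put the baseline and candidate columns on a common scale, the relative change can be computed as (candidate - baseline) / baseline. The sketch below is only an illustration; the class and method names are hypothetical, and the sample values are copied from the first table (enableBulkMerge=false).

```java
/** Illustrative only: relative change between baseline and candidate timings. */
final class RelativeChange {
  /** Percentage change from baseline to candidate; positive means the candidate is slower. */
  static double percentChange(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // Values copied from the first table in the comment above.
    System.out.printf("indexing  BEST_SPEED: %+.2f%%%n", percentChange(365665.00, 372287.00));
    System.out.printf("retrieval BEST_SPEED: %+.2f%%%n", percentChange(246.62, 269.32));
  }
}
```

Comparing percentages rather than raw milliseconds makes it easier to see whether the indexing and retrieval numbers move by similar amounts across the four heap settings.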
[GitHub] [lucene] thecoop opened a new pull request, #11992: Create shared method for creating view buffers in ByteBufferIndexInput
thecoop opened a new pull request, #11992: URL: https://github.com/apache/lucene/pull/11992 Add some tests around reading values in ByteBuffersDataInput. The refactorings are from #11982.
[GitHub] [lucene] thecoop commented on pull request #11982: Change ByteBuffersDataInput and ByteBuffersIndexInput to use absolute addressing
thecoop commented on PR #11982: URL: https://github.com/apache/lucene/pull/11982#issuecomment-1333760940 I've separated out the refactoring into #11992, so that can be reviewed as-is
[GitHub] [lucene] thecoop closed pull request #11982: Change ByteBuffersDataInput and ByteBuffersIndexInput to use absolute addressing
thecoop closed pull request #11982: Change ByteBuffersDataInput and ByteBuffersIndexInput to use absolute addressing URL: https://github.com/apache/lucene/pull/11982
[GitHub] [lucene] rmuir commented on pull request #11987: Make Decompressor release memory buffer
rmuir commented on PR #11987: URL: https://github.com/apache/lucene/pull/11987#issuecomment-1333788113 thanks for running. somehow i think bulk merge didnt get disabled. without bulk merge optimization, indexing time should be significantly higher, the benchmark should be very very slow.
[GitHub] [lucene] rmuir commented on pull request #11971: Disable useless error-prone checks (libraries/frameworks we do not use)
rmuir commented on PR #11971: URL: https://github.com/apache/lucene/pull/11971#issuecomment-1333792349 no feedback for a few days, i'm going to merge this as it disables stuff that's really not controversial, as it is providing no value. as far as which of the other checks are enabled, we can debate in other issues.
[GitHub] [lucene] rmuir merged pull request #11971: Disable useless error-prone checks (libraries/frameworks we do not use)
rmuir merged PR #11971: URL: https://github.com/apache/lucene/pull/11971
[GitHub] [lucene] rmuir commented on pull request #11974: fix wrong serialization by ShapeDocValues
rmuir commented on PR #11974: URL: https://github.com/apache/lucene/pull/11974#issuecomment-1333827718 @nknize any thoughts? sorry i haven't had time to dig into the guts of this, but would be good to get this check enabled
[GitHub] [lucene] luyuncheng commented on pull request #11987: Make Decompressor release memory buffer
luyuncheng commented on PR #11987: URL: https://github.com/apache/lucene/pull/11987#issuecomment-1333932432

> thanks for running. somehow i think bulk merge didnt get disabled. without bulk merge optimization, indexing time should be significantly higher, the benchmark should be very very slow.

@rmuir I am also curious about it. But I used __arthas__ vmtools to confirm that BULK_MERGE_ENABLED is __false__, and I also manually set lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#BULK_MERGE_ENABLED to __false__.

With enableBulkMerge=false -Xmx256m the results are almost the same:

| | Baseline | Candidate |
| :--- | :---: | ---: |
| indexing_time_msec | | |
| BEST_SPEED | 366735.00 | 368407.00 |
| BEST_COMPRESSION | 849980.00 | 852214.00 |
| retrieved_time_msec | | |
| BEST_SPEED | 256.10 | 278.06 |
| BEST_COMPRESSION | 2584.96 | 2632.69 |

Also, my runStoredFieldsBenchmark `geonames.txt` dataset is about 1.5GB, with 12489745 lines of allCountries data.
[GitHub] [lucene] benwtrent commented on issue #11963: Improve vector quantization API
benwtrent commented on issue #11963: URL: https://github.com/apache/lucene/issues/11963#issuecomment-1334127223 @jpountz > which I don't like much for only two types that need to be supported? I do think we will add more types in the future. Specifically `binary`, which can use an optimized hamming distance. This is useful for image search. Splitting it all up like this is a big effort, but I suppose it's how it should have been from the beginning. Frustrating to block https://github.com/apache/lucene/pull/11860 for this. I will see about iterating on this refactor and see what the fallout is. @rmuir if you have any further opinions on how you think this should look, let me know.
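For readers unfamiliar with why a `binary` vector type allows an optimized distance: when bits are packed into `long` words, the hamming distance reduces to an XOR followed by a population count. The sketch below illustrates that general technique only; it is not a Lucene API, and the class and method names are hypothetical.

```java
/** Illustrative sketch of hamming distance over bit-packed vectors; not part of Lucene. */
final class BinaryHamming {
  /** Counts the bit positions in which two equally sized bit-packed vectors differ. */
  static int distance(long[] a, long[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // XOR leaves a 1 wherever the two words differ; bitCount tallies those ones.
      distance += Long.bitCount(a[i] ^ b[i]);
    }
    return distance;
  }

  public static void main(String[] args) {
    long[] x = {0b1011L};
    long[] y = {0b0010L};
    System.out.println(distance(x, y));
  }
}
```

Each loop iteration compares 64 dimensions at once, which is where the speed-up over a per-dimension float comparison would come from.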
[GitHub] [lucene] nknize commented on pull request #11974: fix wrong serialization by ShapeDocValues
nknize commented on PR #11974: URL: https://github.com/apache/lucene/pull/11974#issuecomment-1334422558 Yeah, nice catch @rmuir!! Thanks for enabling the logical assignment error-prone check. I had no idea that was disabled! > I feel like some sort of testcase should have been failing here all along? 💯! No test cases failed here because the edge membership boolean is only used in `CONTAINS` queries, which [isn't yet supported](https://github.com/apache/lucene/blob/0a9bb6e2ace2e07512d67dc346ca5d19d473c538/lucene/core/src/java/org/apache/lucene/document/BaseShapeDocValuesQuery.java#L48). Either way, I agree w/ you that we should have an explicit check that the serialization matches what's expected from the original shape. I'll write up a simple one to ensure coverage here. Thanks for catching this!
[GitHub] [lucene] rmuir commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections
rmuir commented on code in PR #11860: URL: https://github.com/apache/lucene/pull/11860#discussion_r1037699596

## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/ExpandingVectorValues.java:

## @@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.backward_codecs.lucene94;
+
+import java.io.IOException;
+import org.apache.lucene.index.FilterVectorValues;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BytesRef;
+
+/** reads from byte-encoded data */
+public class ExpandingVectorValues extends FilterVectorValues {
+
+  private final float[] value;
+
+  /**
+   * Constructs ExpandingVectorValues with passed byte encoded VectorValues iterator
+   *
+   * @param in the wrapped values
+   */
+  protected ExpandingVectorValues(VectorValues in) {

Review Comment:

@msokolov it's easy to make me look like the insane madman here and poke fun at me. but understand my frustration, I raised this on the original PR and it was just totally ignored :( https://github.com/apache/lucene/pull/947#issuecomment-1185833705

I'm not even sure whether it was intentional or not. I'm not assuming any malice here, just communicating my frustration.

for some reason the PRs kept getting blown away and completely recreated. yet I still tried to continue to help (and commented on all 3 PRs), but this made life even more difficult for me.

please, unblock this issue and do whatever you like. i'm out of your way.
[GitHub] [lucene] rmuir commented on issue #11963: Improve vector quantization API
rmuir commented on issue #11963: URL: https://github.com/apache/lucene/issues/11963#issuecomment-1334640898 @benwtrent go do your other issue first if you prefer. sorry for the trouble.
[GitHub] [lucene] rmuir commented on pull request #11974: fix wrong serialization by ShapeDocValues
rmuir commented on PR #11974: URL: https://github.com/apache/lucene/pull/11974#issuecomment-1334646881

@nknize thank you for looking. one other related suggestion i have for this code in the future would be to change code such as this:

```
header |= 0x01;
```

to something like this:

```
// named constants grouped together which kinda documents the "format" so it is a bit easier
// these names are just examples and might not be the best ones :)
static final int HAS_LEFT_SUBTREE = 1 << 0;
static final int HAS_RIGHT_SUBTREE = 1 << 1;
...
header |= HAS_LEFT_SUBTREE;
```
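A self-contained version of the named-constant idea might look like the sketch below. The two flag names are the example names from the comment above (which notes they may not be the best ones); the `HeaderFlags` class and its `encode`, `hasLeft`, and `hasRight` methods are hypothetical and do not reflect the actual ShapeDocValues format.

```java
/** Hypothetical illustration of named header bits; not the real ShapeDocValues layout. */
final class HeaderFlags {
  // Grouping the constants in one place documents which bit means what.
  static final int HAS_LEFT_SUBTREE = 1 << 0;
  static final int HAS_RIGHT_SUBTREE = 1 << 1;

  /** Packs the two booleans into a header value. */
  static int encode(boolean hasLeft, boolean hasRight) {
    int header = 0;
    if (hasLeft) {
      header |= HAS_LEFT_SUBTREE;
    }
    if (hasRight) {
      header |= HAS_RIGHT_SUBTREE;
    }
    return header;
  }

  /** Reads the left-subtree bit back out of a header value. */
  static boolean hasLeft(int header) {
    return (header & HAS_LEFT_SUBTREE) != 0;
  }

  /** Reads the right-subtree bit back out of a header value. */
  static boolean hasRight(int header) {
    return (header & HAS_RIGHT_SUBTREE) != 0;
  }

  public static void main(String[] args) {
    int header = encode(true, false);
    System.out.println(hasLeft(header) + " " + hasRight(header));
  }
}
```

Using the same constants on both the write and read side also makes a round-trip test straightforward, which is the kind of serialization check discussed earlier in this thread.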
[GitHub] [lucene] rmuir closed issue #11973: ShapeDocValues wrong serialization
rmuir closed issue #11973: ShapeDocValues wrong serialization URL: https://github.com/apache/lucene/issues/11973
[GitHub] [lucene] rmuir merged pull request #11974: fix wrong serialization by ShapeDocValues
rmuir merged PR #11974: URL: https://github.com/apache/lucene/pull/11974
[GitHub] [lucene] luyuncheng commented on pull request #11987: Make Decompressor release memory buffer
luyuncheng commented on PR #11987: URL: https://github.com/apache/lucene/pull/11987#issuecomment-1334866892 > Yes, before we work around all that stuff here, I'd also suggest to remove those ThreadLocals. @uschindler I think this issue just has a GC path through ThreadLocals. But consider ES, for instance: on a node with 1000 shards and typically 40 segments per shard, each open segment allocates one buffer with about 100KB of retained heap, so this would use about 4G of resident heap memory, even if some segments are rarely used.
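The 4G figure follows from multiplying the three numbers in the comment: 1000 shards x 40 segments x 100KB = 4,000,000KB. The sketch below just reproduces that arithmetic; the class and method names are hypothetical, and treating 1KB as 1024 bytes is an assumption (with that convention the total is about 3.8 GiB).

```java
/** Back-of-the-envelope estimate of heap retained by per-segment buffers; illustrative only. */
final class RetainedBufferEstimate {
  /** Total bytes retained if every open segment holds one buffer of the given size. */
  static long retainedBytes(long shards, long segmentsPerShard, long bytesPerBuffer) {
    return shards * segmentsPerShard * bytesPerBuffer;
  }

  public static void main(String[] args) {
    // Numbers from the comment above; 1KB is assumed to mean 1024 bytes.
    long bytes = retainedBytes(1000, 40, 100L * 1024);
    double gib = bytes / (1024.0 * 1024.0 * 1024.0);
    System.out.printf("%d bytes (about %.1f GiB)%n", bytes, gib);
  }
}
```

The estimate scales linearly in each factor, so halving the buffer size or the number of open segments halves the retained heap.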