[PR] Updated heuristic to remove non diverse edges keeping overall graph c… [lucene]

2023-11-08 Thread via GitHub
nitirajrathore opened a new pull request, #12783: URL: https://github.com/apache/lucene/pull/12783 …onnected. No test cases, unoptimized, draft only version. ### Description Details in this comment : https://github.com/apache/lucene/issues/12627#issuecomment-1801982741 I wil

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-08 Thread via GitHub
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1802424885 @benwtrent : I have added draft PR. The code is not at all optimized right now for performance and I am hoping to fix some obvious stuff and will post perf results here.

[PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub
gf2121 opened a new pull request, #12784: URL: https://github.com/apache/lucene/pull/12784 Following https://github.com/apache/lucene/pull/12775, this PR tries another approach to speed up `BytesRefHash#sort`: The idea is that since we have extra ints in this map, we can cache the bucket

Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-08 Thread via GitHub
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1802480151 I came up with https://github.com/apache/lucene/pull/12784 as another idea to speed up `BytesRefHash#sort`, which has been shown to have performance improvements running on Intel chips.

Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub
benwtrent merged PR #12780: URL: https://github.com/apache/lucene/pull/12780 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.a

Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]

2023-11-08 Thread via GitHub
gsmiller commented on PR #12454: URL: https://github.com/apache/lucene/pull/12454#issuecomment-1802519348 > @gsmiller, I think this PR is ready. Is there anything else you'd like to see changed? Gah! I'm sorry I missed this. I'll have a look here shortly. Apologies again.

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-08 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1802705393 @kevindrosendahl if I am reading the code correctly, it does the following: - Write int8 quantized vectors along side the vector ordinals in the graph (`.vex` or whatever h

Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub
dungba88 commented on PR #12778: URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802753207 Thank you for reproducing this! I found the bug; it's quite silly. The node address is the last address, so I should have done this: ``` copiedNodes.append(fallbackTable.cop

[PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-08 Thread via GitHub
uschindler opened a new pull request, #12785: URL: https://github.com/apache/lucene/pull/12785 Unfortunately, the solution in #12707 did not work well under concurrency. The `isAlive` status of `MemorySegment.Scope` may be stale. In that case the `IllegalStateException` was caught, but the

Re: [PR] MMapDirectory with MemorySegment: Confirm that scope/session is no longer alive before throwing AlreadyClosedException [lucene]

2023-11-08 Thread via GitHub
uschindler commented on PR #12707: URL: https://github.com/apache/lucene/pull/12707#issuecomment-1802765955 This code did not work well as the `isAlive` status may be stale in other threads. I reworked this one here: #12785

[PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub
dungba88 opened a new pull request, #12786: URL: https://github.com/apache/lucene/pull/12786 ### Description See the previous PR: https://github.com/apache/lucene/pull/12778 There was a bug in that PR: the copiedNodeAddress is the last address (inclusive) of the node, thus the

Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-08 Thread via GitHub
uschindler commented on PR #12785: URL: https://github.com/apache/lucene/pull/12785#issuecomment-1802841410 The bug in JDK is here: https://bugs.openjdk.org/browse/JDK-8319756

Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub
dungba88 commented on PR #12786: URL: https://github.com/apache/lucene/pull/12786#issuecomment-1802854503 > Could not copy file '/home/runner/work/lucene/lucene/lucene/JRE_VERSION_MIGRATION.md' to '/home/runner/work/lucene/lucene/lucene/documentation/build/site/JRE_VERSION_MIGRATION.html'.

Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub
dungba88 commented on code in PR #12786: URL: https://github.com/apache/lucene/pull/12786#discussion_r1387298308 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -289,21 +273,38 @@ public long getNodeAddress(long hashSlot) { } /** - * Set t

Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]

2023-11-08 Thread via GitHub
gsmiller merged PR #12454: URL: https://github.com/apache/lucene/pull/12454

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387436327 ## lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java: ## @@ -45,20 +43,6 @@ public boolean withCollector() { return

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387438590 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -232,8 +232,8 @@ public static TopDocs merge( /** * Returns a new TopFieldDocs, containing

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387443316 ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java: ## @@ -174,7 +173,7 @@ private static boolean canEarlyTerminateOnPrefix(Sort searchSort, Sort

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387449164 ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387450681 ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java: ## @@ -429,106 +432,29 @@ public static TopFieldCollector create(Sort sort, int numHits, int

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387453179 ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -44,7 +43,7 @@ public void setScorer(Scorable scorer) throws IOException { } Re

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387453543 ## lucene/core/src/test/org/apache/lucene/document/BaseSpatialTestCase.java: ## @@ -695,8 +695,8 @@ protected void verifyRandomDistanceQueries(IndexReader reader, Obj

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub
zacharymorn commented on PR #240: URL: https://github.com/apache/lucene/pull/240#issuecomment-1803126121 > > We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies) > > For example: query concurrency

Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803144990 Even faster than the original approach on M2: ``` BASELINE: sort 5169965 terms, build histogram took: 489ms, reorder took: 1359ms, total took: 2381ms. BASELINE: sort 5169965 ter

Re: [PR] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-08 Thread via GitHub
LuXugang merged PR #12405: URL: https://github.com/apache/lucene/pull/12405

Re: [I] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-08 Thread via GitHub
LuXugang closed issue #12401: Skip docs with Docvalues in NumericLeafComparator URL: https://github.com/apache/lucene/issues/12401

Re: [I] Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20 [lucene]

2023-11-08 Thread via GitHub
davido commented on issue #12307: URL: https://github.com/apache/lucene/issues/12307#issuecomment-1803192152 @uschindler We are using [Bazel](https://bazel.build) build system, and merging the two JARs like this: # Merge jars so # META-INF/services/org.apache.lucene.code

Re: [PR] Enable executing using NFA in RegexpQuery [lucene]

2023-11-08 Thread via GitHub
zhaih commented on code in PR #12767: URL: https://github.com/apache/lucene/pull/12767#discussion_r1387529653 ## lucene/core/src/test/org/apache/lucene/search/TestRegexpQuery.java: ## @@ -80,7 +80,10 @@ private long caseInsensitiveRegexQueryNrHits(String regex) throws IOExcepti

Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803223146 As "reorder" gets faster, I'm considering lowering the fallback threshold and letting radix sort do more of the work. ### Benchmark result: MAC Intel ``` BASELINE: sor

Re: [PR] speedup arm int functions? [lucene]

2023-11-08 Thread via GitHub
rmuir closed pull request #12743: speedup arm int functions? URL: https://github.com/apache/lucene/pull/12743

Re: [PR] speedup arm int functions? [lucene]

2023-11-08 Thread via GitHub
rmuir commented on PR #12743: URL: https://github.com/apache/lucene/pull/12743#issuecomment-1803265618 speeds up as many machines as it slows down. cascadelake: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 827

[PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-08 Thread via GitHub
rmuir opened a new pull request, #12787: URL: https://github.com/apache/lucene/pull/12787 This saves me a lot of time and prevents making bad changes that help some cpus and hurt others. Case in point: #12743 You run a command such as: ``` make PATCH_BRANCH=rmuir:some-spe

Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803319176 I tried this approach with `wikimedium10m` on the M2 Mac; the total time spent in sort decreased by ~60%. Details: https://bytedance.larkoffice.com/sheets/XfVCsZL5phx9letbDEQcaw0snnf

Re: [I] Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20 [lucene]

2023-11-08 Thread via GitHub
uschindler commented on issue #12307: URL: https://github.com/apache/lucene/issues/12307#issuecomment-1803320694 Hi, this has nothing to do with the minimum Java version in your Java build; it only has to do with the runtime Java version. If you use Java 19 or later, the classloader need

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
rmuir commented on PR #12787: URL: https://github.com/apache/lucene/pull/12787#issuecomment-1803343093 When I run `make PATCH_BRANCH=rmuir:microbenchmark_ec2` we will just see no differences but it demonstrates it (sorry: no speedups in this branch!). It spins up/tears down `lucene-jm

Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-09 Thread via GitHub
gf2121 commented on code in PR #12748: URL: https://github.com/apache/lucene/pull/12748#discussion_r1387636706 ## lucene/CHANGES.txt: ## @@ -106,6 +106,8 @@ Optimizations * GITHUB#12552: Make FSTPostingsFormat load FSTs off-heap. (Tony X) +* GITHUB#12748: Specialize arc sto

Re: [PR] remove non-NRT replication support [lucene]

2023-11-09 Thread via GitHub
dweiss commented on PR #12038: URL: https://github.com/apache/lucene/pull/12038#issuecomment-1803391323 > If anyone is still using the legacy non-NRT mode, please let me know on this issue and give me your IP address, so I can try to pop a shell. Oh, I missed this bit somehow, @rmuir.

[I] Unrolle or vectorize Math.max in CompetitiveImpactAccumulator.addAll? [lucene]

2023-11-09 Thread via GitHub
vsop-479 opened a new issue, #12788: URL: https://github.com/apache/lucene/issues/12788 ### Description Is it worth unrolling or vectorizing Math.max in CompetitiveImpactAccumulator.addAll? Maybe the scalar loop can be auto-vectorized by the JIT, but there is some speedup with unrolle
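
To make the question in #12788 concrete, here is a minimal sketch of an element-wise max merge written two ways: a plain scalar loop and a manually 4-way unrolled loop. It assumes the loop in question merges two `int[]` arrays with `Math.max`; the class and method names (`MaxMergeSketch`, `mergeMaxScalar`, `mergeMaxUnrolled`) are hypothetical and are not Lucene code. Whether unrolling helps depends on the JIT and the CPU, so this only shows the shape of the change, not a measured result.

```java
import java.util.Arrays;

// Hypothetical illustration; not the Lucene implementation.
final class MaxMergeSketch {

  /** Scalar baseline: dst[i] = max(dst[i], src[i]) for every index. */
  public static void mergeMaxScalar(int[] dst, int[] src) {
    for (int i = 0; i < dst.length; i++) {
      dst[i] = Math.max(dst[i], src[i]);
    }
  }

  /** Same result, with the loop body unrolled four times plus a scalar tail. */
  public static void mergeMaxUnrolled(int[] dst, int[] src) {
    int i = 0;
    int bound = dst.length & ~3; // largest multiple of 4 that is <= length
    for (; i < bound; i += 4) {
      dst[i] = Math.max(dst[i], src[i]);
      dst[i + 1] = Math.max(dst[i + 1], src[i + 1]);
      dst[i + 2] = Math.max(dst[i + 2], src[i + 2]);
      dst[i + 3] = Math.max(dst[i + 3], src[i + 3]);
    }
    for (; i < dst.length; i++) { // remaining 0..3 elements
      dst[i] = Math.max(dst[i], src[i]);
    }
  }

  public static void main(String[] args) {
    int[] src = {4, 2, 8, 6, 5, 1};
    int[] a = {1, 9, 3, 7, 5, 0};
    int[] b = a.clone();
    mergeMaxScalar(a, src);
    mergeMaxUnrolled(b, src);
    System.out.println(Arrays.equals(a, b)); // both variants must agree
  }
}
```

Both methods assume `src` is at least as long as `dst`. Any speed comparison between them would need a proper benchmark harness across several machines.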

Re: [I] Unrolle or vectorize Math.max in CompetitiveImpactAccumulator.addAll? [lucene]

2023-11-09 Thread via GitHub
vsop-479 commented on issue #12788: URL: https://github.com/apache/lucene/issues/12788#issuecomment-1803464509 @jpountz Please take a look when you get a chance!

Re: [I] Unrolle or vectorize Math.max in CompetitiveImpactAccumulator.addAll? [lucene]

2023-11-09 Thread via GitHub
uschindler commented on issue #12788: URL: https://github.com/apache/lucene/issues/12788#issuecomment-1803537830 Hi, for correct vectorization please make use of the official Lucene framework (add your implementation class' instance for the scalar and the vectorized variant as a sepa

Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]

2023-11-09 Thread via GitHub
stefanvodita commented on PR #12454: URL: https://github.com/apache/lucene/pull/12454#issuecomment-1803605667 Thanks Greg! I think the delay is partially my fault, I had mentioned a different G. Miller in my message 😄

Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on PR #12748: URL: https://github.com/apache/lucene/pull/12748#issuecomment-1803615768 > I can help merge this in and backport if there is no objection in 48h. Thanks @gf2121 -- we should backport all these recent exciting FST changes in the right order as a batch

Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-09 Thread via GitHub
mikemccand merged PR #12748: URL: https://github.com/apache/lucene/pull/12748

Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on PR #12454: URL: https://github.com/apache/lucene/pull/12454#issuecomment-1803628190 > Thanks Greg! I think the delay is partially my fault, I had mentioned a different G. Miller in my message 😄 Seems to be a common mistake recently! See this [recent hilarious e

Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-09 Thread via GitHub
easyice commented on PR #12748: URL: https://github.com/apache/lucene/pull/12748#issuecomment-1803649666 @mikemccand @gf2121 Thanks for reviewing and merging it ;-)

Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12785: URL: https://github.com/apache/lucene/pull/12785#issuecomment-1803669858 After some discussion with @mcimadamore we figured out that there are more problems, so we need to rely on the exception message. The following problem can occur and possibly hap

Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12785: URL: https://github.com/apache/lucene/pull/12785#issuecomment-1803692612 I committed another change so that `IndexInput#close()` first tries to close the segment and then sets everything to null. In case of an ISE, the IndexInput is not closed.

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
epotyom commented on code in PR #12769: URL: https://github.com/apache/lucene/pull/12769#discussion_r1387902988 ## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java: ## @@ -476,6 +479,86 @@ public void testOpenIfChangedReplaceTaxon

Re: [I] Take advantage of bloom filter when delete terms [lucene]

2023-11-09 Thread via GitHub
s1monw commented on issue #12725: URL: https://github.com/apache/lucene/issues/12725#issuecomment-1803718796 yeah I think we should check if it's memory and time efficient. I think in theory we could iterate the terms in the automaton against the bloom filter to take advantage of it inside

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1803774655 > Really, if we'd be implementing the feature today would we use a bitset or maybe a sparse DV field recording the number of children for each block in the index? In fact, in o

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on PR #12769: URL: https://github.com/apache/lucene/pull/12769#issuecomment-1803988678 > I can make sure that we have a task that calls this method (indirectly) in the next step for this issue - adding bulk Facets#getSpecificValues, will that be ok? +1, thanks!

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on code in PR #12769: URL: https://github.com/apache/lucene/pull/12769#discussion_r1388134130 ## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java: ## @@ -570,16 +654,20 @@ public void testAccountable() throws

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
mikemccand merged PR #12769: URL: https://github.com/apache/lucene/pull/12769

[PR] Improve vector search speed by using FixedBitSet [lucene]

2023-11-09 Thread via GitHub
benwtrent opened a new pull request, #12789: URL: https://github.com/apache/lucene/pull/12789 While doing some performance testing and digging into flamegraphs, I noticed for smaller vectors (96dim float32), we were losing a fair bit of time within the `SparseFixedBitSet#getAndSet` method.
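
As a rough illustration of why a dense bitset can be cheap per call, here is a minimal sketch of a get-and-set operation over a plain `long[]`: one shift to find the word, one mask, one OR. The class `DenseBitSetSketch` is hypothetical and written for this note; it is not Lucene's `FixedBitSet` or `SparseFixedBitSet`, and it says nothing about how the PR itself was implemented. The trade-off is memory: a dense layout reserves one bit for every possible index up front.

```java
// Hypothetical illustration; not Lucene's FixedBitSet.
final class DenseBitSetSketch {
  private final long[] words;

  /** Allocates one bit per index in [0, numBits). */
  public DenseBitSetSketch(int numBits) {
    this.words = new long[(numBits + 63) >>> 6];
  }

  /** Sets the bit at index and returns whether it was already set. */
  public boolean getAndSet(int index) {
    int wordNum = index >> 6;     // which 64-bit word holds the bit
    long mask = 1L << index;      // Java uses only the low 6 bits of the shift count
    boolean wasSet = (words[wordNum] & mask) != 0;
    words[wordNum] |= mask;
    return wasSet;
  }

  public static void main(String[] args) {
    DenseBitSetSketch seen = new DenseBitSetSketch(130);
    System.out.println(seen.getAndSet(129)); // first visit: bit was clear
    System.out.println(seen.getAndSet(129)); // second visit: bit was already set
  }
}
```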

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on PR #12769: URL: https://github.com/apache/lucene/pull/12769#issuecomment-1804048913 I think this is safe to backport to 9.x? I'll do that, and move the `CHANGES.txt` entry down.

Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12785: URL: https://github.com/apache/lucene/pull/12785#issuecomment-1804063179 I fixed the `close()` method to no longer throw `IllegalStateException` as this would violate the contract. When we close only `IOException` is allowed. As half-open index inputs are

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
rmuir commented on PR #12787: URL: https://github.com/apache/lucene/pull/12787#issuecomment-1804143223 I still struggle with the noise, it is even more than when you run the benchmarks manually. I inspected an instance under test and saw e.g. scheduled job burning up CPU rebuilding m

Re: [PR] Improve vector search speed by using FixedBitSet [lucene]

2023-11-09 Thread via GitHub
jpountz commented on PR #12789: URL: https://github.com/apache/lucene/pull/12789#issuecomment-1804146598 I can believe that FixedBitSet is faster in some cases, but it's surprising to me that the memory usage of SparseFixedBitSet can go up to 2x that of FixedBitSet, this makes me wonder if

Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]

2023-11-09 Thread via GitHub
gokaai commented on code in PR #12530: URL: https://github.com/apache/lucene/pull/12530#discussion_r1388271749 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -610,6 +610,31 @@ public Status checkIndex(List onlySegments, ExecutorService executorServ

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
jpountz commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388273135 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintWriter.java: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [I] Unrolle or vectorize Math.max in CompetitiveImpactAccumulator.addAll? [lucene]

2023-11-09 Thread via GitHub
jpountz commented on issue #12788: URL: https://github.com/apache/lucene/issues/12788#issuecomment-1804189940 Oh, it's sad that this loop doesn't get auto-vectorized. Out of curiosity, are you seeing it show up in some benchmarks?

Re: [I] Reproducible failure in TestIndexWriter.testHasUncommittedChanges [lucene]

2023-11-09 Thread via GitHub
jpountz commented on issue #12763: URL: https://github.com/apache/lucene/issues/12763#issuecomment-1804193016 I'm away from my main working computer this week, I suspect it's a similar issue that I saw elsewhere where merges cascade. I'll look into it on Monday if nobody beats me to it.

Re: [PR] Improve vector search speed by using FixedBitSet [lucene]

2023-11-09 Thread via GitHub
benwtrent commented on PR #12789: URL: https://github.com/apache/lucene/pull/12789#issuecomment-1804203048 @jpountz I re-ran my tests and double-checked my numbers. I have some corrections: I accidentally double-counted sparse sizes, so the previous numbers are 2x too big. GLOVE-100-100_

Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope and make sure IndexInput#close() does not throw IllegalStateException and waits instead [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12785: URL: https://github.com/apache/lucene/pull/12785#issuecomment-1804242819 I let `TestMmapDirectory.testAceWithThreads` run with `gradlew :lucene:core:beast` with many iterations and high multiplier: JDK 19, 20, 21 showed no problems.

Re: [PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-11-09 Thread via GitHub
jimczi commented on PR #12729: URL: https://github.com/apache/lucene/pull/12729#issuecomment-1804243772 Sorry for the late reply. > Since this is a larger API discussion, do we think we can move forward with the way it is now (quantization for HNSW and other vector indices) and itera

Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope and make sure IndexInput#close() does not throw IllegalStateException and waits instead [lucene]

2023-11-09 Thread via GitHub
uschindler merged PR #12785: URL: https://github.com/apache/lucene/pull/12785

Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on code in PR #12530: URL: https://github.com/apache/lucene/pull/12530#discussion_r1388366133 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -610,6 +610,31 @@ public Status checkIndex(List onlySegments, ExecutorService executorServ

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-09 Thread via GitHub
uschindler commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1804313295 Hi, the commit causes test failures like this from time to time: ``` org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader > testGetPathAndOrdinalsRandomMul

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12769: URL: https://github.com/apache/lucene/pull/12769#issuecomment-1804314417 Hi, the commit causes test failures like this from time to time: ``` org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader > testGetPathAndOrdinalsRandomMultithr

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12769: URL: https://github.com/apache/lucene/pull/12769#issuecomment-1804316973 Looks like the ordinals array sizes must be at least 1, so in general the initial setup of the ordinal size must use `numOrdinals = random(limit) + 1;`
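
The off-by-one behind that suggestion can be sketched as follows. A bounded random draw that starts at 0 can return 0, which would produce an empty array; adding 1 shifts the range to [1, limit]. This is a minimal sketch in which `java.util.Random#nextInt(int)` stands in for the `random(limit)` helper mentioned in the comment (an assumption), and `OrdinalCountSketch` and `randomNonZeroCount` are hypothetical names, not code from the test.

```java
import java.util.Random;

// Hypothetical illustration; not the actual test code.
final class OrdinalCountSketch {

  /** Returns a count in [1, limit]; nextInt(limit) alone would return [0, limit - 1]. */
  public static int randomNonZeroCount(Random random, int limit) {
    return random.nextInt(limit) + 1;
  }

  public static void main(String[] args) {
    Random random = new Random(42L);
    int numOrdinals = randomNonZeroCount(random, 10);
    int[] ordinals = new int[numOrdinals]; // never zero-length
    System.out.println(ordinals.length >= 1);
  }
}
```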

Re: [PR] Refactoring HNSW to use a new internal FlatVectorFormat [lucene]

2023-11-09 Thread via GitHub
benwtrent commented on PR #12729: URL: https://github.com/apache/lucene/pull/12729#issuecomment-1804353255 @jimczi updated the title.

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on PR #12769: URL: https://github.com/apache/lucene/pull/12769#issuecomment-1804396963 Thanks Uwe and sorry! I think Egor is digging on this or I’ll revert soon. Mike On Thu, Nov 9, 2023 at 1:17 PM Uwe Schindler ***@***.***> wrote: > Assigned #127

[PR] Fix random test TestDirectoryTaxonomyReader#TestDirectoryTaxonomyReader [lucene]

2023-11-09 Thread via GitHub
epotyom opened a new pull request, #12790: URL: https://github.com/apache/lucene/pull/12790 Fix bug from https://github.com/apache/lucene/pull/12769

Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

2023-11-09 Thread via GitHub
epotyom commented on PR #12769: URL: https://github.com/apache/lucene/pull/12769#issuecomment-1804432895 Hi all, Sorry for the bug, this pull request should fix it: https://github.com/apache/lucene/pull/12790 Kind regards, Egor On Thu, 9 Nov 2023 at 18:52, Michael M

Re: [PR] Fix random test TestDirectoryTaxonomyReader#TestDirectoryTaxonomyReader [lucene]

2023-11-09 Thread via GitHub
mikemccand merged PR #12790: URL: https://github.com/apache/lucene/pull/12790

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1804446870 OK fixed @uschindler -- sorry!

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-09 Thread via GitHub
mikemccand commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1804447251 And thanks @epotyom!

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-09 Thread via GitHub
gsmiller commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1804469669 Thanks @epotyom! Should we consider a follow up PR that leverages this new bulk lookup by adding something like `Facets#getSpecificValues` that gets facet values for multiple paths

Re: [PR] Fix random test TestDirectoryTaxonomyReader#testGetPathAndOrdinalsRandomMultithreading [lucene]

2023-11-09 Thread via GitHub
epotyom commented on PR #12790: URL: https://github.com/apache/lucene/pull/12790#issuecomment-1804531938 I've re-run the tests multiple times just in case, there were no errors: ``` ./gradlew -p lucene/facet test --tests "*TestDirectoryTaxonomyReader*" -Ptests.iters=1000 ...

Re: [PR] Fix random test TestDirectoryTaxonomyReader#testGetPathAndOrdinalsRandomMultithreading [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12790: URL: https://github.com/apache/lucene/pull/12790#issuecomment-1804593054 I also need to merge this into the Java 22 mmap branch that Jenkins runs on. #12706

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-09 Thread via GitHub
epotyom commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1804594098 @gsmiller yes, I'll be working on that now as well as adding benchmark task for getSpecificValues, as was discussed with Mike in https://github.com/apache/lucene/pull/12769#pullrequ

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-09 Thread via GitHub
gsmiller commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1804667028 @epotyom got it, thanks! Didn't see that earlier conversation.

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
rmuir commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388595300 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintReader.java: ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one o

Re: [PR] Fix random test TestDirectoryTaxonomyReader#testGetPathAndOrdinalsRandomMultithreading [lucene]

2023-11-09 Thread via GitHub
uschindler commented on PR #12790: URL: https://github.com/apache/lucene/pull/12790#issuecomment-1804722810 OK merged to java 22 branch. Tests pass.

Re: [I] Unrolle or vectorize Math.max in CompetitiveImpactAccumulator.addAll? [lucene]

2023-11-09 Thread via GitHub
rmuir commented on issue #12788: URL: https://github.com/apache/lucene/issues/12788#issuecomment-1804722948 > Oh, it's sad that this loop doesn't get auto-vectorized automatically. Out of curiosity, are you seeing it show up in some benchmarks I don't believe that, there is code to do

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
uschindler commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388633378 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
rmuir commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388637999 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude anyt

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
rmuir commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388639032 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude anyt

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
uschindler commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388639477 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
rmuir commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388642669 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude anyt

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
uschindler commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388643017 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
rmuir commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388647325 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude anyt

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
uschindler commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388642702 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
uschindler commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388648419 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-09 Thread via GitHub
uschindler commented on code in PR #12787: URL: https://github.com/apache/lucene/pull/12787#discussion_r1388668515 ## gradle/validation/rat-sources.gradle: ## @@ -53,6 +53,9 @@ allprojects { include "**/*.sh" include "**/*.bat" +// exclude

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-09 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1804858072 Perhaps much of the jvector performance improvement is simply from on heap caching. https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/git

Re: [I] Unrolle or vectorize Math.max in CompetitiveImpactAccumulator.addAll? [lucene]

2023-11-09 Thread via GitHub
vsop-479 commented on issue #12788: URL: https://github.com/apache/lucene/issues/12788#issuecomment-1804999164 > To benchmark then use the benchmark-jmh Gradle module. This will enable vectorization if all is sane. Thanks for your explanation. I will try it. > are you seeing it

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
easyice commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388843109 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintWriter.java: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one
