[PR] supports force merge based on specified segments. [lucene]

2025-01-22 Thread via GitHub
cheng66551 opened a new pull request, #14163: URL: https://github.com/apache/lucene/pull/14163 In version 7.6.0 of ElasticSearch, I found through /_cat/segments that the docs.deleted count of many segments was continuously increasing, but over time, **these deleted documents were never auto

Re: [PR] feat: Added the method `forceMergeBySegmentNames` in IW, which suppor… [lucene]

2025-01-22 Thread via GitHub
cheng66551 closed pull request #14162: feat: Added the method `forceMergeBySegmentNames` in IW, which suppor… URL: https://github.com/apache/lucene/pull/14162 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[PR] feat: Added the method `forceMergeBySegmentNames` in IW, which suppor… [lucene]

2025-01-22 Thread via GitHub
cheng66551 opened a new pull request, #14162: URL: https://github.com/apache/lucene/pull/14162 In version 7.6.0 of ElasticSearch, I found through /_cat/segments that the docs.deleted count of many segments was continuously increasing, but over time, **these deleted documents were never auto

[I] UnsupportedOperationException instead of IllegalArgumentException from PointInSetQuery when values are out of order [lucene]

2025-01-22 Thread via GitHub
jhinch-at-atlassian-com opened a new issue, #14161: URL: https://github.com/apache/lucene/issues/14161 ### Description PointInSetQuery in its constructor will check if the values provided to it are in order and if not will attempt to throw an exception: ``` throw n

Re: [PR] SortedSet DV Multi Range query [lucene]

2025-01-22 Thread via GitHub
mkhludnev commented on code in PR #13974: URL: https://github.com/apache/lucene/pull/13974#discussion_r1926069638 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java: ## @@ -0,0 +1,300 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2025-01-22 Thread via GitHub
vigyasharma commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2608302325 > Having a Multi-Reader on all the child log-group directories still won't provide a unified view of all group level segments associated with a Lucene Index. Even now, OpenSearc

[PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-01-22 Thread via GitHub
benwtrent opened a new pull request, #14160: URL: https://github.com/apache/lucene/pull/14160 This is a continuation and completion of the work started by @benchaplin in https://github.com/apache/lucene/pull/14085 The algorithm is fairly simple: - Only score and then explore v

Re: [PR] update privacy policy link [lucene-site]

2025-01-22 Thread via GitHub
rmuir commented on code in PR #77: URL: https://github.com/apache/lucene-site/pull/77#discussion_r1925577292 ## content/pages/privacy.md: ## @@ -1,7 +0,0 @@ -Title: Privacy Policy -URL: privacy.html -save_as: privacy.html -template: lucene/tlp/page - Review Comment: personal

Re: [PR] Add WrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-01-22 Thread via GitHub
jpountz commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2607429171 I don't like that `CompletionAnalyzer` needs to track a thread-local, the point of reuse strategy is to avoid this kind of thing. Also I'm not sure I understand why `CompletionAnalyzer`

Re: [PR] Remove mmap isLoaded check before madvise [lucene]

2025-01-22 Thread via GitHub
jpountz commented on PR #14156: URL: https://github.com/apache/lucene/pull/14156#issuecomment-2607399239 > Seems we just trade an isLoaded for an madvise on systems with enough memory? This is correct. I made this suggestion because it was similar to your initial proposal: skipping t

Re: [PR] Remove mmap isLoaded check before madvise [lucene]

2025-01-22 Thread via GitHub
original-brownbear commented on PR #14156: URL: https://github.com/apache/lucene/pull/14156#issuecomment-2607332400 @jpountz I see. Hmm I wonder how much that saves us? Seems we just trade an `isLoaded` for an `madvise` on systems with enough memory? That said, maybe the `madvise` is far c

Re: [PR] Add WrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-01-22 Thread via GitHub
benwtrent commented on code in PR #14154: URL: https://github.com/apache/lucene/pull/14154#discussion_r1925341296 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionAnalyzer.java: ## @@ -112,6 +116,25 @@ public CompletionAnalyzer( Concatenate

Re: [PR] gh-14127: remove duplicate neighbors when writing HNSW graphs [lucene]

2025-01-22 Thread via GitHub
msokolov merged PR #14157: URL: https://github.com/apache/lucene/pull/14157

Re: [PR] move MultiLeafKnnCollector to decorator and remove unnecessary code [lucene]

2025-01-22 Thread via GitHub
benwtrent merged PR #14147: URL: https://github.com/apache/lucene/pull/14147

Re: [PR] Avoid the use of ImpactsDISI when no minimum competitive score has been set [lucene]

2025-01-22 Thread via GitHub
gmarsay commented on PR #13343: URL: https://github.com/apache/lucene/pull/13343#issuecomment-2607247416 I also noticed a performance issue, maybe related to this topic? I have an index that contains data from a metricbeat agent (1 shard + 1 replica; 18G). When performing a search

Re: [PR] move MultiLeafKnnCollector to decorator and remove unnecessary code [lucene]

2025-01-22 Thread via GitHub
benwtrent commented on code in PR #14147: URL: https://github.com/apache/lucene/pull/14147#discussion_r1925317415 ## lucene/core/src/java/org/apache/lucene/search/knn/MultiLeafKnnCollector.java: ## @@ -77,6 +76,7 @@ public MultiLeafKnnCollector( int interval, Block

Re: [I] Add an optional bandwidth cap to `TieredMergePolicy`? [lucene]

2025-01-22 Thread via GitHub
jpountz commented on issue #14148: URL: https://github.com/apache/lucene/issues/14148#issuecomment-2607239803 Intuitively, I had thought of the "throttle at start" approach, where we would also give `MS` the ability to filter out some merges from `MP` (so that they don't get registered to t

Re: [I] TestManyKnnDocs is broken [lucene]

2025-01-22 Thread via GitHub
benwtrent closed issue #14149: TestManyKnnDocs is broken URL: https://github.com/apache/lucene/issues/14149

Re: [I] Add an optional bandwidth cap to `TieredMergePolicy`? [lucene]

2025-01-22 Thread via GitHub
mikemccand commented on issue #14148: URL: https://github.com/apache/lucene/issues/14148#issuecomment-2607224591 Doing this in `MergeScheduler` (`MS`) is indeed another option. It'd mean you could cap replication bandwidth independent of your `MergePolicy` (`MP`). `MS` could even fine-tun

Re: [PR] Revert TestManyKnnDocs changes from #14084 [lucene]

2025-01-22 Thread via GitHub
benwtrent merged PR #14158: URL: https://github.com/apache/lucene/pull/14158

Re: [PR] Remove mmap isLoaded check before madvise [lucene]

2025-01-22 Thread via GitHub
jpountz commented on PR #14156: URL: https://github.com/apache/lucene/pull/14156#issuecomment-2607163688 Well, you may be right as well that the cost of `MS::isLoaded` is of a similar order of magnitude as `madvise`. What the current logic does is that if you get `MS::isLoaded` to frequentl

Re: [PR] gh-14127: remove duplicate neighbors when writing HNSW graphs [lucene]

2025-01-22 Thread via GitHub
iverase commented on PR #14157: URL: https://github.com/apache/lucene/pull/14157#issuecomment-2607140645 Sounds good to me, @msokolov. I didn't like adding yet another parameter to the search API. Thanks for taking the time to review it.

Re: [PR] Prevent choosing connection nodes that are already neighbours [lucene]

2025-01-22 Thread via GitHub
iverase closed pull request #14159: Prevent choosing connection nodes that are already neighbours URL: https://github.com/apache/lucene/pull/14159

Re: [PR] gh-14127: remove duplicate neighbors when writing HNSW graphs [lucene]

2025-01-22 Thread via GitHub
msokolov commented on PR #14157: URL: https://github.com/apache/lucene/pull/14157#issuecomment-2607133416 @iverase I see what you did there ... that would also solve this problem, but I think it is less desirable since it (1) requires extending the HNSW search API in a way I think we wouldn

Re: [PR] Remove mmap isLoaded check before madvise [lucene]

2025-01-22 Thread via GitHub
original-brownbear commented on PR #14156: URL: https://github.com/apache/lucene/pull/14156#issuecomment-2607056900 @jpountz > was introduced had a benchmark that demonstrated an improvement with the current logic Huh those results are quite unexpected I must admit :) When me

Re: [PR] Improve set deletions percentage javadoc [lucene]

2025-01-22 Thread via GitHub
msokolov merged PR #12828: URL: https://github.com/apache/lucene/pull/12828

Re: [I] Tool to recover data from .fdt files [LUCENE-4706] [lucene]

2025-01-22 Thread via GitHub
msokolov commented on issue #5771: URL: https://github.com/apache/lucene/issues/5771#issuecomment-2607027055 Thanks for pointing that out

Re: [I] Tool to recover data from .fdt files [LUCENE-4706] [lucene]

2025-01-22 Thread via GitHub
msokolov closed issue #5771: Tool to recover data from .fdt files [LUCENE-4706] URL: https://github.com/apache/lucene/issues/5771

Re: [PR] update privacy policy link [lucene-site]

2025-01-22 Thread via GitHub
cpoerschke commented on code in PR #77: URL: https://github.com/apache/lucene-site/pull/77#discussion_r1925064849 ## content/pages/privacy.md: ## @@ -1,7 +0,0 @@ -Title: Privacy Policy -URL: privacy.html -save_as: privacy.html -template: lucene/tlp/page - Review Comment: Alt

[PR] update privacy policy link [lucene-site]

2025-01-22 Thread via GitHub
cpoerschke opened a new pull request, #77: URL: https://github.com/apache/lucene-site/pull/77 The "Apache Project Website Checks" at https://whimsy.apache.org/site/project/lucene identify ``` Privacy | https://lucene.apache.org/privacy.html | URL expected to match regular expr