[GitHub] [lucene] donnerpeter merged pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-11-08 Thread GitBox
donnerpeter merged PR #11893: URL: https://github.com/apache/lucene/pull/11893 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene

[GitHub] [lucene] rmuir commented on pull request #11906: Add monster test for many knn docs

2022-11-08 Thread GitBox
rmuir commented on PR #11906: URL: https://github.com/apache/lucene/pull/11906#issuecomment-1308119525 current test still doesn't fail. checkIndex just calls nextDoc() on low-level vectors but we may need to invoke skipping to find the issue. That's my theory at least. one thing miss

[GitHub] [lucene-site] sebbASF opened a new issue, #72: Add issue tracker for website

2022-11-08 Thread GitBox
sebbASF opened a new issue, #72: URL: https://github.com/apache/lucene-site/issues/72 It would be helpful to have a link to this issue tracker from the website. Perhaps under 'Editing this site'?

[GitHub] [lucene-site] uschindler merged pull request #70: Enable issues for website

2022-11-08 Thread GitBox
uschindler merged PR #70: URL: https://github.com/apache/lucene-site/pull/70

[GitHub] [lucene-site] uschindler merged pull request #71: Fix github page title

2022-11-08 Thread GitBox
uschindler merged PR #71: URL: https://github.com/apache/lucene-site/pull/71

[GitHub] [lucene] jdconrad commented on pull request #11906: Add monster test for many knn docs

2022-11-08 Thread GitBox
jdconrad commented on PR #11906: URL: https://github.com/apache/lucene/pull/11906#issuecomment-1308006136 Just as confirmation I'm seeing `FixedBitSet.clear` taking up a lot of time as well when running this test. ``` "Lucene Merge Thread #0" #18 daemon prio=5 os_prio=0 cpu=347309.

[GitHub] [lucene-site] sebbASF opened a new pull request, #71: Fix github page title

2022-11-08 Thread GitBox
sebbASF opened a new pull request, #71: URL: https://github.com/apache/lucene-site/pull/71 Github repo currently says "Apache Lucene and Solr web site"

[GitHub] [lucene] jmazanec15 commented on issue #11354: Reuse HNSW graphs when merging segments? [LUCENE-10318]

2022-11-08 Thread GitBox
jmazanec15 commented on issue #11354: URL: https://github.com/apache/lucene/issues/11354#issuecomment-1307862709 Hi @mayya-sharipova @jtibshirani @msokolov I figured out the issue in the previous tests with the recall - I was not using the copy of the vectors when recomputing the dis

[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
benwtrent commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307833883 > It is a little crazy that this index has a 2.5GB .vex file that, if I run zip, deflates 98% down to 75MB. Very wasteful. Agreed :). Once this stuff is solved, I hope to further i

[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307821765 With the 20M docs it still didn't fail. I have the index saved so I can play around; maybe CheckIndex doesn't trigger what is needed here (e.g. advance vs. next). It is a little crazy

[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307760820 @jdconrad helped with some math that may explain why previous tests didn't fail: ``` jshell> int M = 16; M ==> 16 jshell> long v1 = (1 + (M*2)) * 4 * 16268814; v1 ==> 2147
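
The jshell arithmetic quoted above can be reproduced with a short, self-contained sketch. It assumes only the expression shown in the snippet, `(1 + (M*2)) * 4 * count`; the class and method names are hypothetical, and this is not Lucene's actual offset code. It illustrates that the expression is evaluated in 32-bit `int` arithmetic even when the result is assigned to a `long`, so widening one operand is what keeps it from wrapping.

```java
// Illustrative sketch only: names are hypothetical, not taken from Lucene.
final class OffsetOverflowDemo {

    // All operands are int, so the product is computed in 32 bits and may
    // wrap around before it is widened to long on return.
    static long offsetWithIntMath(int m, int count) {
        return (1 + (m * 2)) * 4 * count;
    }

    // Making one operand a long (4L) forces 64-bit arithmetic from that
    // point on, so the product cannot wrap for these magnitudes.
    static long offsetWithLongMath(int m, int count) {
        return (1 + (m * 2)) * 4L * count;
    }

    public static void main(String[] args) {
        int m = 16;
        // 132 * 16268814 = 2147483448, which still fits in an int.
        System.out.println(offsetWithIntMath(m, 16268814));
        // 132 * 16268816 = 2147483712, which exceeds Integer.MAX_VALUE
        // (2147483647), so the int version wraps to -2147483584.
        System.out.println(offsetWithIntMath(m, 16268816));
        // The long version yields the true product, 2147483712.
        System.out.println(offsetWithLongMath(m, 16268816));
    }
}
```

With `M = 16`, the value used in the quoted snippet stays just under `Integer.MAX_VALUE`, which is consistent with the comment's suggestion that earlier tests were too small to trigger the wrap.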

[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307756832 Yes, if such a test works it may at least prevent similar regressions. Another possible idea is to give every vector a value of 0, then zip up the index; it should be ~16MB of zeros w

[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
benwtrent commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307744298 @rmuir Thinking outside the box! I will try that. It would definitely cause the graph offset calculation to be completely blown out of proportion! Which is the cause of this overflow.

[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307727467 > * In Lucene 9.2+, the bug appears when there are `16268814` (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment. If this is correct we should just be able to create

[GitHub] [lucene] gsmiller merged pull request #11881: Further optimize DrillSideways scoring

2022-11-08 Thread GitBox
gsmiller merged PR #11881: URL: https://github.com/apache/lucene/pull/11881

[GitHub] [lucene] jpountz commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.

2022-11-08 Thread GitBox
jpountz commented on code in PR #11900: URL: https://github.com/apache/lucene/pull/11900#discussion_r1016950141 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java: ## @@ -46,7 +46,9 @@ public class FuzzySet implements Accountable { public static final in

[GitHub] [lucene] jpountz commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]

2022-11-08 Thread GitBox
jpountz commented on issue #11676: URL: https://github.com/apache/lucene/issues/11676#issuecomment-1307598183 Sorry for the confusion, I was thinking of not relying on any timing info **at all** besides the one that is already encapsulated by the `QueryTimeout` object. Just relying on the f

[GitHub] [lucene] gsmiller commented on a diff in pull request #11881: Further optimize DrillSideways scoring

2022-11-08 Thread GitBox
gsmiller commented on code in PR #11881: URL: https://github.com/apache/lucene/pull/11881#discussion_r1016914939 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java: ## @@ -166,89 +160,158 @@ public int score(LeafCollector collector, Bits acceptDocs, int m

[GitHub] [lucene] rmuir commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]

2022-11-08 Thread GitBox
rmuir commented on issue #11676: URL: https://github.com/apache/lucene/issues/11676#issuecomment-1307533642 It is worth it. nobody wants to debug test failures that happen because NTP skewed the clock.

[GitHub] [lucene] jpountz commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]

2022-11-08 Thread GitBox
jpountz commented on issue #11676: URL: https://github.com/apache/lucene/issues/11676#issuecomment-1307486547 I wonder if the complexity introduced by the nanotime trick is worth the benefits, but I'm happy to discuss it over a PR. In my opinion only exceeding the configured allowed timeout

[GitHub] [lucene] rmuir commented on pull request #11906: Add monster test for many knn docs

2022-11-08 Thread GitBox
rmuir commented on PR #11906: URL: https://github.com/apache/lucene/pull/11906#issuecomment-1307432615 I looked into why the test is taking an eternity to run: the super slow merge at the end is spending all its time clearing bitsets! Looks like the wrong data structure... ``` java.la

[GitHub] [lucene] iverase merged pull request #11907: Fix latent casting bug in BKDWriter

2022-11-08 Thread GitBox
iverase merged PR #11907: URL: https://github.com/apache/lucene/pull/11907

[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307289709 > Yeah, we can probably trigger this overflow by using 16268815 byte vectors of few dimensions. Something as small as 2 dimensions could work. > One issue with HNSW is that completel

[GitHub] [lucene] iverase commented on pull request #11907: Fix latent casting bug in BKDWriter

2022-11-08 Thread GitBox
iverase commented on PR #11907: URL: https://github.com/apache/lucene/pull/11907#issuecomment-1307282609 Actually, I think there are more occurrences of this multiplication without check, could we add it? for example: https://github.com/apache/lucene/blob/3210a42f0958e395930d2259e155a7149fb

[GitHub] [lucene] benwtrent commented on pull request #11907: Fix latent casting bug in BKDWriter

2022-11-08 Thread GitBox
benwtrent commented on PR #11907: URL: https://github.com/apache/lucene/pull/11907#issuecomment-1307246646 @iverase you might be interested in this.

[GitHub] [lucene] benwtrent opened a new pull request, #11907: Fix latent casting bug in BKDWriter

2022-11-08 Thread GitBox
benwtrent opened a new pull request, #11907: URL: https://github.com/apache/lucene/pull/11907 This commit fixes a latent casting bug where int multiplication could roll-over to the negatives. `new byte[Math.toIntExact(numSplits * config.bytesPerDim)];` `toIntExact` does nothin
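
The casting pitfall described above can be shown in isolation. The following is a minimal sketch, assuming both factors are plain `int` values as the description implies; the class and method names are hypothetical, and the two checked variants are common ways to guard such a product, not necessarily the change made in the PR. Because `int * int` is evaluated in 32 bits, the product has already wrapped by the time `Math.toIntExact` sees it, so the check can never fire.

```java
// Illustrative sketch only: names are hypothetical, not taken from BKDWriter.
final class ToIntExactDemo {

    // The multiplication happens in int and may wrap; the wrapped int is then
    // widened to long and trivially passes toIntExact's range check.
    static int uncheckedSize(int numSplits, int bytesPerDim) {
        return Math.toIntExact(numSplits * bytesPerDim);
    }

    // Math.multiplyExact(int, int) throws ArithmeticException on overflow.
    static int checkedSize(int numSplits, int bytesPerDim) {
        return Math.multiplyExact(numSplits, bytesPerDim);
    }

    // Alternative: widen first so the product is computed in 64 bits, then
    // let toIntExact reject anything outside the int range.
    static int checkedViaLong(int numSplits, int bytesPerDim) {
        return Math.toIntExact((long) numSplits * bytesPerDim);
    }

    public static void main(String[] args) {
        // 65536 * 32768 = 2^31, one past Integer.MAX_VALUE.
        System.out.println(uncheckedSize(65536, 32768)); // wraps to -2147483648
        try {
            checkedSize(65536, 32768);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

A negative size passed to `new byte[...]` would surface as a `NegativeArraySizeException`, but a product that wraps back into the positive range would silently allocate an array of the wrong length, which is why an explicit overflow check is the safer choice.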

[GitHub] [lucene] dweiss commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
dweiss commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307230679 There's a whole bunch of automated checks you could go through, selectively, and try to enable them for the future. This includes IntLongMath, which is currently off. https://gith

[GitHub] [lucene] rmuir commented on pull request #11852: Luke Webapp

2022-11-08 Thread GitBox
rmuir commented on PR #11852: URL: https://github.com/apache/lucene/pull/11852#issuecomment-1307222420 > Re: JS frameworks - I recognize my position is from Ludd, and it might be untenable. If it gets out of hand we can always add something like jQuery, but we can never remove, so let's sta

[GitHub] [lucene] rmuir commented on pull request #11852: Luke Webapp

2022-11-08 Thread GitBox
rmuir commented on PR #11852: URL: https://github.com/apache/lucene/pull/11852#issuecomment-1307217205 > I'm late to the party. Do we really want to have/maintain a web application under Lucene? An HTTP server would not be sufficient to develop a stateful web app, you need to write an app

[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-08 Thread GitBox
benwtrent commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1307208338 > We have to start building up tests for these cases because this seems like deja vu as far as int overflows in this area. I am right there with ya @rmuir. 100% feels like "whack

[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-08 Thread GitBox
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1307164489 To be clear, are you referring to the extra memory used by the deduplication hashmap for the duration of the deserialisation, that will then be eligible for GC after the method returns?

[GitHub] [lucene] rmuir commented on pull request #11906: Add monster test for many knn docs

2022-11-08 Thread GitBox
rmuir commented on PR #11906: URL: https://github.com/apache/lucene/pull/11906#issuecomment-1307150684 I bumped the RAM and restarted the test. But it is really broken that I can flush out all the docs with a 512MB heap, but need many, many gigabytes to merge them together. And it's only 16 m

[GitHub] [lucene] rmuir commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-08 Thread GitBox
rmuir commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1307115506 yes because it would translate as a leak for many other use-cases/applications.

[GitHub] [lucene] scampi commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-11-08 Thread GitBox
scampi commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1307096355 I was involved in a [previous issue](https://issues.apache.org/jira/browse/LUCENE-10449) that is related to this one. The problem was a drop of performance when scanning `SortedSetD

[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-08 Thread GitBox
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1307070854 Unfortunately that doesn't seem to have much of an impact, from what I can see here. @rmuir Would you be against having a string cache specifically in the relevant methods in Fiel

[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

2022-11-08 Thread GitBox
thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1307057810 Unfortunately that doesn't seem to have much of an effect - same number after a GC, with the option turned on or off