[GitHub] [lucene] jebnix commented on issue #11870: Create a Markdown based documentation
jebnix commented on issue #11870: URL: https://github.com/apache/lucene/issues/11870#issuecomment-1372225809 @dweiss But currently, it's very hard and unintuitive to learn Lucene as a new user. Most libraries these days have a Docusaurus-like engine that generates a clean, intuitive website where users can find all the beginner-to-intermediate information they need about using the library, all in one unified place. That also makes it much easier for future contributors to find the docs. Currently, the docs are spread all over the Lucene code base. That's nice when you dig in, but it makes it really hard for new users to find out where the docs are. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)
benwtrent commented on PR #12064: URL: https://github.com/apache/lucene/pull/12064#issuecomment-1372294789 Digging into it more, removing `AbstractVectorValues` will add a fair bit of extra code to the KnnVectorWriters and testing (though testing is a lesser concern, I suppose). My thoughts on keeping it are that eventually we will want to add support for binary vectors (to be used specifically with hamming distance) and half-float (or float16; admittedly, this one may wait until the JVM has better float16 support). I am not sure there are other vector encodings we will want to support, but I can see Lucene supporting at least these 4 (including our byte & float32) eventually. There is already a fair bit of duplication. If the prevailing opinion is to completely remove `AbstractVectorValues` and make the writers handle individual vector encodings (instead of relying on the underlying BytesRef), I will comply. What say you @rmuir && @jpountz ?
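For readers unfamiliar with the metric mentioned above, here is a minimal sketch of hamming distance over bit-packed binary vectors. It is an illustration only: the class and method names are hypothetical and are not part of Lucene or of this PR, and it assumes each vector is stored as a `byte[]` with eight dimensions per byte.

```java
/** Hypothetical illustration; not Lucene code. */
final class HammingSketch {

  /**
   * Counts the bit positions in which two equal-length, bit-packed vectors differ.
   * Assumption: each byte holds eight binary dimensions.
   */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // XOR leaves a 1 wherever the bits differ; mask to 8 bits to undo sign extension.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  public static void main(String[] args) {
    byte[] x = {0b1010, (byte) 0xFF};
    byte[] y = {0b0110, 0};
    System.out.println(hammingDistance(x, y));
  }
}
```

A production implementation would likely compare several bytes at a time (for example with `Long.bitCount`), but the per-byte loop keeps the idea visible.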
[GitHub] [lucene] dweiss commented on issue #11870: Create a Markdown based documentation
dweiss commented on issue #11870: URL: https://github.com/apache/lucene/issues/11870#issuecomment-1372325036 What I meant is that this documentation should really go into the modules/classes where it belongs and can be updated/maintained together with the code. I honestly don't believe a separately written manual will be kept in sync with the code. I am with you on many libraries having excellent documentation - it'd be great to have it. The truth is, it's a huge effort that not many people will have the time for (or the interest in doing, compared to writing new features or tinkering with the code). Sorry to sound so pessimistic - you're welcome to do anything you like, of course - that's the beauty of open source. Also, perhaps ChatGPT can emit this automatically in a few months if you point it at the source code?... :)
[GitHub] [lucene] jmazanec15 commented on issue #11354: Reuse HNSW graphs when merging segments? [LUCENE-10318]
jmazanec15 commented on issue #11354: URL: https://github.com/apache/lucene/issues/11354#issuecomment-1372817284 @msokolov here are the results re-using a single index for each experiment. Overall, there is still some variability, but it seems like there is less. For the 10K results, it appears that control performed better; however, its recall is slightly worse.

10K

| Exper. | Time to merge (ms) | QPS | Recall | Size vec (MB) | Size vem (KB) | Size vex (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| Control 1 | 696096 | 684 | 0.979 | 512.0001 | 70.172 | 60.62953 |
| Control 2 | 695400 | 724 | 0.979 | 512.0001 | 70.172 | 60.62953 |
| Control 3 | 710602 | 699 | 0.979 | 512.0001 | 70.172 | 60.62953 |
| Test 1 | 736711 | 649 | 0.98 | 512.0001 | 70.129 | 60.62525 |
| Test 2 | 742799 | 751 | 0.98 | 512.0001 | 70.129 | 60.62525 |
| Test 3 | 742263 | 746 | 0.98 | 512.0001 | 70.129 | 60.62525 |

100K

| Exper. | Time to merge (ms) | QPS | Recall | Size vec (MB) | Size vem (KB) | Size vex (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| Control 1 | 714349 | 689 | 0.981 | 512.0001 | 70.172 | 60.44963 |
| Control 2 | 703428 | 763 | 0.981 | 512.0001 | 70.172 | 60.44963 |
| Control 3 | 721943 | 666 | 0.981 | 512.0001 | 70.172 | 60.44963 |
| Test 1 | 669922 | 729 | 0.981 | 512.0001 | 70.26 | 60.45246 |
| Test 2 | 682579 | 729 | 0.981 | 512.0001 | 70.26 | 60.45246 |
| Test 3 | 659374 | 724 | 0.981 | 512.0001 | 70.26 | 60.45246 |

500K

| Exper. | Time to merge (ms) | QPS | Recall | Size vec (MB) | Size vem (KB) | Size vex (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| Control 1 | 674606 | 751 | 0.98 | 512.0001 | 70.172 | 59.69535 |
| Control 2 | 657207 | 699 | 0.98 | 512.0001 | 70.172 | 59.69535 |
| Control 3 | 664536 | 694 | 0.98 | 512.0001 | 70.172 | 59.69535 |
| Test 1 | 381532 | 793 | 0.98 | 512.0001 | 70.256 | 59.69746 |
| Test 2 | 371540 | 793 | 0.98 | 512.0001 | 70.256 | 59.69746 |
| Test 3 | 382440 | 800 | 0.98 | 512.0001 | 70.256 | 59.69746 |
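To make the merge-time comparison easier to read, the following sketch averages the three runs per configuration and reports the relative change of test versus control. The merge times are copied from the tables in this comment; the class and method names are hypothetical, and the 10K/100K/500K labels are used as given, without assuming what they measure.

```java
/** Hypothetical helper for summarizing the merge times reported above. */
final class MergeTimeSummary {

  /** Arithmetic mean of the given run times in milliseconds. */
  static double mean(long[] values) {
    long sum = 0;
    for (long v : values) {
      sum += v;
    }
    return (double) sum / values.length;
  }

  /** Relative change of the mean test time versus the mean control time (negative means faster). */
  static double relativeChange(long[] control, long[] test) {
    return mean(test) / mean(control) - 1.0;
  }

  public static void main(String[] args) {
    String[] labels = {"10K", "100K", "500K"};
    long[][] control = {
      {696096, 695400, 710602},
      {714349, 703428, 721943},
      {674606, 657207, 664536}
    };
    long[][] test = {
      {736711, 742799, 742263},
      {669922, 682579, 659374},
      {381532, 371540, 382440}
    };
    for (int i = 0; i < labels.length; i++) {
      System.out.printf(
          "%s: control %.0f ms, test %.0f ms, change %+.1f%%%n",
          labels[i], mean(control[i]), mean(test[i]), 100 * relativeChange(control[i], test[i]));
    }
  }
}
```

On these numbers the mean merge time is roughly 6% higher for test at 10K, roughly 6% lower at 100K, and roughly 43% lower at 500K. With only three runs per configuration, the smaller differences may be within run-to-run noise.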
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1062966074

## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java: ##

@@ -94,36 +93,83 @@ public int size() {
 }
 /**
- * Add node on the given level
+ * Add node on the given level. Nodes can be inserted out of order, but it requires that the nodes

Review Comment:

> but still because in L156 we need to copy the rest of array again and again as long as that is a non-appending action

Right, this could be expensive for out-of-order insertion. I can try switching the nodeByLevel int array to a TreeSet and compare performance to https://github.com/apache/lucene/issues/11354. One complication with this approach is that the NodesIterator expects an int array: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java#L134. Given that this is a public interface, we might need to either convert the TreeSet to an int array every time [getNodesOnLevel](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L165) gets called, or alter the NodesIterator interface to support both an int array and an Iterator produced from the TreeSet. @zhaih What do you think of this approach? Is there a better way to do this?
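The trade-off discussed above can be sketched as follows. This is a standalone illustration, not the code in `OnHeapHnswGraph`: the classes `ArrayLevel` and `TreeLevel` are hypothetical stand-ins for a per-level node list, assuming only that the nodes must be kept sorted and that a consumer needs them as an `int[]`.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical illustration of the two per-level storage options; not Lucene code. */
final class LevelNodesSketch {

  /** Sorted int array: appending is cheap, but an out-of-order insert shifts the tail. */
  static final class ArrayLevel {
    private int[] nodes = new int[4];
    private int size;

    void add(int node) {
      int pos = Arrays.binarySearch(nodes, 0, size, node);
      if (pos >= 0) {
        return; // already present
      }
      pos = -pos - 1; // insertion point that keeps the array sorted
      if (size == nodes.length) {
        nodes = Arrays.copyOf(nodes, size * 2);
      }
      // O(n) shift; this is a no-op only when the node is appended at the end.
      System.arraycopy(nodes, pos, nodes, pos + 1, size - pos);
      nodes[pos] = node;
      size++;
    }

    int[] toArray() {
      return Arrays.copyOf(nodes, size);
    }
  }

  /** TreeSet: O(log n) insert in any order, but producing an int[] needs a full copy. */
  static final class TreeLevel {
    private final TreeSet<Integer> nodes = new TreeSet<>();

    void add(int node) {
      nodes.add(node);
    }

    int[] toArray() {
      int[] out = new int[nodes.size()];
      int i = 0;
      for (int n : nodes) {
        out[i++] = n;
      }
      return out;
    }
  }
}
```

Both variants yield the same sorted array; they differ in where the cost is paid. The array pays on every out-of-order insert, while the `TreeSet` pays on every conversion (plus boxing overhead), which is why the frequency of calls that need an `int[]` matters for the comparison.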
[GitHub] [lucene] rmuir commented on issue #11870: Create a Markdown based documentation
rmuir commented on issue #11870: URL: https://github.com/apache/lucene/issues/11870#issuecomment-1373144282 There is some markdown processed in this way for the release: https://lucene.apache.org/core/9_4_0/ Source code is here: https://github.com/apache/lucene/blob/main/lucene/documentation/src/markdown/index.template.md I tend to agree that the long package summaries might be better as markdown; these package summaries don't necessarily get a lot of visibility via tools like IDEs. The same goes for the module overviews such as https://lucene.apache.org/core/9_4_2/core/index.html . Both of these tend to be the places with the more verbose explanations. But I also agree with some of Dawid's thoughts too.

* If these summaries/overview docs are no longer javadoc but instead markdown, it would be better to allow these to be organized per-module rather than having everything in `lucene/documentation`.
* A little concerned about navigation: having the content in javadocs makes this easy: just click "Package" or "Overview". If we have both markdown and javadocs, I don't know what it would feel like when browsing through it.
* Maintenance is a serious concern. One thing that really helps is that we run some serious javadocs linting and a broken-link detector across all of our docs. It helps fail the build if things are out of date. We'd at least want to make sure we still do broken-links detection across any markdown.
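As a rough idea of what broken-link detection over markdown could look like, here is a minimal sketch. It is not Lucene's existing link checker: the class name, the regex, and the notion of a caller-supplied set of known targets are all assumptions made for illustration, and it only handles inline links of the form `[text](target)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of the Lucene build. */
final class MarkdownLinkCheckSketch {

  // Matches inline markdown links such as [text](target); reference-style links are not handled.
  private static final Pattern INLINE_LINK = Pattern.compile("\\[[^\\]]*\\]\\(([^)\\s]+)\\)");

  /**
   * Returns the relative link targets in the markdown text that are missing from
   * existingTargets. External http(s) links are skipped rather than fetched.
   */
  static List<String> brokenLinks(String markdown, Set<String> existingTargets) {
    List<String> broken = new ArrayList<>();
    Matcher m = INLINE_LINK.matcher(markdown);
    while (m.find()) {
      String target = m.group(1);
      boolean external = target.startsWith("http://") || target.startsWith("https://");
      if (!external && !existingTargets.contains(target)) {
        broken.add(target);
      }
    }
    return broken;
  }
}
```

A build task could collect the set of generated files, run a check like this over every markdown source, and fail if the returned list is non-empty.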
[GitHub] [lucene] rmuir commented on issue #11870: Create a Markdown based documentation
rmuir commented on issue #11870: URL: https://github.com/apache/lucene/issues/11870#issuecomment-1373155758 The various overview.html's might even be the easiest ones to think about how markdown could work, rather than package summaries. These are currently maintained as html files, and passed to the javadoc command with `-overview file.html`. Maybe they could be maintained as README.md files instead that get preprocessed to overview.html? I like the idea that browsing lucene/core/src/java just in github would then show the overview automatically, but... all the links to any classes are gonna be broken without javadoc support, so I'm not sure of the value we get from markdown over just keeping them as html. Plus the additional .md->.html indirection would add some complexity over the current files. But I guess it might possibly be easier to contribute to?