[jira] [Updated] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-9201:
Attachment: javadocHTML5.png, javadocHTML4.png, javadocGRADLE.png

> Port documentation-lint task to Gradle build
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
> Issue Type: Sub-task
> Affects Versions: master (9.0)
> Reporter: Tomoko Uchida
> Assignee: Tomoko Uchida
> Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
> * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
> * "-documentation-lint" (missing-javadocs / broken-links check by Python scripts)
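For illustration only, a rough sketch of how this wiring might look once ported to Gradle - the task names, script paths, and the split between the ECJ lint and the Python link checker are assumptions, not the actual port:
{code}
// Hypothetical Gradle wiring for the two checks that "documentation-lint" runs in ant.
def ecjJavadocLint = tasks.register("ecjJavadocLint") {
  // run ECJ in linting mode over the sources and fail on javadoc warnings
}

def checkBrokenLinks = tasks.register("checkBrokenLinks", Exec) {
  // reuse the existing python checker from the ant build (path is illustrative)
  commandLine "python3", "dev-tools/scripts/checkJavadocLinks.py", "build/docs"
}

tasks.register("documentationLint") {
  dependsOn ecjJavadocLint, checkBrokenLinks
}
{code}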
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032105#comment-17032105 ]

Robert Muir commented on LUCENE-9201:

I looked more closely at the package.html/overview.html issue. This seems to be purely a gradle issue of not passing all the parameters to "javadocs". I tested 3 cases:
* HTML4 frames output, Java 8
* HTML5 output, Java 11
* gradle (HTML5) output, Java 11

!javadocGRADLE.png! !javadocHTML4.png! !javadocHTML5.png!
[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032105#comment-17032105 ]

Robert Muir edited comment on LUCENE-9201 at 2/7/20 4:51 AM:

I looked more closely at the package.html/overview.html issue. This seems to be purely a gradle issue of not passing all the parameters to "javadocs". I tested 3 cases:
* HTML4 frames output, Java 8 !javadocHTML4.png!
* HTML5 output, Java 11 !javadocHTML5.png!
* gradle (HTML5) output, Java 11 !javadocGRADLE.png!
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032109#comment-17032109 ]

Robert Muir commented on LUCENE-9201:

The missing overview.html is caused by a bug in the defaults-javadoc.gradle code:
{code}
opts.overview = file("src/main/java/overview.html").toString()
{code}
It points to the wrong place for most Lucene modules, which use {{src/java}}, not {{src/main/java}}.
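A minimal sketch of the kind of fix involved - resolving overview.html against each module's actual source directories instead of the hard-coded {{src/main/java}} path. This is an assumption about how such a fix could look, not the committed patch:
{code}
// Hypothetical sketch: look for overview.html in the project's real source
// directories (Lucene modules keep sources under src/java, not src/main/java).
def overviewFile = project.sourceSets.main.java.srcDirs
    .collect { new File(it, "overview.html") }
    .find { it.exists() }
if (overviewFile != null) {
  opts.overview = overviewFile.toString()
}
{code}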
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032110#comment-17032110 ]

Robert Muir commented on LUCENE-9201:

I will push the obvious fix to master, but it would be great to improve the gradle code. I think we need an explicit file-exists check so that the build fails clearly if overview.html is missing. It should be present for any of these artifacts.
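A rough sketch of such a guard in the Gradle script (the path and error message are illustrative assumptions, not actual build code):
{code}
// Hypothetical guard: fail the javadoc task loudly when overview.html is
// missing, instead of silently publishing javadocs without an overview page.
def overview = file("src/java/overview.html")
if (!overview.exists()) {
  throw new GradleException("Missing " + overview + ": every published module must provide an overview.html")
}
opts.overview = overview.toString()
{code}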
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032111#comment-17032111 ]

ASF subversion and git services commented on LUCENE-9201:

Commit a77bb1e6f57ed21d484c3927d710679166918878 in lucene-solr's branch refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a77bb1e ]

LUCENE-9201: add overview.html from correct location to the javadocs in gradle build
[jira] [Commented] (LUCENE-9210) gradle javadocs doesn't incorporate CSS/JS
[ https://issues.apache.org/jira/browse/LUCENE-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032115#comment-17032115 ]

Robert Muir commented on LUCENE-9210:

The syntax highlighting sure makes the code snippets easy on the eyes. The ant build accomplishes this by concatenating additional CSS and JS code directly onto the output files. Maybe there is a less evil way?

> gradle javadocs doesn't incorporate CSS/JS
>
> Key: LUCENE-9210
> URL: https://issues.apache.org/jira/browse/LUCENE-9210
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Major
>
> We add some extra CSS/JS to the javadocs:
> * prettify.css/js (syntax highlighting)
> * a few styles to migrate table cellpadding: LUCENE-9209
> The ant task concatenates this onto the end of the resulting javadoc CSS/JS.
> We should either do the same in the gradle build or remove our reliance on it.
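For context, a rough sketch of how the ant-style concatenation could be mirrored in a Gradle Javadoc task - the file names and layout here are assumptions for illustration, not the project's actual build code:
{code}
// Hypothetical sketch: after javadoc runs, append the extra CSS/JS
// (prettify highlighting, table-padding styles) to the generated files,
// mimicking the ant build's concat step.
tasks.withType(Javadoc).configureEach {
  doLast {
    new File(destinationDir, "stylesheet.css").append(file("prettify/prettify.css").text)
    new File(destinationDir, "script.js").append(file("prettify/prettify.js").text)
  }
}
{code}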
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032119#comment-17032119 ]

Robert Muir commented on LUCENE-9201:

FYI, I also discovered LUCENE-9210 as part of the investigation here; it is another TODO. With ant, the additional custom JS/CSS syntax-highlights the sample code, and some CSS classes are used for the HTML5 transition. It only impacts presentation, so it doesn't cause failures, but it would be good to fix gradle to use these.
[jira] [Commented] (SOLR-14066) Deprecate DIH
[ https://issues.apache.org/jira/browse/SOLR-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031385#comment-17031385 ]

Jan Høydahl commented on SOLR-14066:

[~rohitcse] are you still willing to maintain DIH? What support do you need from the community to get started?

> Deprecate DIH
>
> Key: SOLR-14066
> URL: https://issues.apache.org/jira/browse/SOLR-14066
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Attachments: image-2019-12-14-19-58-39-314.png
> Time Spent: 40m
> Remaining Estimate: 0h
>
> DataImportHandler has outlived its utility. DIH doesn't need to remain inside Solr anymore. The plan is to deprecate DIH in 8.5 and remove it in 9.0, and also to work on handing it off to volunteers in the community (so far, [~rohitcse] has volunteered to maintain it).
[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap
[ https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031412#comment-17031412 ]

ASF subversion and git services commented on LUCENE-9147:

Commit fdf5ade727ea8a5a6232d421a33b3fa1495d93b3 in lucene-solr's branch refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fdf5ade ]

LUCENE-9147: Fix codec excludes.

> Move the stored fields index off-heap
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing that many indices spend most of their memory usage on the stored fields index or the term vectors index, which are much less performance-sensitive than the terms index. We should move them off-heap too?
[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap
[ https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031411#comment-17031411 ]

ASF subversion and git services commented on LUCENE-9147:

Commit 3246b2605869549dfbcedef21ea24d7101c20eee in lucene-solr's branch refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3246b26 ]

LUCENE-9147: Fix codec excludes.
[jira] [Resolved] (LUCENE-9147) Move the stored fields index off-heap
[ https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-9147.
Fix Version/s: 8.5
Resolution: Fixed
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031461#comment-17031461 ]

Xin-Chun Zhang commented on LUCENE-9004:

??You don't share your test code, but I suspect you open new IndexReader every time you issue a query???

[~tomoko] The test code can be found in [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java]. Yes, I opened a new reader for each query in the hope that IVFFlat and HNSW would be compared under fair conditions, since IVFFlat does not have a cache. I now realize it may lead to OOM, so I replaced it with a shared IndexReader and the problem is resolved.

Update -- Top 1 in-set (query vector is in the candidate data set) recall results on the SIFT1M data set ([http://corpus-texmex.irisa.fr/]) for IVFFlat and HNSW are as follows.

IVFFlat (no cache, reuse IndexReader)
||nprobe||avg. search time (ms)||recall percent (%)||
|8|13.3165|64.8|
|16|13.968|79.65|
|32|16.951|89.3|
|64|21.631|95.6|
|128|31.633|98.8|

HNSW (static cache, reuse IndexReader)
||avg. search time (ms)||recall percent (%)||
|6.3|{color:#FF}20.45{color}|

It can readily be seen that HNSW performs much better in query time. But I was surprised that the top 1 in-set recall percent of HNSW is so low. It shouldn't be a problem of the algorithm itself, but more likely a problem of the implementation or test code. I will check it this weekend.

> Approximate nearest vector search
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael Sokolov
> Priority: Major
> Attachments: hnsw_layered_graph.png
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, ie both ways) to the new document, then the graph that emerges will have the desired properties.
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. 
We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to lookup through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is a very natural at search time - we can traverse each > segments' graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from s
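As a side note on the benchmark methodology discussed in the comment above, here is a minimal Java sketch of sharing one IndexReader across all queries rather than opening one per query; the class, method, and loader names are illustrative, not the actual KnnIvfAndGraphPerformTester code:
{code}
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical benchmark skeleton: open the index once and share the reader
// across all queries, instead of opening a new IndexReader per query (which
// skews latency numbers and can run the JVM out of memory).
public class SharedReaderBenchmark {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      for (float[] query : loadSiftQueries(args[1])) {   // hypothetical loader
        long start = System.nanoTime();
        runKnnQuery(searcher, query);                    // hypothetical kNN call
        recordLatency(System.nanoTime() - start);
      }
    }
  }

  // Stubs standing in for the dataset loader, the kNN query, and the stats sink.
  static float[][] loadSiftQueries(String path) { return new float[0][]; }
  static void runKnnQuery(IndexSearcher searcher, float[] query) {}
  static void recordLatency(long nanos) {}
}
{code}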
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028283#comment-17028283 ]

Xin-Chun Zhang edited comment on LUCENE-9004 at 2/6/20 10:52 AM:

??"Is it making life difficult to keep them separate?"??

[~sokolov] No, we can keep them separate at present. I have merged your [branch|https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2] into my personal [github|https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136] in order to compare IVFFlat and HNSW, and I reused some work that [~tomoko] and you did. Code refactoring will be required when we are ready to commit.

??"Have you tried comparing them on real data?"??

[~yurymalkov], [~mikemccand] Thanks for your advice. I haven't done it yet, and will do it soon.

*Update – Feb. 4, 2020*

I have added two performance test tools (KnnIvfPerformTester/KnnIvfAndGraphPerformTester) to my personal branch, and the SIFT1M dataset (1,000,000 base vectors with 128 dimensions, [http://corpus-texmex.irisa.fr/]) is employed for the test. Top 1 recall performance of IVFFlat is as follows; *a new IndexReader was opened for each query*.

centroids=707
||nprobe||avg. search time (ms)||recall percent (%)||
|8|71.314|69.15|
|16|121.7565|82.3|
|32|155.692|92.85|
|64|159.3655|98.7|
|128|217.5205|99.9|

centroids=4000
||nprobe||avg. search time (ms)||recall percent (%)||
|8|56.3745|65.35|
|16|59.5435|78.85|
|32|71.751|89.85|
|64|90.396|96.25|
|128|135.3805|99.3|

Unfortunately, I couldn't obtain the corresponding results for HNSW due to an out-of-memory error on my PC. A special case with 2,000 base vectors demonstrates that IVFFlat is faster and more accurate. HNSW may outperform IVFFlat on larger data sets when more memory is available, as shown in [https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields
markharwood commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-582866319

> And how can indexing and searching get so much faster when compress/decompress is in the path!

I tried benchmarking some straightforward file read and write operations (no Lucene) and couldn't show LZ4 compression being faster (although it wasn't that much slower). Maybe the rate-limited merging in Lucene plays a part and size therefore matters in that context?
[jira] [Updated] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored
[ https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Wislowski updated SOLR-14194:
Attachment: SOLR-14194.patch

> Allow Highlighting to work for indexes with uniqueKey that is not stored
>
> Key: SOLR-14194
> URL: https://issues.apache.org/jira/browse/SOLR-14194
> Project: Solr
> Issue Type: Improvement
> Components: highlighter
> Affects Versions: master (9.0)
> Reporter: Andrzej Wislowski
> Assignee: David Smiley
> Priority: Minor
> Labels: highlighter
> Attachments: SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch
>
> Highlighting requires the uniqueKey to be a stored field. I have changed the Highlighter to allow returning results on indexes whose uniqueKey is not a stored field but is saved as a docValues type.
[jira] [Commented] (LUCENE-9207) Don't build SpanQuery in QueryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031526#comment-17031526 ]

Jim Ferenczi commented on LUCENE-9207:

+1. I agree that the current optimization can help in some cases, but the crazy expansion that can arise on phrase queries of shingles of different sizes should be considered a bug. We already disable graph queries in Elasticsearch if the analyzer contains a filter that is known to produce paths that don't align (shingles of different sizes in the same field), so we could probably add the same mechanism in Solr. I am also less worried by this issue now that we eagerly check the number of paths while building (and throw a "too many boolean clauses" error if the number of paths exceeds the max boolean clause limit).

> Don't build SpanQuery in QueryBuilder
>
> Key: LUCENE-9207
> URL: https://issues.apache.org/jira/browse/LUCENE-9207
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Subtask of LUCENE-9204. QueryBuilder currently has special logic for graph phrase queries with no slop, constructing a SpanQuery that attempts to follow all paths using a combination of OR and NEAR queries. Given the known bugs in this type of query (LUCENE-7398) and that we would like to move span queries out of core in any case, we should remove this logic and just build a disjunction of phrase queries, one phrase per path.
[jira] [Commented] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored
[ https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031528#comment-17031528 ]

Andrzej Wislowski commented on SOLR-14194:

[~dsmiley] I have added an updated patch with the test fix.
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375252133

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##
 @@ -61,11 +66,13 @@
   IndexOutput data, meta;
   final int maxDoc;
+  private SegmentWriteState state;

Review comment: make it final?
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375278323 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } 
-meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); +data.writeVInt(numDocsInCurrentBlock); +for (int i = 0; i < numDocsInCurrentBlock; i++) { + data.writeVInt(docLengths[i]); +} +maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength); +LZ4.compress(block, 0, uncompressedBlockLength, data, ht); +numDocsInCurrentBlock = 0; +uncompressedBlockLength = 0; +maxPointer = data.getFilePointer(); +tempBinaryOffsets.writeVLong(maxPointer - thisBlockStartPointer); + } +} + +void writeMetaData() throws IOException { + if (blo
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375274563 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } Review comment: we usually do this like that instead, which helps avoid catching Throwable ``` boolean success = false; try { // write header } finally { if (success == false) { // close } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375273736 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } 
-meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if(numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); +data.writeVInt(numDocsInCurrentBlock); +for (int i = 0; i < numDocsInCurrentBlock; i++) { Review comment: +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375252907 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; Review comment: ```suggestion byte[] block = new byte [1024 * 16]; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375275497 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } Review comment: it looks like we could set `blockAddressesStart` in the constructor instead? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375252836 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; Review comment: we usually don't let spaces between the type of array elements and `[]` ```suggestion int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375277898 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } 
-meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); +data.writeVInt(numDocsInCurrentBlock); +for (int i = 0; i < numDocsInCurrentBlock; i++) { + data.writeVInt(docLengths[i]); +} +maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength); +LZ4.compress(block, 0, uncompressedBlockLength, data, ht); Review comment: ```suggestion LZ4.compress(block, 0, uncompressedBlockLength, data, ht); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375827346 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +755,107 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +private int uncompressedBlockLength = 0; +private int numDocsInBlock = 0; +private final byte[] uncompressedBlock; +private BytesRef uncompressedBytesRef; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + Review comment: we could initialize uncompressedBytesRef from the uncompressed block: `uncompressedBytesRef = new BytesRef(uncompressedBlock)` and avoid creating new BytesRefs over and over in `decode` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored
[ https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031583#comment-17031583 ] Lucene/Solr QA commented on SOLR-14194: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} Validate ref guide {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 51m 37s{color} | {color:red} core in the patch failed. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 56m 15s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | solr.search.TestTermsQParserPlugin | | | solr.TestDistributedGrouping | | | solr.cloud.HttpPartitionWithTlogReplicasTest | | | solr.handler.component.DistributedSpellCheckComponentTest | | | solr.analysis.PathHierarchyTokenizerFactoryTest | | | solr.update.processor.AtomicUpdatesTest | | | solr.update.PeerSyncTest | | | solr.search.stats.TestLRUStatsCache | | | solr.TestHighlightDedupGrouping | | | solr.handler.component.DistributedDebugComponentTest | | | solr.DisMaxRequestHandlerTest | | | solr.update.PeerSyncWithBufferUpdatesTest | | | solr.handler.component.DistributedFacetPivotSmallTest | | | solr.cloud.BasicZkTest | | | solr.search.function.TestSortByMinMaxFunction | | | solr.search.stats.TestExactStatsCache | | | solr.update.PeerSyncWithLeaderTest | | | solr.search.facet.DistributedFacetSimpleRefinementLongTailTest | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | SOLR-14194 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12992768/SOLR-14194.patch | | Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide | | uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | ant | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh | | git revision | master / fdf5ade727e | | ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 | | Default Java | LTS | | unit | 
https://builds.apache.org/job/PreCommit-SOLR-Build/680/artifact/out/patch-unit-solr_core.txt | | Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/680/testReport/ | | modules | C: solr/core solr/solr-ref-guide U: solr | | Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/680/console | | Powered by | Apache Yetus 0.7.0 http://yetus.apache.org | This message was automatically generated. > Allow Highlighting to work for indexes with uniqueKey that is not stored > > > Key: SOLR-14194 > URL: https://issues.apache.org/jira/browse/SOLR-14194 > Project: Solr > Issue Type: Improvement > Components: highlighter >Affects Versions: master (9.0) >Reporter: Andrzej Wislowski >Assignee: David Smiley >Priority: Minor > Labels: highlighter > Attachments: SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch, > SOLR-14194.patch > > > Highlighting requires uniqueKey to be a stored field. I have changed > Highlighter allow returning results on indexes with uniqueKey that is a not > stored field, but saved as a docvalue type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe,
[GitHub] [lucene-solr] romseygeek merged pull request #1097: LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals
romseygeek merged pull request #1097: LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals URL: https://github.com/apache/lucene-solr/pull/1097 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals
[ https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031632#comment-17031632 ] ASF subversion and git services commented on LUCENE-9099: - Commit 7c1ba1aebeea540b67ae304deee60162baee2e12 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c1ba1a ] LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097) If you have repeating intervals in an ordered or unordered interval source, you currently get somewhat confusing behaviour: * `ORDERED(a, a, b)` will return an extra interval over just a b if it first matches a a b, meaning that you can get incorrect results if used in a `CONTAINING` filter - `CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a b y` * `UNORDERED(a, a)` will match on documents that just containg a single a. This commit adds a RepeatingIntervalsSource that correctly handles repeats within ordered and unordered sources. It also changes the way that gaps are calculated within ordered and unordered sources, by using a new width() method on IntervalIterator. The default implementation just returns end() - start() + 1, but RepeatingIntervalsSource instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering on ordered and unordered sources that contain repeats. In order to correctly handle matches in this scenario, IntervalsSource#matches now always returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds gaps() and width() methods so that submatches can be combined in the same way that subiterators are. Extra checks have been added to checkIntervals() to ensure that the same intervals are returned by both iterator and matches, and a fix to DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between the two methods. > Correctly handle repeats in ordered and unordered intervals > --- > > Key: LUCENE-9099 > URL: https://issues.apache.org/jira/browse/LUCENE-9099 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > If you have repeating intervals in an ordered or unordered interval source, > you currently get somewhat confusing behaviour: > * ORDERED(a, a, b) will return an extra interval over just `a b` if it first > matches `a a b`, meaning that you can get incorrect results if used in a > CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on > the document `a x a b y` > * UNORDERED(a, a) will match on documents that just containg a single `a`. > It is possible to deal with the unordered case when building sources by > rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, > b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks > MAXGAPS filtering. > We should try and fix this within intervals themselves. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
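For readers who don't work with the intervals API day to day, the two problem cases in the commit message are easy to reproduce with the Intervals factory methods. The sketch below assumes the package layout used on current branches (`org.apache.lucene.queries.intervals`); the field name "body" is arbitrary.

```java
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.Query;

public class RepeatsExample {
  public static void main(String[] args) {
    // ORDERED(a, a, b): before this fix, matching "a a b" also emitted a shorter
    // interval over just "a b", which is why the CONTAINING filter below could
    // match the document "a x a b y".
    IntervalsSource orderedRepeat =
        Intervals.ordered(Intervals.term("a"), Intervals.term("a"), Intervals.term("b"));
    IntervalsSource containing =
        Intervals.containing(Intervals.ordered(Intervals.term("x"), Intervals.term("y")), orderedRepeat);

    // UNORDERED(a, a): before this fix, a document containing a single "a" matched.
    IntervalsSource unorderedRepeat = Intervals.unordered(Intervals.term("a"), Intervals.term("a"));

    Query q1 = new IntervalQuery("body", containing);
    Query q2 = new IntervalQuery("body", unorderedRepeat);
    System.out.println(q1 + "\n" + q2);
  }
}
```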
[jira] [Commented] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals
[ https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031648#comment-17031648 ] ASF subversion and git services commented on LUCENE-9099: - Commit aa916bac3c3369a461afa06e384e070657c32973 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=aa916ba ] LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097) If you have repeating intervals in an ordered or unordered interval source, you currently get somewhat confusing behaviour: * `ORDERED(a, a, b)` will return an extra interval over just a b if it first matches a a b, meaning that you can get incorrect results if used in a `CONTAINING` filter - `CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a b y` * `UNORDERED(a, a)` will match on documents that just containg a single a. This commit adds a RepeatingIntervalsSource that correctly handles repeats within ordered and unordered sources. It also changes the way that gaps are calculated within ordered and unordered sources, by using a new width() method on IntervalIterator. The default implementation just returns end() - start() + 1, but RepeatingIntervalsSource instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering on ordered and unordered sources that contain repeats. In order to correctly handle matches in this scenario, IntervalsSource#matches now always returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds gaps() and width() methods so that submatches can be combined in the same way that subiterators are. Extra checks have been added to checkIntervals() to ensure that the same intervals are returned by both iterator and matches, and a fix to DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between the two methods. > Correctly handle repeats in ordered and unordered intervals > --- > > Key: LUCENE-9099 > URL: https://issues.apache.org/jira/browse/LUCENE-9099 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > If you have repeating intervals in an ordered or unordered interval source, > you currently get somewhat confusing behaviour: > * ORDERED(a, a, b) will return an extra interval over just `a b` if it first > matches `a a b`, meaning that you can get incorrect results if used in a > CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on > the document `a x a b y` > * UNORDERED(a, a) will match on documents that just containg a single `a`. > It is possible to deal with the unordered case when building sources by > rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, > b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks > MAXGAPS filtering. > We should try and fix this within intervals themselves. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals
[ https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Woodward resolved LUCENE-9099. --- Fix Version/s: 8.5 Resolution: Fixed > Correctly handle repeats in ordered and unordered intervals > --- > > Key: LUCENE-9099 > URL: https://issues.apache.org/jira/browse/LUCENE-9099 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 8.5 > > Time Spent: 20m > Remaining Estimate: 0h > > If you have repeating intervals in an ordered or unordered interval source, > you currently get somewhat confusing behaviour: > * ORDERED(a, a, b) will return an extra interval over just `a b` if it first > matches `a a b`, meaning that you can get incorrect results if used in a > CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on > the document `a x a b y` > * UNORDERED(a, a) will match on documents that just containg a single `a`. > It is possible to deal with the unordered case when building sources by > rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, > b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks > MAXGAPS filtering. > We should try and fix this within intervals themselves. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields
markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375903347 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } Review comment: I tried that and it didn't work - something else was writing to data in between constructor and addDoc calls This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
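The point being made in this exchange is small but easy to miss: the block writer shares one output with other writers, so a file pointer captured in the constructor can already be stale by the time the first value arrives; capturing it lazily inside addDoc pins it to the real start of this field's data. A stripped-down illustration with hypothetical names (a plain counter stands in for the IndexOutput file pointer):

```java
// A shared "output" whose position other writers may advance between
// construction of the block writer and its first addDoc call.
class SharedOutput {
  private long filePointer = 0;
  long getFilePointer() { return filePointer; }
  void write(int numBytes) { filePointer += numBytes; }
}

class LazyStartBlockWriter {
  private final SharedOutput data;
  private long blockAddressesStart = -1; // -1 means "not captured yet"

  LazyStartBlockWriter(SharedOutput data) {
    this.data = data;
    // Capturing data.getFilePointer() here would be wrong: other fields
    // may still write to 'data' before the first value arrives.
  }

  void addDoc(int valueLength) {
    if (blockAddressesStart < 0) {
      blockAddressesStart = data.getFilePointer(); // now it really is the start of this field's blocks
    }
    data.write(valueLength);
  }

  long start() { return blockAddressesStart; }
}

class LazyStartDemo {
  public static void main(String[] args) {
    SharedOutput data = new SharedOutput();
    LazyStartBlockWriter writer = new LazyStartBlockWriter(data);
    data.write(100);                    // some other field writes in between
    writer.addDoc(10);
    System.out.println(writer.start()); // prints 100, not 0
  }
}
```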
[GitHub] [lucene-solr] mocobeta opened a new pull request #1242: LUCENE-9201: Port documentation-lint task to Gradle build
mocobeta opened a new pull request #1242: LUCENE-9201: Port documentation-lint task to Gradle build URL: https://github.com/apache/lucene-solr/pull/1242 # Description This PR adds an equivalent of "documentation-lint" to the Gradle build. # Solution The `gradle/validation/documentation-lint.gradle` includes - a `documentationLint` task that is supposed to be called from the `precommit` task, - a root project level sub-task `checkBrokenLinks`, - sub-project level sub-tasks `ecjJavadocLint`, `checkMissingJavadocsClass`, and `checkMissingJavadocsMethod`. # Note For now, the Python linters - `checkBrokenLinks`, `checkMissingJavadocsClass` and `checkMissingJavadocsMethod` - will fail because the Gradle-generated Javadocs seem to be slightly different from the Ant-generated ones. e.g.: - Javadoc directory structure: "ant documentation" generates an "analyzers-common" docs dir for the "analysis/common" module, but "gradlew javadoc" generates "analysis/common" for the same module. I think we can adjust the structure, but where is the suitable place to do so? - Package summary: "ant documentation" uses "package.html" as the package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lack a summary description in "package-summary.html" when building Javadocs with Gradle). We might be able to make the Gradle Javadoc task properly handle "package.html" files with some options. Or, should we replace all "package.html" with "package-info.java" at this time? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
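On the package.html question raised in this PR description: the javadoc tool reliably picks up `package-info.java` files, so one low-risk option is the mechanical conversion suggested above. A hypothetical before/after for a single package (the package name is only an example, not one of the failing modules):

```java
// Before: src/java/org/apache/lucene/analysis/foo/package.html
//   <html><body>
//   Analysis components for the Foo language.
//   </body></html>
//
// After: src/java/org/apache/lucene/analysis/foo/package-info.java
// The same text becomes the package summary shown in package-summary.html.

/**
 * Analysis components for the Foo language.
 */
package org.apache.lucene.analysis.foo;
```

Either route should give Gradle's Javadoc task the same package summaries the Ant build gets from package.html; the package-info.java form has the extra advantage of being checked by the compiler.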
[jira] [Commented] (LUCENE-9199) can't build javadocs on java 13+
[ https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031682#comment-17031682 ] ASF subversion and git services commented on LUCENE-9199: - Commit 7f4560c59a71f271058f13b3b30901ca8c233022 in lucene-solr's branch refs/heads/master from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7f4560c ] LUCENE-9199: allow building javadocs on java 13+ > can't build javadocs on java 13+ > > > Key: LUCENE-9199 > URL: https://issues.apache.org/jira/browse/LUCENE-9199 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9199.patch > > > The build tries to pass an option (--no-module-directories) that is no longer > valid: https://bugs.openjdk.java.net/browse/JDK-8215582 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9199) can't build javadocs on java 13+
[ https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9199: Affects Version/s: master (9.0) > can't build javadocs on java 13+ > > > Key: LUCENE-9199 > URL: https://issues.apache.org/jira/browse/LUCENE-9199 > Project: Lucene - Core > Issue Type: Task >Affects Versions: master (9.0) >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9199.patch > > > The build tries to pass an option (--no-module-directories) that is no longer > valid: https://bugs.openjdk.java.net/browse/JDK-8215582 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9199) can't build javadocs on java 13+
[ https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9199: Fix Version/s: master (9.0) > can't build javadocs on java 13+ > > > Key: LUCENE-9199 > URL: https://issues.apache.org/jira/browse/LUCENE-9199 > Project: Lucene - Core > Issue Type: Task >Affects Versions: master (9.0) >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9199.patch > > > The build tries to pass an option (--no-module-directories) that is no longer > valid: https://bugs.openjdk.java.net/browse/JDK-8215582 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9199) can't build javadocs on java 13+
[ https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-9199. - Resolution: Fixed The problem only impacted master branch, not branch_8x. > can't build javadocs on java 13+ > > > Key: LUCENE-9199 > URL: https://issues.apache.org/jira/browse/LUCENE-9199 > Project: Lucene - Core > Issue Type: Task >Affects Versions: master (9.0) >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9199.patch > > > The build tries to pass an option (--no-module-directories) that is no longer > valid: https://bugs.openjdk.java.net/browse/JDK-8215582 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields
markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375914836 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } 
-meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); +data.writeVInt(numDocsInCurrentBlock); +for (int i = 0; i < numDocsInCurrentBlock; i++) { + data.writeVInt(docLengths[i]); +} +maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength); +LZ4.compress(block, 0, uncompressedBlockLength, data, ht); +numDocsInCurrentBlock = 0; +uncompressedBlockLength = 0; +maxPointer = data.getFilePointer(); +tempBinaryOffsets.writeVLong(maxPointer - thisBlockStartPointer); + } +} + +void writeMetaData() throws IOException { + if
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields
markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375922373 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } Review comment: What was the "+1" comment for line 407 about? I've seen encoding elsewhere that have n+1 offsets to record start of each value and the last offset is effectively the end of the last value. In this scenario I'm writing n value lengths. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031702#comment-17031702 ] Tomoko Uchida commented on LUCENE-9201: --- [~erickerickson] I added sub-tasks equivalent to the ant targets. - -check-broken-links (this internally calls {{dev-tools/scripts/checkJavadocLinks.py}}) - -check-missing-javadocs (this internally calls {{dev-tools/scripts/checkJavaDocs.py}} ) And I opened a PR :) [https://github.com/apache/lucene-solr/pull/1242] I think this is almost equivalent to Ant's "documentation-lint", with some notes below. [~erickerickson] [~dweiss] Could you review it? *Note:* For now, the Python linters - {{checkBrokenLinks}}, {{checkMissingJavadocsClass}} and {{checkMissingJavadocsMethod}} - will fail because the Gradle-generated Javadocs seem to be slightly different from the Ant-generated ones. * Javadoc directory structure: "ant documentation" generates an "analyzers-common" docs dir for the "analysis/common" module, but "gradlew javadoc" generates "analysis/common" for the same module. I think we can adjust the structure, but where is the suitable place to do so? * Package summary: "ant documentation" uses "package.html" as the package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lack a summary description in "package-summary.html" when building Javadocs with Gradle). We might be able to make the Gradle Javadoc task properly handle "package.html" files with some options. Or, should we replace all "package.html" with "package-info.java" at this time? After the Gradle-generated Javadoc is fixed, we can return here and complete this sub-task. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375927967 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } Review comment: It was about optimizing for the case that all values have the same length. In that case we could still one bit of the first length to mean that all values have the same length for instance? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
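The suggestion here is presumably to steal one bit of the first length as an "all lengths are equal" flag, so that blocks of fixed-width values store a single length instead of one per document. A small, self-contained sketch of that encoding trick (purely illustrative, not what the PR implements):

```java
import java.util.Arrays;

class SameLengthEncodingSketch {
  /** Encode doc lengths: the low bit of the first entry flags "all lengths equal". */
  static int[] encode(int[] lengths) {
    boolean allEqual = Arrays.stream(lengths).allMatch(l -> l == lengths[0]);
    if (allEqual) {
      return new int[] { (lengths[0] << 1) | 1 };      // one entry covers the whole block
    }
    int[] out = new int[lengths.length];
    out[0] = lengths[0] << 1;                          // flag bit 0: per-doc lengths follow
    System.arraycopy(lengths, 1, out, 1, lengths.length - 1);
    return out;
  }

  static int[] decode(int[] encoded, int numDocs) {
    int[] lengths = new int[numDocs];
    if ((encoded[0] & 1) != 0) {
      Arrays.fill(lengths, encoded[0] >>> 1);          // fixed-width case
    } else {
      lengths[0] = encoded[0] >>> 1;
      System.arraycopy(encoded, 1, lengths, 1, numDocs - 1);
    }
    return lengths;
  }

  public static void main(String[] args) {
    int[] fixed = { 8, 8, 8, 8 };
    int[] variable = { 3, 8, 5, 8 };
    System.out.println(Arrays.toString(decode(encode(fixed), 4)));    // [8, 8, 8, 8]
    System.out.println(Arrays.toString(decode(encode(variable), 4))); // [3, 8, 5, 8]
  }
}
```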
[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap
[ https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031712#comment-17031712 ] ASF subversion and git services commented on LUCENE-9147: - Commit 85dba7356f32da6d577550a6dd6c5e6244556d87 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=85dba73 ] LUCENE-9147: Make sure temporary files get deleted on all code paths. > Move the stored fields index off-heap > - > > Key: LUCENE-9147 > URL: https://issues.apache.org/jira/browse/LUCENE-9147 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.5 > > Time Spent: 40m > Remaining Estimate: 0h > > Now that the terms index is off-heap by default, it's almost embarrassing > that many indices spend most of their memory usage on the stored fields index > or the term vectors index, which are much less performance-sensitive than the > terms index. We should move them off-heap too? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap
[ https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031711#comment-17031711 ] ASF subversion and git services commented on LUCENE-9147: - Commit 6a380798a27e1ce777843a4322afba463e383acc in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6a38079 ] LUCENE-9147: Make sure temporary files get deleted on all code paths. > Move the stored fields index off-heap > - > > Key: LUCENE-9147 > URL: https://issues.apache.org/jira/browse/LUCENE-9147 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.5 > > Time Spent: 40m > Remaining Estimate: 0h > > Now that the terms index is off-heap by default, it's almost embarrassing > that many indices spend most of their memory usage on the stored fields index > or the term vectors index, which are much less performance-sensitive than the > terms index. We should move them off-heap too? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9077: -- Attachment: LUCENE-9077-javadoc-locale-en-US.patch > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9077-javadoc-locale-en-US.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > This task focuses on providing gradle-based build equivalent for Lucene and > Solr (on master branch). See notes below on why this respin is needed. > The code lives on *gradle-master* branch. It is kept with sync with *master*. > Try running the following to see an overview of helper guides concerning > typical workflow, testing and ant-migration helpers: > gradlew :help > A list of items that needs to be added or requires work. If you'd like to > work on any of these, please add your name to the list. Once you have a > patch/ pull request let me (dweiss) know - I'll try to coordinate the merges. > * (/) Apply forbiddenAPIs > * (/) Generate hardware-aware gradle defaults for parallelism (count of > workers and test JVMs). > * (/) Fail the build if --tests filter is applied and no tests execute > during the entire build (this allows for an empty set of filtered tests at > single project level). > * (/) Port other settings and randomizations from common-build.xml > * (/) Configure security policy/ sandboxing for tests. > * (/) test's console output on -Ptests.verbose=true > * (/) add a :helpDeps explanation to how the dependency system works > (palantir plugin, lockfile) and how to retrieve structured information about > current dependencies of a given module (in a tree-like output). > * (/) jar checksums, jar checksum computation and validation. This should be > done without intermediate folders (directly on dependency sets). > * (/) verify min. JVM version and exact gradle version on build startup to > minimize odd build side-effects > * (/) Repro-line for failed tests/ runs. > * (/) add a top-level README note about building with gradle (and the > required JVM). > * (/) add an equivalent of 'validate-source-patterns' > (check-source-patterns.groovy) to precommit. > * (/) add an equivalent of 'rat-sources' to precommit. > * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) > to precommit. > * (/) javadoc compilation > Hard-to-implement stuff already investigated: > * (/) (done) -*Printing console output of failed tests.* There doesn't seem > to be any way to do this in a reasonably efficient way. There are onOutput > listeners but they're slow to operate and solr tests emit *tons* of output so > it's an overkill.- > * (!) (LUCENE-9120) *Tests working with security-debug logs or other > JVM-early log output*. Gradle's test runner works by redirecting Java's > stdout/ syserr so this just won't work. Perhaps we can spin the ant-based > test runner for such corner-cases. > Of lesser importance: > * Add an equivalent of 'documentation-lint" to precommit. > * (/) Do not require files to be committed before running precommit. (staged > files are fine). > * (/) add rendering of javadocs (gradlew javadoc) > * Attach javadocs to maven publications. > * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid > it'll be difficult to run it sensibly because gradle doesn't offer cwd > separation for the forked test runners. 
> * if you diff solr packaged distribution against ant-created distribution > there are minor differences in library versions and some JARs are excluded/ > moved around. I didn't try to force these as everything seems to work (tests, > etc.) – perhaps these differences should be fixed in the ant build instead. > * [EOE] identify and port various "regenerate" tasks from ant builds > (javacc, precompiled automata, etc.) > * Fill in POM details in gradle/defaults-maven.gradle so that they reflect > the previous content better (dependencies aside). > * Add any IDE integration layers that should be added (I use IntelliJ and it > imports the project out of the box, without the need for any special tuning). > * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; > currently XSLT...) > * I didn't bother adding Solr dist/test-framework to packaging (who'd use it > from a binary distribution? > * There is some python execution in check-broken-links and > check-missing-javadocs, not sure if it's been ported > * Nightly-smoke also have some python execution, not sure of the status. > * Precommit doesn't catch unused imports > > *{color:#ff}Note:{color}* this builds on the work
[jira] [Created] (SOLR-14246) Can't create core when server/solr has a file whose name starts with same string as core
arnoldbird created SOLR-14246: - Summary: Can't create core when server/solr has a file whose name starts with same string as core Key: SOLR-14246 URL: https://issues.apache.org/jira/browse/SOLR-14246 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Affects Versions: 6.6.6 Environment: Centos 7 Reporter: arnoldbird If server/solr contains a file named... something-archive.tar.gz ...and you try to create a core named "something," the response is... {{ERROR:Core 'something' already exists!}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031718#comment-17031718 ] Robert Muir commented on LUCENE-9201: - I think gradle provides slightly different options to the javadoc tool than ant does, which creates the problem. For example, the gradle build has only one "linkoffline" entry but the ant build has two. Such small differences could create broken links. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031732#comment-17031732 ] Tomoko Uchida commented on LUCENE-9077: --- I found a JDK Javadoc tool related issue which was fixed on ant build on https://issues.apache.org/jira/browse/LUCENE-8738?focusedCommentId=16822659&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16822659. I attached the same workaround patch [^LUCENE-9077-javadoc-locale-en-US.patch] for graldle build. Will commit it soon. > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9077-javadoc-locale-en-US.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > This task focuses on providing gradle-based build equivalent for Lucene and > Solr (on master branch). See notes below on why this respin is needed. > The code lives on *gradle-master* branch. It is kept with sync with *master*. > Try running the following to see an overview of helper guides concerning > typical workflow, testing and ant-migration helpers: > gradlew :help > A list of items that needs to be added or requires work. If you'd like to > work on any of these, please add your name to the list. Once you have a > patch/ pull request let me (dweiss) know - I'll try to coordinate the merges. > * (/) Apply forbiddenAPIs > * (/) Generate hardware-aware gradle defaults for parallelism (count of > workers and test JVMs). > * (/) Fail the build if --tests filter is applied and no tests execute > during the entire build (this allows for an empty set of filtered tests at > single project level). > * (/) Port other settings and randomizations from common-build.xml > * (/) Configure security policy/ sandboxing for tests. > * (/) test's console output on -Ptests.verbose=true > * (/) add a :helpDeps explanation to how the dependency system works > (palantir plugin, lockfile) and how to retrieve structured information about > current dependencies of a given module (in a tree-like output). > * (/) jar checksums, jar checksum computation and validation. This should be > done without intermediate folders (directly on dependency sets). > * (/) verify min. JVM version and exact gradle version on build startup to > minimize odd build side-effects > * (/) Repro-line for failed tests/ runs. > * (/) add a top-level README note about building with gradle (and the > required JVM). > * (/) add an equivalent of 'validate-source-patterns' > (check-source-patterns.groovy) to precommit. > * (/) add an equivalent of 'rat-sources' to precommit. > * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) > to precommit. > * (/) javadoc compilation > Hard-to-implement stuff already investigated: > * (/) (done) -*Printing console output of failed tests.* There doesn't seem > to be any way to do this in a reasonably efficient way. There are onOutput > listeners but they're slow to operate and solr tests emit *tons* of output so > it's an overkill.- > * (!) (LUCENE-9120) *Tests working with security-debug logs or other > JVM-early log output*. Gradle's test runner works by redirecting Java's > stdout/ syserr so this just won't work. Perhaps we can spin the ant-based > test runner for such corner-cases. > Of lesser importance: > * Add an equivalent of 'documentation-lint" to precommit. 
> * (/) Do not require files to be committed before running precommit. (staged > files are fine). > * (/) add rendering of javadocs (gradlew javadoc) > * Attach javadocs to maven publications. > * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid > it'll be difficult to run it sensibly because gradle doesn't offer cwd > separation for the forked test runners. > * if you diff solr packaged distribution against ant-created distribution > there are minor differences in library versions and some JARs are excluded/ > moved around. I didn't try to force these as everything seems to work (tests, > etc.) – perhaps these differences should be fixed in the ant build instead. > * [EOE] identify and port various "regenerate" tasks from ant builds > (javacc, precompiled automata, etc.) > * Fill in POM details in gradle/defaults-maven.gradle so that they reflect > the previous content better (dependencies aside). > * Add any IDE integration layers that should be added (I use IntelliJ and it > imports the project out of the box, without the need for any special tuning). > * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; > currently XSLT...) > * I didn't bother adding Solr dist/test-
[jira] [Commented] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031739#comment-17031739 ] ASF subversion and git services commented on LUCENE-9077: - Commit f3cd1dbde36d8fd85bd2e87dcfaffc8b03eec87c in lucene-solr's branch refs/heads/master from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f3cd1db ] LUCENE-9077: Force locale en_US on Javadoc task (workaroud for JDK-8222793) > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9077-javadoc-locale-en-US.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > This task focuses on providing gradle-based build equivalent for Lucene and > Solr (on master branch). See notes below on why this respin is needed. > The code lives on *gradle-master* branch. It is kept with sync with *master*. > Try running the following to see an overview of helper guides concerning > typical workflow, testing and ant-migration helpers: > gradlew :help > A list of items that needs to be added or requires work. If you'd like to > work on any of these, please add your name to the list. Once you have a > patch/ pull request let me (dweiss) know - I'll try to coordinate the merges. > * (/) Apply forbiddenAPIs > * (/) Generate hardware-aware gradle defaults for parallelism (count of > workers and test JVMs). > * (/) Fail the build if --tests filter is applied and no tests execute > during the entire build (this allows for an empty set of filtered tests at > single project level). > * (/) Port other settings and randomizations from common-build.xml > * (/) Configure security policy/ sandboxing for tests. > * (/) test's console output on -Ptests.verbose=true > * (/) add a :helpDeps explanation to how the dependency system works > (palantir plugin, lockfile) and how to retrieve structured information about > current dependencies of a given module (in a tree-like output). > * (/) jar checksums, jar checksum computation and validation. This should be > done without intermediate folders (directly on dependency sets). > * (/) verify min. JVM version and exact gradle version on build startup to > minimize odd build side-effects > * (/) Repro-line for failed tests/ runs. > * (/) add a top-level README note about building with gradle (and the > required JVM). > * (/) add an equivalent of 'validate-source-patterns' > (check-source-patterns.groovy) to precommit. > * (/) add an equivalent of 'rat-sources' to precommit. > * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) > to precommit. > * (/) javadoc compilation > Hard-to-implement stuff already investigated: > * (/) (done) -*Printing console output of failed tests.* There doesn't seem > to be any way to do this in a reasonably efficient way. There are onOutput > listeners but they're slow to operate and solr tests emit *tons* of output so > it's an overkill.- > * (!) (LUCENE-9120) *Tests working with security-debug logs or other > JVM-early log output*. Gradle's test runner works by redirecting Java's > stdout/ syserr so this just won't work. Perhaps we can spin the ant-based > test runner for such corner-cases. > Of lesser importance: > * Add an equivalent of 'documentation-lint" to precommit. > * (/) Do not require files to be committed before running precommit. (staged > files are fine). 
> * (/) add rendering of javadocs (gradlew javadoc) > * Attach javadocs to maven publications. > * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid > it'll be difficult to run it sensibly because gradle doesn't offer cwd > separation for the forked test runners. > * if you diff solr packaged distribution against ant-created distribution > there are minor differences in library versions and some JARs are excluded/ > moved around. I didn't try to force these as everything seems to work (tests, > etc.) – perhaps these differences should be fixed in the ant build instead. > * [EOE] identify and port various "regenerate" tasks from ant builds > (javacc, precompiled automata, etc.) > * Fill in POM details in gradle/defaults-maven.gradle so that they reflect > the previous content better (dependencies aside). > * Add any IDE integration layers that should be added (I use IntelliJ and it > imports the project out of the box, without the need for any special tuning). > * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; > currently XSLT...) > * I didn't bother adding Solr dist/test-framework to packaging (who'd use it > from a binary distribution? >
[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031756#comment-17031756 ] Munendra S N commented on SOLR-11725: - Thanks [~hossman] for the review. I will share an updated patch with a CHANGES entry including upgrade notes. {quote}Wait... the conversation from 2017 wasn't resolved? What do we want to do about stddev of singleton sets? Solr currently returns 0.0, and Hoss seemed to think this was the right behavior. But the patch here would seem to change the behavior to return NaN (but I didn't test it...). After a quick glance, it doesn't look like existing tests cover this case either?{quote} Thanks [~ysee...@gmail.com] for the review. There are no tests to cover the singleton case, and I'm not sure whether a sample size of 0 is covered either. I think changing the current behavior for the singleton case should be taken up in a separate issue, as it concerns both the classical stats and the JSON aggregations. This patch doesn't change the current behavior; I will add tests to cover these cases. > json.facet's stddev() function should be changed to use the "Corrected sample > stddev" formula > - > > Key: SOLR-11725 > URL: https://issues.apache.org/jira/browse/SOLR-11725 > Project: Solr > Issue Type: Sub-task > Components: Facet Module >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-11725.patch, SOLR-11725.patch > > > While working on some equivalence tests/demonstrations for > {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} > calculations done between the two code paths can be measurably different, and > realized this is due to them using very different code... > * {{json.facet=foo:stddev(foo)}} > ** {{StddevAgg.java}} > ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}} > * {{stats.field=\{!stddev=true\}foo}} > ** {{StatsValuesFactory.java}} > ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - > 1.0D)))}} > Since I'm not really a math guy, I consulted with a bunch of smart math/stat > nerds I know online to help me sanity check whether these equations (somehow) > reduced to each other (in which case the discrepancies I was seeing in my > results might have just been due to the order of intermediate operation > execution & floating point rounding differences). > They confirmed that the two bits of code are _not_ equivalent to each other, > and explained that the code JSON Faceting is using is equivalent to the > "Uncorrected sample stddev" formula, while StatsComponent's code is > equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation > When I told them that stuff like this is why no one likes mathematicians and > pressed them to explain which one was the "most canonical" (or "most > generally applicable" or "best") definition of stddev, I was told that: > # This is something statisticians frequently disagree on > # Practically speaking, the two calculations don't tend to > differ significantly when count is "very large" > # _"Corrected sample stddev" is more appropriate when comparing two > distributions_ > Given that: > * the primary usage of computing the stddev of a field/function against a > Solr result set (or against a sub-set of results defined by a facet > constraint) is probably to compare that distribution to a different Solr > result set (or to compare N sub-sets of results defined by N facet > constraints) > * the size of the sets of documents (values) can be relatively small when > computing stats over facet constraint sub-sets > ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected > sample stddev" equation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
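To make the difference concrete, here is a small self-contained sketch (not taken from either patch) that evaluates the two expressions quoted above on the same values; the variable names mirror the snippets, everything else is illustrative. It also shows the singleton-set corner case discussed in the comment: the uncorrected formula returns 0.0 while the corrected one divides by (count - 1) = 0 and returns NaN.

{code:java}
// Standalone illustration of the two stddev expressions quoted in SOLR-11725 (not from the patch).
public class StddevFormulas {

  // "Uncorrected sample stddev" (the expression currently used by StddevAgg.java)
  static double uncorrected(double[] values) {
    double sum = 0, sumSq = 0;
    for (double v : values) { sum += v; sumSq += v * v; }
    double count = values.length;
    return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
  }

  // "Corrected sample stddev" (the expression used by StatsValuesFactory.java)
  static double corrected(double[] values) {
    double sum = 0, sumOfSquares = 0;
    for (double v : values) { sum += v; sumOfSquares += v * v; }
    double count = values.length;
    return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
  }

  public static void main(String[] args) {
    double[] small = {2.0, 4.0, 6.0};
    System.out.println(uncorrected(small)); // ~1.633 (divides by count)
    System.out.println(corrected(small));   // 2.0    (divides by count - 1)

    double[] singleton = {5.0};
    System.out.println(uncorrected(singleton)); // 0.0
    System.out.println(corrected(singleton));   // NaN (0.0 / 0.0)
  }
}
{code}

As the output suggests, the gap between the two formulas shrinks as count grows, which matches the "very large" remark above; it is the small facet-bucket case where the choice matters.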
[jira] [Commented] (SOLR-14007) Difference response format for percentile aggregation
[ https://issues.apache.org/jira/browse/SOLR-14007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031758#comment-17031758 ] Munendra S N commented on SOLR-14007: - [~ysee...@gmail.com] WDYT? > Difference response format for percentile aggregation > - > > Key: SOLR-14007 > URL: https://issues.apache.org/jira/browse/SOLR-14007 > Project: Solr > Issue Type: Sub-task > Components: Facet Module >Reporter: Munendra S N >Priority: Major > > For percentile, > In Stats component, the response format for percentile is {{NamedList}} but > in JSON facet, the format is either array or single value depending on number > of percentiles specified. > Even if JSON percentile doesn't use NamedList, response format shouldn't > change based on number of percentiles -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031765#comment-17031765 ] Dawid Weiss commented on LUCENE-9201: - Please leave a patch or pull request. I will review and provide feedback but no earlier than Sunday/Monday. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields
markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375974370 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } 
-meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if(numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); +data.writeVInt(numDocsInCurrentBlock); +for (int i = 0; i < numDocsInCurrentBlock; i++) { Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@
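For readers skimming the diff above, the essential write-side pattern is: buffer each document's bytes into an in-memory block, remember the per-document lengths, and compress the whole block once it holds a fixed number of documents. The sketch below only illustrates that pattern; it is not the patch, and it substitutes java.util.zip.Deflater for Lucene's internal LZ4 and prints instead of writing through an IndexOutput, just to stay self-contained.

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

// Illustrative only: the real change writes to IndexOutput and uses Lucene's LZ4 classes.
public class BlockCompressionSketch {
  static final int DOCS_PER_BLOCK = 32; // stand-in for BINARY_DOCS_PER_COMPRESSED_BLOCK

  private byte[] block = new byte[16 * 1024];
  private int blockLength = 0;
  private final int[] docLengths = new int[DOCS_PER_BLOCK];
  private int docsInBlock = 0;

  void addDoc(byte[] value) {
    if (blockLength + value.length > block.length) {
      block = Arrays.copyOf(block, Math.max(block.length * 2, blockLength + value.length));
    }
    docLengths[docsInBlock++] = value.length;
    System.arraycopy(value, 0, block, blockLength, value.length);
    blockLength += value.length;
    if (docsInBlock == DOCS_PER_BLOCK) {
      flush(); // compress and reset once the block is full
    }
  }

  void flush() {
    if (docsInBlock == 0) {
      return;
    }
    Deflater deflater = new Deflater();
    deflater.setInput(block, 0, blockLength);
    deflater.finish();
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      compressed.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    // A real implementation would now write docsInBlock, the docLengths prefix, and the
    // compressed bytes to the data file, and record the block's start pointer for addressing.
    System.out.println(docsInBlock + " docs, " + blockLength + " raw -> " + compressed.size() + " compressed bytes");
    docsInBlock = 0;
    blockLength = 0;
  }
}
{code}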
[jira] [Commented] (SOLR-12930) Add developer documentation to source repo
[ https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031831#comment-17031831 ] Cassandra Targett commented on SOLR-12930: -- Sorry [~jpountz]. You're correct, they should not be included in binary artifacts. I don't know how to exclude them in either the Ant or Gradle builds, though. The only thing I really know how to do here would be to revert the whole thing and someone else could take a stab at doing this some other time. > Add developer documentation to source repo > -- > > Key: SOLR-12930 > URL: https://issues.apache.org/jira/browse/SOLR-12930 > Project: Solr > Issue Type: Improvement > Components: Tests >Reporter: Mark Miller >Priority: Major > Attachments: solr-dev-docs.zip > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob edited a comment on issue #1217: SOLR-14223 PublicKeyHandler consumes a lot of entropy during tests
madrob edited a comment on issue #1217: SOLR-14223 PublicKeyHandler consumes a lot of entropy during tests URL: https://github.com/apache/lucene-solr/pull/1217#issuecomment-582087251 Wired this up so that we can get the keys loaded from disk - this setup seems to work for tests in `core` but not in `solrj` or `contrib` modules because they have different test sources? Should I copy the keys to each module, or is there a more elegant way to handle that? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14223) PublicKeyHandler consumes a lot of entropy during tests
[ https://issues.apache.org/jira/browse/SOLR-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031847#comment-17031847 ] Mike Drob commented on SOLR-14223: -- Based on [~rcmuir]'s suggestions, we can read a key from disk instead of generating a new one each time in our tests - this has the advantage of skipping the entropy consumption and also the expensive primality testing. PR is ready for final review if anybody is interested in taking a look. > PublicKeyHandler consumes a lot of entropy during tests > --- > > Key: SOLR-14223 > URL: https://issues.apache.org/jira/browse/SOLR-14223 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 7.4, 8.0 >Reporter: Mike Drob >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > After the changes in SOLR-12354 to eagerly create a {{PublicKeyHandler}} for > the CoreContainer, the creation of the underlying {{RSAKeyPair}} uses > {{SecureRandom}} to generate primes. This eats up a lot of system entropy and > can slow down tests significantly (I observed it adding 10s to an individual > test). > Similar to what we do for SSL config for tests, we can swap in a non blocking > implementation of SecureRandom for the key pair generation to allow multiple > tests to run better in parallel. Primality testing with BigInteger is also > slow, so I'm not sure how much total speedup we can get here, maybe it's > worth checking if there are faster implementations out there in other > libraries. > In production cases, this also blocks creation of all cores. We should only > create the Handler if necessary, i.e. if the existing authn/z tell us that > they won't support internode requests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
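The idea described above can be pictured with a small sketch: load a pre-generated key pair from files bundled with the tests rather than generating one with SecureRandom on every run. The class, method names, and file layout here are invented for illustration; they are not the ones used in the PR.

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyFactory;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.security.spec.X509EncodedKeySpec;

// Hypothetical helper: reads DER-encoded RSA keys checked into test resources, avoiding
// key generation (and the entropy/primality-testing cost) in every test JVM.
public class TestKeyLoader {

  public static PublicKey readPublicKey(Path derFile) throws Exception {
    byte[] encoded = Files.readAllBytes(derFile);
    return KeyFactory.getInstance("RSA").generatePublic(new X509EncodedKeySpec(encoded));
  }

  public static PrivateKey readPrivateKey(Path derFile) throws Exception {
    byte[] encoded = Files.readAllBytes(derFile);
    return KeyFactory.getInstance("RSA").generatePrivate(new PKCS8EncodedKeySpec(encoded));
  }
}
{code}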
[jira] [Updated] (SOLR-14219) OverseerSolrResponse's serialVersionUID has changed
[ https://issues.apache.org/jira/browse/SOLR-14219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomas Eduardo Fernandez Lobbe updated SOLR-14219: - Resolution: Fixed Status: Resolved (was: Patch Available) > OverseerSolrResponse's serialVersionUID has changed > --- > > Key: SOLR-14219 > URL: https://issues.apache.org/jira/browse/SOLR-14219 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 8.5 >Reporter: Andy Webb >Assignee: Tomas Eduardo Fernandez Lobbe >Priority: Major > Fix For: 8.5 > > Time Spent: 3h > Remaining Estimate: 0h > > When the {{useUnsafeOverseerResponse=true}} option introduced in SOLR-14095 > is used, the serialized OverseerSolrResponse has a different serialVersionUID > to earlier versions, making it backwards-incompatible. > https://github.com/apache/lucene-solr/pull/1210 forces the serialVersionUID > to its old value, so old and new nodes become compatible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031858#comment-17031858 ] Robert Muir commented on LUCENE-9201: - [~tomoko] has a pull request for this issue already: https://github.com/apache/lucene-solr/pull/1242 I will do some investigation too. Maybe the actual javadocs/pythonscripts side can be cleaned up to make this less painful. For example it is not good that some python checks only can parse old javadocs format that has been removed since java 13. So even with the current ant build, things are not in great shape. Ideally java's doclint would be leaned on more (e.g. enable html check, remove jtidy crap). It will make builds faster and reduce maintenance. Gotta at least try :) > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
Robert Muir created LUCENE-9209: --- Summary: fix javadocs to be html5, enable doclint html checks, remove jtidy Key: LUCENE-9209 URL: https://issues.apache.org/jira/browse/LUCENE-9209 Project: Lucene - Core Issue Type: Task Reporter: Robert Muir Currently doclint is very angry about all the {{<tt>}} elements and similar stuff going on. We claim to be emitting html5 documentation, so it is about time to clean it up. Then the html check can simply be enabled and we can remove the jtidy stuff completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031876#comment-17031876 ] Robert Muir commented on LUCENE-9209: - I'm working on some of the common issues (such as tt tag -> code tag, table summary attribute -> caption element, etc) > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob merged pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
madrob merged pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
[ https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031889#comment-17031889 ] ASF subversion and git services commented on LUCENE-9142: - Commit abd282d258d23d19b7f7c1e96332a19fa7b7b827 in lucene-solr's branch refs/heads/master from Mike [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=abd282d ] LUCENE-9142 Refactor IntSet operations for determinize (#1184) * LUCENE-9142 Refactor SortedIntSet for equality Split SortedIntSet into a class heirarchy to make comparisons to FrozenIntSet more meaningful. Use Arrays.equals for more efficient comparison. Add tests for IntSet to verify correctness. > Add documentation to Operations.determinize, SortedIntSet, and FrozenSet > > > Key: LUCENE-9142 > URL: https://issues.apache.org/jira/browse/LUCENE-9142 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Mike Drob >Priority: Major > Time Spent: 4h 10m > Remaining Estimate: 0h > > Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out > that we have mismatched types when trying to reuse states, and so we may be > creating more states than we need to. > Relevant snippets: > {code:title=Operations.java} > Map newstate = new HashMap<>(); > final SortedIntSet statesSet = new SortedIntSet(5); > Integer q = newstate.get(statesSet); > {code} > {{q}} is always going to be null in this path because there are no > SortedIntSet keys in the map. > There are also very little javadoc on SortedIntSet, so I'm having trouble > following the precise relationship between all the pieces here. > cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have > them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
[ https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Drob resolved LUCENE-9142. --- Fix Version/s: master (9.0) Assignee: Mike Drob Resolution: Fixed > Add documentation to Operations.determinize, SortedIntSet, and FrozenSet > > > Key: LUCENE-9142 > URL: https://issues.apache.org/jira/browse/LUCENE-9142 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Fix For: master (9.0) > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out > that we have mismatched types when trying to reuse states, and so we may be > creating more states than we need to. > Relevant snippets: > {code:title=Operations.java} > Map newstate = new HashMap<>(); > final SortedIntSet statesSet = new SortedIntSet(5); > Integer q = newstate.get(statesSet); > {code} > {{q}} is always going to be null in this path because there are no > SortedIntSet keys in the map. > There are also very little javadoc on SortedIntSet, so I'm having trouble > following the precise relationship between all the pieces here. > cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have > them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14247) IndexSizeTriggerMixedBoundsTest does a lot of sleeping
Mike Drob created SOLR-14247: Summary: IndexSizeTriggerMixedBoundsTest does a lot of sleeping Key: SOLR-14247 URL: https://issues.apache.org/jira/browse/SOLR-14247 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: Tests Reporter: Mike Drob When I run tests locally, the slowest reported test is always IndexSizeTriggerMixedBoundsTest coming in at around 2 minutes. I took a look at the code and discovered that at least 80s of that is all sleeps! There might need to be more synchronization and ordering added back in, but when I removed all of the sleeps the test still passed locally for me, so I'm not too sure what the point was or why we were slowing the system down so much. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
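If some ordering does need to come back, a common replacement for fixed sleeps is polling for the expected condition with a bounded timeout, so the test only waits as long as it has to. A generic sketch, not taken from the test in question:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Generic wait-for-condition helper: returns as soon as the condition holds and only
// fails after the timeout, instead of always sleeping a fixed amount of time.
public final class WaitFor {
  public static void waitFor(BooleanSupplier condition, long timeout, TimeUnit unit)
      throws InterruptedException {
    long deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() > deadlineNanos) {
        throw new AssertionError("condition not met within " + timeout + " " + unit);
      }
      Thread.sleep(100); // short poll interval, bounded by the deadline above
    }
  }
}
{code}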
[jira] [Resolved] (SOLR-14162) TestInjection can leak Timer objects
[ https://issues.apache.org/jira/browse/SOLR-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Drob resolved SOLR-14162. -- Fix Version/s: master (9.0) Resolution: Fixed > TestInjection can leak Timer objects > > > Key: SOLR-14162 > URL: https://issues.apache.org/jira/browse/SOLR-14162 > Project: Solr > Issue Type: Bug > Components: Tests >Reporter: Mike Drob >Priority: Minor > Fix For: master (9.0) > > Time Spent: 20m > Remaining Estimate: 0h > > In TestInjection we track all of the outstanding timers for shutdown but try > to clean up based on the TimerTask instead of the Timer itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032021#comment-17032021 ] Erick Erickson commented on LUCENE-9209: Are you including Solr subtree? If not, I can grab that part. > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031461#comment-17031461 ] Xin-Chun Zhang edited comment on LUCENE-9004 at 2/7/20 1:46 AM: ??You don't share your test code, but I suspect you open new IndexReader every time you issue a query??? [~tomoko] The test code can be found in [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java]. Yes, I opened a new reader for each query in hope that IVFFlat and HNSW are compared in a fair condition since IVFFlat do not have cache. I now realize it may lead to OOM, hence replacing with a shared IndexReader and the problem resolved. Update – Top 1 in-set (query vector is in the candidate data set) recall results on SIFT1M data set ([http://corpus-texmex.irisa.fr/]) of IVFFlat and HNSW are as follows, IVFFlat (no cache, reuse IndexReader) ||nprobe||avg. search time (ms)||recall percent (%)|| |8|13.3165|64.8| |16|13.968|79.65| |32|16.951|89.3| |64|21.631|95.6| |128|31.633|98.8| HNSW (static cache, reuse IndexReader) ||avg. search time (ms)||recall percent (%)|| |6.3|{color:#ff}20.45{color}| It can readily be shown that HNSW performs much better in query time. But I was surprised that top 1 in-set recall percent of HNSW is so low. It shouldn't be a problem of algorithm itself, but more likely a problem of implementation or test code. I will check it this weekend. was (Author: irvingzhang): ??You don't share your test code, but I suspect you open new IndexReader every time you issue a query??? [~tomoko] The test code can be found in [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java]. Yes, I opened a new reader for each query in hope that IVFFlat and HNSW are compared in a fair condition since IVFFlat do not have cache. I now realize it may lead to OOM, hence replacing with a shard IndexReader and the problem resolved. Update -- Top 1 in-set (query vector is in the candidate data set) recall results on SIFT1M data set ([http://corpus-texmex.irisa.fr/]) of IVFFlat and HNSW are as follows, IVFFlat (no cache, reuse IndexReader) ||nprobe||avg. search time (ms)||recall percent (%)|| |8|13.3165|64.8| |16|13.968|79.65| |32|16.951|89.3| |64|21.631|95.6| |128|31.633|98.8| HNSW (static cache, reuse IndexReader) ||avg. search time (ms)||recall percent (%)|| |6.3|{color:#FF}20.45{color}| It can readily be shown that HNSW performs much better in query time. But I was surprised that top 1 in-set recall percent of HNSW is so low. It shouldn't be a problem of algorithm itself, but more likely a problem of implementation or test code. I will check it this weekend. > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 3h 10m > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. 
Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The i
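The fix mentioned in the comment above (a shared IndexReader instead of one opened per query) amounts to the following pattern; the index path and the query loop are placeholders, not the actual benchmark code:

{code:java}
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Open the reader once and reuse it (and any per-reader caches) for every benchmark query,
// rather than opening a new DirectoryReader per query, which wastes memory and skews timings.
public class SharedReaderBenchmark {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/sift1m/index"));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      for (int i = 0; i < 1000; i++) {
        // run the i-th kNN query against the same searcher here
      }
    }
  }
}
{code}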
[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032061#comment-17032061 ] Robert Muir commented on LUCENE-9209: - Yes, currently I am fixing solr parts too (so far, but not yet sure what I am walking into! So I may need help, lemme see if I can get lucene-core working first). I look at each violation in lucene-core and then fix it across the entire source tree. For example, the table "summary" attribute is complained about. I look up https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table and see that it says: {quote} This attribute defines an alternative text that summarizes the content of the table. Use the <caption> element instead. {quote} So then I fix all occurrences of the table "summary" attribute in the whole source tree by transforming it into a caption, like this:
{noformat}
- * <table summary="comparison of dictionary and hyphenation based decompounding">
+ * <table>
+ *  <caption>comparison of dictionary and hyphenation based decompounding</caption>
{noformat}
I'll upload a patch with my current state. I still have 16 violations in lucene-core. > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Currently doclint is very angry about all the {{<tt>}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9209: Attachment: LUCENE-9209_current_state.patch > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9209_current_state.patch > > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9210) gradle javadocs doesn't incorporate CSS/JS
Robert Muir created LUCENE-9210: --- Summary: gradle javadocs doesn't incorporate CSS/JS Key: LUCENE-9210 URL: https://issues.apache.org/jira/browse/LUCENE-9210 Project: Lucene - Core Issue Type: Task Reporter: Robert Muir We add some stuff to the javadoc css/js: * Prettify.css/js (syntax highlighting) * a few styles to migrate table cellpadding: LUCENE-9209 The ant task concatenates this stuff to the end of the resulting javadocs css/js. We should either do this in the gradle build as well or remove our reliance on this stuff. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9209: Attachment: LUCENE-9209.patch > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch > > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032077#comment-17032077 ] Robert Muir commented on LUCENE-9209: - patch attached: javadocs succeeds for all of lucene and solr. The html doclint option is enabled for gradle and ant, and all jtidy stuff is removed. I plan to commit this soon as it is about as boring as it gets, but will conflict with everything. > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch > > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032088#comment-17032088 ] ASF subversion and git services commented on LUCENE-9209: - Commit 0d339043e378d8333c376bae89411b813de25b10 in lucene-solr's branch refs/heads/master from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0d33904 ] LUCENE-9209: fix javadocs to be html5, enable doclint html checks, remove jtidy Current javadocs declare an HTML5 doctype: !DOCTYPE HTML. Some HTML5 features are used, but unfortunately also some constructs that do not exist in HTML5 are used as well. Because of this, we have no checking of any html syntax. jtidy is disabled because it works with html4. doclint is disabled because it works with html5. our docs are neither. javadoc "doclint" feature can efficiently check that the html isn't crazy. we just have to fix really ancient removed/deprecated stuff (such as use of tt tag). This enables the html checking in both ant and gradle. The docs are fixed via straightforward transformations. One exception is table cellpadding, for this some helper CSS classes were added to make the transition easier (since it must apply padding to inner th/td, not possible inline). I added TODOs, we should clean this up. Most problems look like they may have been generated from a GUI or similar and not a human. > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch > > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy
[ https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032092#comment-17032092 ] ASF subversion and git services commented on LUCENE-9209: - Commit 860115e4502175934bb1d7ae90f8bda65c464bb9 in lucene-solr's branch refs/heads/master from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=860115e ] LUCENE-9209: revert changes to test html file, not intended > fix javadocs to be html5, enable doclint html checks, remove jtidy > -- > > Key: LUCENE-9209 > URL: https://issues.apache.org/jira/browse/LUCENE-9209 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch > > > Currently doclint is very angry about all the {{}} elements and similar > stuff going on. We claim to be emitting html5 documentation so it is about > time to clean it up. > Then the html check can simply be enabled and we can remove the jtidy stuff > completely. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032100#comment-17032100 ] Robert Muir commented on LUCENE-9201: - {quote} Package summary: "ant documentation" uses "package.html" as package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lacks summary description in "package-summary.html" when building javadocs by Gradle). We might be able to make Gradle Javadoc task to properly handle "package.html" files with some options. Or, should we replace all "package.html" with "package-info.java" at this time? {quote} It is a stupid complex issue. The problem here also exists in the ant build. The underlying issue is that java 8 produced HTML4 by default. Java 9+ is doing HTML5. Java 9+ is generating HTML5 even though its manual page implies that HTML4 is still the default. This kind of shit makes our build too complicated. Possibly in branch 8.x we can simply pass "-html4" option to javadoc processor so that it always generates html4 output, even if you happen to use java 9,10,11,12,13,etc to invoke it. Currently if you use java 9+ on this branch, javadocs are messed up, you get no overview.html, etc. Forcing html4 for this branch seems like the best way. I will investigate for sure. On the other hand with master branch things are easier: java 11 is a minimum requirement, so let's not fight the defaults (which is HTML5). Maybe I am wrong here, if we change our minds we can just revert my commits and wire in HTML4. I fixed the syntax issues already in LUCENE-9209. We must get the overviews working again, so I am not sure if {{package-info.java}} is the best solution (it addresses package.html, but what about overview.html?) Sorry for the long answer, but at least it is no mystery. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
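For reference, the package-info.java alternative mentioned above looks like this; the package name and text are invented. It covers the per-package summary that package.html used to provide, but, as noted in the comment, it says nothing about overview.html.

{code:java}
// Hypothetical example: a package-info.java standing in for package.html. The first sentence
// becomes the summary shown in package-summary.html; overview.html is a separate concern.

/**
 * Example package-level documentation written as package-info.java instead of package.html.
 *
 * <p>Longer description goes here, using HTML5-clean javadoc markup.</p>
 */
package org.apache.lucene.example;
{code}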
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032103#comment-17032103 ] Robert Muir commented on LUCENE-9201: - With java 13+, the {{-html4}} option is removed. So we are forced to adopt html5. I feel a little better :) Fixing branch_8x is hopeless: java 8 can only generate html4 and java13 can only generate html5. So now we should focus on fixing package.html/overview.html in the master branch. I will look into it a bit. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org