[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314734#comment-17314734 ] Robert Muir commented on LUCENE-9855: - the "strategy" is a huge antipattern. lets split into separate codecs so that ?Format has a real api. right now its too difficult to improve the implementation (and there are massive memory inefficiencies) or even provide different options. too much stuff tangled into one format. its so bad that we cant even name it. > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314794#comment-17314794 ] Tomoko Uchida commented on LUCENE-9855: --- {quote}right now its too difficult to improve the implementation (and there are massive memory inefficiencies) or even provide different options. too much stuff tangled into one format. its so bad that we cant even name it. {quote} I didn't intend to discuss implementations here, only to deal with the naming issue, though I understand that the current state of the vector search (or hnsw) is not perfect. It sounds to me as if we can neither name it nor ship it with 9.0. If so, I'm afraid the discussion seems to go beyond this issue (and me); should we return to LUCENE-9004 or LUCENE-9322 to address the fundamental question?
[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features
[ https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314853#comment-17314853 ] Michael McCandless commented on LUCENE-9902: {quote}I wonder if we need a specific 8.9 CHANGES.txt entry for this change so that it gets picked up in the 8.9 release? {quote} What we typically do is, in {{main}} branch, put a {{CHANGES.txt}} entry under the {{8.9.0}} release section (not in the {{9.0}} section!). And then backport. This way there is only one entry for each change, appearing under the earliest release that first got that feature/fix. > Update faceting API to use modern Java features > --- > > Key: LUCENE-9902 > URL: https://issues.apache.org/jira/browse/LUCENE-9902 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gautam Worah >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > I was using the {{public int getOrdinal(String dim, String[] path)}} API for > a single {{path}} String and found myself creating an array with a single > element. We can start using variable length args for this method. > I also propose this change: > I wanted to know the specific count of an ordinal using the > {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It > would be good if we could change it to {{protected}} so that users can know > the value of an ordinal without looking up the {{FacetLabel}} and then > checking its value.
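The varargs proposal in the description above can be illustrated with a minimal sketch. This is not the Lucene faceting class: {{VarargsOrdinalSketch}}, its in-memory map, and the sample labels are all hypothetical stand-ins, used only to show that a {{String...}} parameter accepts both a single path component and an existing {{String[]}}.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a taxonomy-backed lookup; not the Lucene class. */
public class VarargsOrdinalSketch {
  // Made-up label-to-ordinal table, for illustration only.
  private final Map<String, Integer> ordinals = new HashMap<>();

  public VarargsOrdinalSketch() {
    ordinals.put("Author", 1);
    ordinals.put("Author/Lisa", 2);
    ordinals.put("Publish Date/2010/10", 3);
  }

  /** Varargs form: callers pass zero or more path components directly. */
  public int getOrdinal(String dim, String... path) {
    StringBuilder key = new StringBuilder(dim);
    for (String component : path) {
      key.append('/').append(component);
    }
    // -1 signals "unknown label" in this sketch.
    return ordinals.getOrDefault(key.toString(), -1);
  }

  public static void main(String[] args) {
    VarargsOrdinalSketch lookup = new VarargsOrdinalSketch();
    // Single component: no one-element array needed at the call site.
    System.out.println(lookup.getOrdinal("Author", "Lisa"));
    // Existing array-based callers still compile unchanged.
    System.out.println(lookup.getOrdinal("Publish Date", new String[] {"2010", "10"}));
  }
}
```

Because a {{String...}} parameter is compiled as a {{String[]}}, callers that already pass an array stay source compatible, which is presumably why the change is low risk.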
[jira] [Commented] (LUCENE-9903) Search is not working while migrating Lucene 3.6.2 to 8.7.0
[ https://issues.apache.org/jira/browse/LUCENE-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314858#comment-17314858 ] Michael McCandless commented on LUCENE-9903: Lots of things changed between Lucene 3.6.2 and 8.7.0! But this is not yet issue-worthy, until we can isolate it to a specific problem. So, could you instead send an email to the java users list ({{java-u...@lucene.apache.org}}) and include more details? Maybe make a tiny example program (or real unit test) showing the issue? > Search is not working while migrating Lucene 3.6.2 to 8.7.0 > --- > > Key: LUCENE-9903 > URL: https://issues.apache.org/jira/browse/LUCENE-9903 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 8.7 >Reporter: lakshman >Priority: Major > > We are upgrading Lucene from version 3.6.2 to 8.7.0 and are > facing a search issue: once a request reaches the search method, no response > comes back, and the execution flow is blocked in the search method.
> public void collect(IndexSearcher is) throws CorruptIndexException, IOException {
> TopFieldCollector collector = TopFieldCollector.create(reverse ? reverseSort : sort, numr, val);
> is.setSimilarity(new ClassicSimilarity());
> is.search(query, numr);
> }
[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314865#comment-17314865 ] Michael McCandless commented on LUCENE-9850: Impressive results! And I love the flame charts that now come builtin with JMC/JFR in the JDK! I would expect {{XTerm}} to show speedups since this is largely dominated by decoding many postings blocks. But it is odd to see the {{XTermYSort}} tasks negatively impacted: those tasks are just sorting by a {{DocValues}} field instead of default text relevance (BM25). Net/net, given the impressive improvement in compression and the speedup in raw decode of postings blocks, I think this is worth doing? These results are {{wikimediumall}} right? If you re-run but with only the tasks that regressed above, do they still show regression? I'm wondering if hotspot compilation noise is contributing ... > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: apply_exceptions.png, bulk_read_1.png, bulk_read_2.png, > for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way.
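The size trade-off between FOR and PFOR discussed above can be sketched with a simplified model. This is not Lucene's {{ForUtil}} or {{PForUtil}}: the class {{PforSizeSketch}}, the 16-bit cost per exception, the rule for choosing the bit width, and the sample block are all assumptions made for illustration. FOR packs every value in a block at the width of the largest value; PFOR packs at a narrower width and stores the few values that overflow it as separately patched exceptions.

```java
import java.util.Arrays;

/** Simplified size model for FOR vs PFOR; not Lucene's ForUtil/PForUtil. */
public class PforSizeSketch {
  // Assumed cost of one patched exception: 1 byte position + 1 byte of high bits.
  public static final int EXCEPTION_BITS = 16;

  /** Number of bits needed to represent a non-negative value. */
  public static int bitsRequired(long value) {
    return 64 - Long.numberOfLeadingZeros(value);
  }

  /** FOR: every value is packed with the width of the largest value. */
  public static long forSizeBits(long[] block) {
    long max = Arrays.stream(block).max().orElse(0L);
    return (long) block.length * bitsRequired(max);
  }

  /** PFOR width: chosen so that at most maxExceptions values overflow it. */
  public static int pforWidth(long[] block, int maxExceptions) {
    long[] sorted = block.clone();
    Arrays.sort(sorted);
    int idx = Math.max(0, sorted.length - 1 - maxExceptions);
    return bitsRequired(sorted[idx]); // assumes a non-empty block
  }

  /** PFOR: narrow packing plus a fixed cost per overflowing value. */
  public static long pforSizeBits(long[] block, int maxExceptions) {
    int width = pforWidth(block, maxExceptions);
    long exceptions = Arrays.stream(block).filter(v -> bitsRequired(v) > width).count();
    return (long) block.length * width + exceptions * EXCEPTION_BITS;
  }

  /** Made-up block of 128 doc ID deltas: mostly small, two outliers. */
  public static long[] sampleBlock() {
    long[] block = new long[128];
    Arrays.fill(block, 3L);
    block[10] = 1000L;
    block[90] = 1000L;
    return block;
  }

  public static void main(String[] args) {
    long[] block = sampleBlock();
    System.out.println("FOR bits:  " + forSizeBits(block));
    System.out.println("PFOR bits: " + pforSizeBits(block, 7));
  }
}
```

In this model the two outliers force FOR to spend 10 bits on every delta, while PFOR spends 2 bits per delta plus the cost of two exceptions. The decode-side price is the extra patching step, which is the regression the benchmarks in this thread try to measure.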
[GitHub] [lucene] mikemccand commented on pull request #56: LUCENE-9883: Turn on ecj missingEnumCaseDespiteDefault setting
mikemccand commented on pull request #56: URL: https://github.com/apache/lucene/pull/56#issuecomment-813397017 Thank you for looking into backporting! I think it's fine to leave this as 9.x / Lucene only. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9573) Add back compat tests for VectorFormat to TestBackwardsCompatibility
[ https://issues.apache.org/jira/browse/LUCENE-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-9573: --- Fix Version/s: (was: main (9.0)) 9.1 > Add back compat tests for VectorFormat to TestBackwardsCompatibility > > > Key: LUCENE-9573 > URL: https://issues.apache.org/jira/browse/LUCENE-9573 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael Sokolov >Priority: Blocker > Fix For: 9.1 > > > In LUCENE-9322 we add a new VectorFormat to the index. This issue is about > adding backwards compatibility tests for it once the index format has > crystallized into its 9.0 form
[jira] [Commented] (LUCENE-9573) Add back compat tests for VectorFormat to TestBackwardsCompatibility
[ https://issues.apache.org/jira/browse/LUCENE-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314874#comment-17314874 ] Michael McCandless commented on LUCENE-9573: Sorry, yes, you are right [~jpountz]! This is a 9.1 blocker. I'll change the fix version! Sorry for the confusion :)
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314876#comment-17314876 ] Tomoko Uchida commented on LUCENE-9855: --- I understand the vector search is a kind of incubation project; I remember that at the very early stage I hoped we could start it from the sandbox project, but it was not possible with the apis. Here I've tried to find some level of consensus among us, but my attempt does not seem to be going well. :) (Nevertheless, I think I am inclined to support Julie's perspective or design on this.) I can neither say "it's a blocker so just fix it" nor simply throw this away; I'd like to wait a little while to hear from others.
[jira] [Commented] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery
[ https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314879#comment-17314879 ] Feng Guo commented on LUCENE-9857: -- Thank you for the reply! I'm so sorry that I did not notice the code you pointed out. I'm digging into a case where queries become slower after a forcemerge, but obviously this issue is not the root cause now. Thank you again for the explanation! > Skip cache building if IndexOrDocValuesQuery choose the dvQuery > --- > > Key: LUCENE-9857 > URL: https://issues.apache.org/jira/browse/LUCENE-9857 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0) >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > IndexOrDocValuesQuery can automatically use dvQueries when the cost > 8 * leadcost, and the LRUQueryCache skips cache building when cost > 250 (by default) * leadcost. There is a gap between 8 and 250, which means that if the factor falls between 8 and 250 (e.g. cost = 10 * leadcost), the IndexOrDocValuesQuery will choose the dvQueries but LRUQueryCache still builds a cache for it. > IndexOrDocValuesQuery aims to speed up queries when the leadcost is small, but building the cache with dvScorers can make it meaningless because it needs to scan all the docvalues. This can be rather slow for big segments, so maybe we should skip the cache building for IndexOrDocValuesQuery when it chooses dvQueries.
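The arithmetic behind the "gap" in the description above can be written out as a small sketch. Note that the reporter concluded in this same comment that the issue was not the root cause of the slowdown, so this only restates the description's reasoning; it does not describe how Lucene actually behaves. The class {{CostGapSketch}} and its method names are hypothetical, and the factors 8 and 250 are taken from the description as given.

```java
/** Restates the thresholds from the issue description; not Lucene code. */
public class CostGapSketch {
  // Factors as quoted in the description (250 is described as a default).
  public static final long DOC_VALUES_FACTOR = 8;
  public static final long SKIP_CACHE_FACTOR = 250;

  /** Per the description: doc values are chosen when cost > 8 * leadCost. */
  public static boolean choosesDocValues(long cost, long leadCost) {
    return cost > DOC_VALUES_FACTOR * leadCost;
  }

  /** Per the description: caching is skipped when cost > 250 * leadCost. */
  public static boolean skipsCaching(long cost, long leadCost) {
    return cost > SKIP_CACHE_FACTOR * leadCost;
  }

  /** The described gap: doc values chosen, yet a cache would still be built. */
  public static boolean inGap(long cost, long leadCost) {
    return choosesDocValues(cost, leadCost) && !skipsCaching(cost, leadCost);
  }

  public static void main(String[] args) {
    long leadCost = 1_000;
    System.out.println(inGap(10 * leadCost, leadCost));  // the description's example
    System.out.println(inGap(300 * leadCost, leadCost)); // above both thresholds
    System.out.println(inGap(5 * leadCost, leadCost));   // below both thresholds
  }
}
```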
[GitHub] [lucene] gf2121 closed pull request #29: LUCENE-9857: Skip cache building if IndexOrDocValuesQuery choose the dvQuery
gf2121 closed pull request #29: URL: https://github.com/apache/lucene/pull/29
[jira] [Resolved] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery
[ https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo resolved LUCENE-9857. -- Resolution: Fixed
[jira] [Commented] (LUCENE-9889) Lucene (unexpected ) fsync on existing segments
[ https://issues.apache.org/jira/browse/LUCENE-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314885#comment-17314885 ] Michael McCandless commented on LUCENE-9889: Thank you for opening this issue [~rahul196...@gmail.com]! It is indeed weird that Lucene is re-opening segment files that it long ago wrote, closed, and fsync'd, only to fsync them again. There is some fun/exciting history here. Long ago Lucene's {{IndexWriter}} used to keep track of which files were "dirty" (written recently and not yet fsync'd), but that was somehow complex and buggy and sometimes sprouted up bad memory leaks, and so at one point we moved that tracking from {{IndexWriter}} down into {{FSDirectory}}, but then somehow, later, we eventually just removed the dirty logic from {{FSDirectory}} and changed to always fsync'ing every file. I agree this is odd and we should perhaps revisit that dirty logic. Related issues: LUCENE-3237, LUCENE-5570, LUCENE-5588, LUCENE-6150 (this is where the dirty file tracking was removed from {{FSDirectory}}). > Lucene (unexpected ) fsync on existing segments > --- > > Key: LUCENE-9889 > URL: https://issues.apache.org/jira/browse/LUCENE-9889 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.7.2 >Reporter: Rahul Goswami >Priority: Major > > > If one of the existing segment files is opened by another (say a 3rd party) > process, it can cause a parallel commit to fail with an error complaining > that the index files are locked by another process. Upon debugging, I see > that fsync is being called during commit on already existing segment files, > and failure to open the file in write mode causes this. But this should not > be an expected behavior since there is no reason for a commit to open an > existing segment file in WRITE mode to fsync.
Please note that in this case, > the index file was also a part of a saved commit point, so there is all the > more reason to not fsync it. > > The line of code I am referring to is as below: > try (final FileChannel file = FileChannel.open(fileToSync, isDir ? > StandardOpenOption.READ : StandardOpenOption.WRITE)) > > in method fsync(Path fileToSync, boolean isDir) of the class file > > lucene\core\src\java\org\apache\lucene\util\IOUtils.java > > > Opening this Jira after discussion with Mike McCandless and Michael Sokolov on > the dev mailing list here: > [Lucene - Java Developer - Lucene (unexpected ) fsync on existing segments > (nabble.com)|https://lucene.472066.n3.nabble.com/Lucene-unexpected-fsync-on-existing-segments-td4469731.html] >
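The open-then-sync pattern quoted in the description can be tried outside Lucene with plain JDK NIO. The following is a minimal sketch, not a copy of {{IOUtils}}: the class {{FsyncSketch}} and the helper {{syncTempFile}} are hypothetical, and the {{force(true)}} call is an assumption about what follows the quoted {{FileChannel.open}} line. It shows why a regular file has to be openable for WRITE at sync time, which is the step that fails when another process holds the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of an fsync helper built on FileChannel; not Lucene's IOUtils. */
public class FsyncSketch {

  /** Opens the path (READ for directories, WRITE for files) and forces it to disk. */
  public static void fsync(Path fileToSync, boolean isDir) throws IOException {
    try (FileChannel file =
        FileChannel.open(
            fileToSync, isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE)) {
      // true = also flush file metadata, not just content.
      file.force(true);
    }
  }

  /** Writes a temporary file, syncs it, and cleans up; returns true on success. */
  public static boolean syncTempFile() {
    try {
      Path tmp = Files.createTempFile("fsync-sketch", ".bin");
      try {
        Files.write(tmp, "segment bytes".getBytes(StandardCharsets.UTF_8));
        fsync(tmp, false);
        return Files.size(tmp) == 13;
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("synced: " + syncTempFile());
  }
}
```

Opening a directory through {{FileChannel}} is platform dependent and is not exercised here. A variant that skips files already known to be synced would need the kind of "dirty" bookkeeping described in the comment above.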
[jira] [Commented] (LUCENE-9892) Ensure @since tag for public classes
[ https://issues.apache.org/jira/browse/LUCENE-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314886#comment-17314886 ] Michael McCandless commented on LUCENE-9892: Hi [~tomoko], I saw the discussion in LUCENE-9890, that we should fully decouple {{@since}} from the existing {{@lucene.experimental}} but I thought your idea here would still be interesting? I.e. it seems like there ought to be some simple static tooling that can detect when an API first appeared and insert {{@since}} tags into the source tree, maybe as part of pre-release process? > Ensure @since tag for public classes > > > Key: LUCENE-9892 > URL: https://issues.apache.org/jira/browse/LUCENE-9892 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Reporter: Tomoko Uchida >Priority: Minor > > Can we ensure that all public classes' documentation have @since tag by some > precommit task using doclet API.
[jira] [Commented] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery
[ https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314949#comment-17314949 ] Julie Tibshirani commented on LUCENE-9857: -- No problem!
[GitHub] [lucene] dweiss merged pull request #63: LUCENE-9901: UnicodeData.java has no regeneration task
dweiss merged pull request #63: URL: https://github.com/apache/lucene/pull/63
[jira] [Commented] (LUCENE-9901) UnicodeData.java has no regeneration task
[ https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315014#comment-17315014 ] ASF subversion and git services commented on LUCENE-9901: - Commit fbf9191abf2ad4acd26bae16e075cdeb79d33a39 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fbf9191 ] LUCENE-9901: UnicodeData.java has no regeneration task (#63) > UnicodeData.java has no regeneration task > - > > Key: LUCENE-9901 > URL: https://issues.apache.org/jira/browse/LUCENE-9901 > Project: Lucene - Core > Issue Type: Sub-task > Components: modules/analysis >Reporter: Uwe Schindler >Assignee: Dawid Weiss >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > When moving build system to gradle, we lost the following groovy script, > which is used to regenerate the UnicodeData.java file to be in line with the > actually used ICU4J version. The groovy script is still in the repository, > but it's no longer used by the build system: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy > To execute it, we need to convert it to a Gradle task that depends on the > same version of ICU4J that we use as analysis/icu dependency (not sure how to > do this, maybe it's easy using palantir). > The file should also be hashed and put into the regenerated file hashes: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java > Old Ant task is here: > https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94
[jira] [Resolved] (LUCENE-9901) UnicodeData.java has no regeneration task
[ https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9901. - Fix Version/s: main (9.0) Resolution: Fixed
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315067#comment-17315067 ] Michael Sokolov commented on LUCENE-9855: - OK, naming is hard! I think it will help to break down all the classes we are (or might be) talking about here. These are the bulk of the classes/packages added as part of this vector/knn search effort:
{code:java}
o.a.l.codecs: VectorFormat, VectorReader, VectorWriter
o.a.l.codecs.lucene90: Lucene90VectorFormat, Lucene90VectorReader, Lucene90VectorWriter
o.a.l.index: VectorValues, VectorValuesWriter, RandomAccessVectorValues, RandomAccessVectorValuesProducer
o.a.l.search:
o.a.l.util.hnsw: HnswGraph, HnswGraphBuilder, NeighborQueue, NeighborArray, BoundsChecker
{code}
I think the scope of this issue is basically: consider a more specific name for these vector apis (that isn't so easily confused with TermVectors), and use plural form. Then we got into a discussion of whether this format is hnsw-only, but [~julietibs] points out that (a) we already decided it would handle multiple ANN algos, and (b) we can have algorithm-specific names in the implementation classes (the ones in o.a.l.codecs.lucene90 + any associated utility classes) without needing to make that change anywhere else (at the interface level). [~rcmuir] also raised some other issues: one performance-related, another about whether we should have this strategy pattern at all. I might have missed something else? I think those are separate issues though: Robert, please feel free to open some other JIRA if you think we ought to pursue further? Given that, I think we are talking here about the names of:
{code:java}
o.a.l.codecs: VectorFormat, VectorReader, VectorWriter
o.a.l.index: VectorValues, VectorValuesWriter, RandomAccessVectorValues, RandomAccessVectorValuesProducer
{code}
We seem to be evolving some consensus around {{NumericVectors}}. 
I think if we are going to have a plural root like that, it makes no sense to add {{Values}} after it (NumericVectorsValues?), and the "values" name was really just copied from DocValues - it's not adding anything I think. I'd like to just change "VectorValues" to "NumericVectors" and "Vector" to "NumericVectors" but this leaves two {{NumericVectorsWriter}} classes in different packages. Maybe we could adopt the DocValues Producer/Consumer naming in the codecs package with this result:
{code:java}
o.a.l.codecs: NumericVectorsFormat, NumericVectorsProducer, NumericVectorsConsumer
o.a.l.index: NumericVectors, NumericVectorsWriter, RandomAccessNumericVectors, RandomAccessNumericVectorsSupplier
{code}
[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315066#comment-17315066 ] Greg Miller commented on LUCENE-9850: - Thanks [~mikemccand]! For starters, yes, all the runs I referenced here are using "wikimediumall". {quote}I would expect {{XTerm}} to show speedups since this is largely dominated by decoding many postings blocks. But it is odd to see the {{XTermYSort}} tasks negatively impacted: those tasks are just sorting by a {{DocValues}} field instead of default text relevance (BM25). {quote} I might expect the opposite actually. Anytime the PFOR approach has to apply exceptions, I would expect a performance hit of some sort since it has extra work to do on top of the FOR approach used today. So if the Term tasks are largely dominated by postings decoding, I would expect regressions to show up there more than elsewhere. Maybe I'm misunderstanding your comment though? I re-ran wikimediumall with only "Term" tasks and got the following (looks like noise to me):
{code:java}
Task                    QPS baseline   StdDev   QPS pfordocids   StdDev   Pct diff                p-value
HighTermDayOfYearSort           5.74  (10.8%)             5.59   (9.9%)   -2.6% ( -20% -  20%)    0.431
TermDTSort                     44.54  (15.4%)            44.06  (14.3%)   -1.1% ( -26% -  33%)    0.816
HighTermTitleBDVSort           30.90  (14.3%)            30.59  (13.5%)   -1.0% ( -25% -  31%)    0.820
MedTerm                       392.89   (6.8%)           389.45   (7.5%)   -0.9% ( -14% -  14%)    0.699
LowTerm                       412.80   (7.0%)           410.68   (7.9%)   -0.5% ( -14% -  15%)    0.827
PKLookup                      130.70   (2.8%)           131.44   (2.0%)    0.6% (  -4% -   5%)    0.470
HighTermMonthSort              61.69  (12.2%)            62.13  (13.6%)    0.7% ( -22% -  30%)    0.860
HighTerm                      381.73  (10.0%)           385.03   (7.8%)    0.9% ( -15% -  20%)    0.761
{code}
I also pulled out all the tasks in my last wikimediumall run that had a significant change in either direction (p-value <= 0.05) and reran them alone to see if the results were repeatable. In general, they were not. 
The three that *were* repeatable (significant) regressions were LowSpanNear (-2.2%), AndHighMed (-2.1%) and AndHighHigh (-2.0%): {code:java} TaskQPS baseline StdDevQPS pfordocids StdDev Pct diff p-value HighTermTitleBDVSort 42.09 (10.6%) 41.04 (10.4%) -2.5% ( -21% - 20%) 0.451 LowSpanNear4.29 (1.9%)4.20 (1.4%) -2.2% ( -5% -1%) 0.000 AndHighMed 26.79 (3.0%) 26.23 (2.4%) -2.1% ( -7% -3%) 0.014 AndHighHigh 13.83 (3.4%) 13.54 (2.8%) -2.0% ( -7% -4%) 0.037 HighTerm 696.28 (6.7%) 688.02 (7.0%) -1.2% ( -13% - 13%) 0.585 OrNotHighHigh 372.40 (5.3%) 370.23 (5.3%) -0.6% ( -10% - 10%) 0.726 HighTermDayOfYearSort 13.53 (9.8%) 13.46 (10.4%) -0.5% ( -18% - 21%) 0.866 MedSpanNear 19.76 (1.7%) 19.67 (1.4%) -0.5% ( -3% -2%) 0.320 HighTermMonthSort 48.94 (11.2%) 48.77 (10.5%) -0.3% ( -19% - 24%) 0.921 LowTerm 714.77 (3.0%) 714.27 (5.6%) -0.1% ( -8% -8%) 0.960 PKLookup 139.66 (3.0%) 139.69 (2.4%) 0.0% ( -5% -5%) 0.984 IntNRQ 33.49 (1.4%) 33.54 (1.2%) 0.2% ( -2% -2%) 0.690 {code} Finally, I tried running a micro-benchmark to see if I could isolate how much regression there was due to applying the exceptions in PFOR compared to FOR. I forked [~jpountz]'s code originally used for optimizing FOR. The results are in the README over [here|https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-delta] if interested. They generally make sense to me and show that performance does take a hit when there are exceptions to patch in, but I think we're seeing with the overall luceneutil benchmark's that those performance shifts aren't showing up in the bigger picture for the most part. So overall, it seems like the index size reduction might be worth it here, but I'm new to Lucene benchmarks and this is my first attempt at running a micro-benchmark, so I would trust the opinion of others with more experienced eyes here. 
> Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >
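The trade-off described in the comment above (PFOR shrinks a block by packing at a narrower width, but then has to patch exceptions on decode) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lucene's implementation: the class {{PforSketch}}, the cap of 7 exceptions per block, the 8-bit exception payload, and the 16-bit per-exception cost are all choices made for illustration, and values are kept unpacked in an {{int[]}} rather than bit-packed.

```java
import java.util.Arrays;

/** Toy model of FOR vs. PFOR for one non-empty block of non-negative deltas (not Lucene's code). */
final class PforSketch {
  // Assumed cap on patched values per block, chosen for illustration.
  static final int MAX_EXCEPTIONS = 7;
  // Assumed width of the high part stored for each exception.
  static final int EXCEPTION_BITS = 8;

  static final class Encoded {
    int bitsPerValue;     // width used for the low bits of every value
    int[] low;            // low bits of every value (left unpacked for clarity)
    int[] exceptionIndex; // positions whose value overflowed bitsPerValue
    int[] exceptionHigh;  // the overflowing high bits of those values
  }

  /** Number of bits needed to represent a non-negative value (0 for 0). */
  static int bitsRequired(int v) {
    return 32 - Integer.numberOfLeadingZeros(v);
  }

  /** FOR: the whole block is stored at the width of its largest value. */
  static int forBitsPerValue(int[] block) {
    int max = 0;
    for (int v : block) {
      max = Math.max(max, v);
    }
    return bitsRequired(max);
  }

  /** PFOR: pick a width that leaves at most MAX_EXCEPTIONS values to patch. */
  static int pforBitsPerValue(int[] block) {
    int[] sorted = block.clone();
    Arrays.sort(sorted);
    int cutoff = Math.max(0, sorted.length - 1 - MAX_EXCEPTIONS);
    int patchedBits = bitsRequired(sorted[cutoff]);
    int maxBits = bitsRequired(sorted[sorted.length - 1]);
    // An exception only carries EXCEPTION_BITS high bits, so never shrink further than that.
    return Math.max(patchedBits, maxBits - EXCEPTION_BITS);
  }

  static Encoded encode(int[] block) {
    Encoded e = new Encoded();
    e.bitsPerValue = pforBitsPerValue(block);
    int mask = (int) ((1L << e.bitsPerValue) - 1);
    e.low = new int[block.length];
    int[] idx = new int[MAX_EXCEPTIONS];
    int[] high = new int[MAX_EXCEPTIONS];
    int n = 0;
    for (int i = 0; i < block.length; i++) {
      e.low[i] = block[i] & mask;
      int h = block[i] >>> e.bitsPerValue;
      if (h != 0) { // value does not fit: remember where it is and what was cut off
        idx[n] = i;
        high[n] = h;
        n++;
      }
    }
    e.exceptionIndex = Arrays.copyOf(idx, n);
    e.exceptionHigh = Arrays.copyOf(high, n);
    return e;
  }

  static int[] decode(Encoded e) {
    int[] out = e.low.clone(); // the part FOR would also do
    for (int i = 0; i < e.exceptionIndex.length; i++) { // the extra patching pass
      out[e.exceptionIndex[i]] |= e.exceptionHigh[i] << e.bitsPerValue;
    }
    return out;
  }

  static int forSizeBits(int[] block) {
    return block.length * forBitsPerValue(block);
  }

  /** Assumes each exception costs one index byte plus one byte of high bits. */
  static int pforSizeBits(Encoded e) {
    return e.low.length * e.bitsPerValue + e.exceptionIndex.length * 2 * EXCEPTION_BITS;
  }

  public static void main(String[] args) {
    int[] block = new int[128];
    Arrays.fill(block, 3);
    block[17] = 1000; // a single outlier delta
    Encoded e = encode(block);
    System.out.println("FOR bits:  " + forSizeBits(block));
    System.out.println("PFOR bits: " + pforSizeBits(e)
        + " with " + e.exceptionIndex.length + " exception(s)");
    System.out.println("round trip ok: " + Arrays.equals(block, decode(e)));
  }
}
```

In this toy block one outlier forces FOR to use 10 bits for all 128 values (1280 bits), while the PFOR model uses 2 bits per value plus one patched exception (272 bits). The loop in {{decode}} is the extra work that the micro-benchmark tries to isolate.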
[GitHub] [lucene] gsmiller commented on a change in pull request #62: LUCENE-9902 Minor fixes to the faceting API
gsmiller commented on a change in pull request #62: URL: https://github.com/apache/lucene/pull/62#discussion_r607302178 ## File path: lucene/CHANGES.txt ## @@ -287,7 +287,10 @@ Other API Changes - -(No changes) + +* LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be protected instead of private. Review comment: Looks great! Have you opened up a corresponding PR against the 8x branch in the old repository as well for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #62: LUCENE-9902 Minor fixes to the faceting API
gautamworah96 commented on a change in pull request #62: URL: https://github.com/apache/lucene/pull/62#discussion_r607372040 ## File path: lucene/CHANGES.txt ## @@ -287,7 +287,10 @@ Other API Changes - -(No changes) + +* LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be protected instead of private. Review comment: I think the next branch for Lucene 8.9 will be cut from this repo? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a change in pull request #62: LUCENE-9902 Minor fixes to the faceting API
gsmiller commented on a change in pull request #62: URL: https://github.com/apache/lucene/pull/62#discussion_r607379305 ## File path: lucene/CHANGES.txt ## @@ -287,7 +287,10 @@ Other API Changes - -(No changes) + +* LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be protected instead of private. Review comment: @gautamworah96 If I understand correctly, you need to actually submit a separate PR to this branch: https://github.com/apache/lucene-solr/tree/branch_8x -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on a change in pull request #11: LUCENE-9334 Consistency of field data structures
jtibshirani commented on a change in pull request #11: URL: https://github.com/apache/lucene/pull/11#discussion_r607396127 ## File path: lucene/MIGRATE.md ## @@ -358,11 +358,21 @@ Rather, an IllegalArgumentException shall be thrown. This is introduced for bett defence and to ensure that there is no bubbling up of errors when Lucene is used in multi level applications -## Assumption of data consistency between different data-structures sharing the same field name +### Require consistency between data-structures on a per-field basis -Sorting on a numeric field that is indexed with both doc values and points may use an Review comment: I don't think this change enforces that the same values were passed to each field. So maybe we could keep this note about data consistency, as it's not fully covered under the new one "Require consistency between data structures..." I'm also curious if we plan to enforce value consistency in a follow-up? (Sorry if I'm missing something, I'm not fully up-to-date on the PR). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9905) Revise approach to specifying NN algorithm
Julie Tibshirani created LUCENE-9905: Summary: Revise approach to specifying NN algorithm Key: LUCENE-9905 URL: https://issues.apache.org/jira/browse/LUCENE-9905 Project: Lucene - Core Issue Type: Improvement Reporter: Julie Tibshirani In LUCENE-9322 we decided that the new vectors API shouldn’t assume a particular nearest-neighbor search data structure and algorithm. This flexibility is important since NN search is a developing area and we'd like to be able to experiment and evolve the algorithm. Right now we only have one algorithm (HNSW), but we want to maintain the ability to use another. Currently the algorithm to use is specified through {{SearchStrategy}}, for example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is expected to handle multiple algorithms. Instead we could have one format implementation per algorithm. Our current implementation would be HNSW-specific like {{HnswVectorFormat}}, and to experiment with another algorithm you could create a new implementation like {{ClusterVectorFormat}}. This would be better aligned with the codec framework, and help avoid exposing algorithm details in the API. A concrete proposal (note many of these names will change when LUCENE-9855 is addressed): 1. Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}. 2. Remove references to HNSW in {{SearchStrategy}}, so there is just {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like {{SimilarityFunction}}. 3. Remove {{FieldType}} attributes related to HNSW parameters (maxConn and beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}. 4. Introduce {{PerFieldVectorFormat}} to allow a different NN approach or parameters to be configured per-field (?) One note: the current HNSW-based format includes logic for storing a numeric vector per document, as well as constructing + storing a HNSW graph. 
When adding another implementation, it’d be nice to be able to reuse logic for reading/ writing numeric vectors. I don’t think we need to design for this right now, but we can keep it in mind for the future? This issue is based on a thread [~jpountz] started: [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
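The four-step proposal in the issue above can be sketched in plain Java. This is a hypothetical model, not Lucene's API: {{VectorFormatSketch}}, the nested {{VectorFormat}} interface, the {{register}}/{{forField}} methods, the field names, and the parameter values 16/100 and 32/200 are assumptions made for illustration. Only the names {{SimilarityFunction}}, {{HnswVectorFormat}}, {{PerFieldVectorFormat}}, {{maxConn}} and {{beamWidth}} come from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins for the proposal; none of these are Lucene's real classes. */
final class VectorFormatSketch {

  /** Step 2: the similarity no longer names an algorithm (no EUCLIDEAN_HNSW). */
  enum SimilarityFunction {
    EUCLIDEAN {
      @Override
      double compare(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return sum; // squared distance: lower means closer
      }
    },
    DOT_PRODUCT {
      @Override
      double compare(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          sum += a[i] * b[i];
        }
        return sum; // higher means closer
      }
    };

    abstract double compare(float[] a, float[] b);
  }

  /** One format implementation per NN algorithm. */
  interface VectorFormat {
    String name();
  }

  /** Steps 1 and 3: an HNSW-specific format that owns its tuning parameters. */
  static final class HnswVectorFormat implements VectorFormat {
    final int maxConn;
    final int beamWidth;

    HnswVectorFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }

    @Override
    public String name() {
      return "Hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
  }

  /** Step 4: route each field to its own format, falling back to a default. */
  static final class PerFieldVectorFormat implements VectorFormat {
    private final VectorFormat defaultFormat;
    private final Map<String, VectorFormat> byField = new HashMap<>();

    PerFieldVectorFormat(VectorFormat defaultFormat) {
      this.defaultFormat = defaultFormat;
    }

    PerFieldVectorFormat register(String field, VectorFormat format) {
      byField.put(field, format);
      return this;
    }

    VectorFormat forField(String field) {
      return byField.getOrDefault(field, defaultFormat);
    }

    @Override
    public String name() {
      return "PerField";
    }
  }

  public static void main(String[] args) {
    // Parameter values are arbitrary examples, not recommended settings.
    PerFieldVectorFormat perField =
        new PerFieldVectorFormat(new HnswVectorFormat(16, 100))
            .register("image_embedding", new HnswVectorFormat(32, 200));
    System.out.println(perField.forField("title_embedding").name());
    System.out.println(perField.forField("image_embedding").name());
    System.out.println(
        SimilarityFunction.EUCLIDEAN.compare(new float[] {0, 0}, new float[] {3, 4}));
  }
}
```

The point of the split is that a second algorithm (the issue's {{ClusterVectorFormat}}, for instance) would be another {{VectorFormat}} implementation with its own constructor arguments, while {{SimilarityFunction}} and the field type stay unchanged.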
[jira] [Updated] (LUCENE-9905) Revise approach to specifying NN algorithm
[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-9905: - Description: In LUCENE-9322 we decided that the new vectors API shouldn’t assume a particular nearest-neighbor search data structure and algorithm. This flexibility is important since NN search is a developing area and we'd like to be able to experiment and evolve the algorithm. Right now we only have one algorithm (HNSW), but we want to maintain the ability to use another. Currently the algorithm to use is specified through {{SearchStrategy}}, for example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is expected to handle multiple algorithms. Instead we could have one format implementation per algorithm. Our current implementation would be HNSW-specific like {{HnswVectorFormat}}, and to experiment with another algorithm you could create a new implementation like {{ClusterVectorFormat}}. This would be better aligned with the codec framework, and help avoid exposing algorithm details in the API. A concrete proposal (note many of these names will change when LUCENE-9855 is addressed): # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}. # Remove references to HNSW in {{SearchStrategy}}, so there is just {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like {{SimilarityFunction}}. # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}. # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or parameters to be configured per-field \(?\) One note: the current HNSW-based format includes logic for storing a numeric vector per document, as well as constructing + storing a HNSW graph. 
When adding another implementation, it’d be nice to be able to reuse logic for reading/ writing numeric vectors. I don’t think we need to design for this right now, but we can keep it in mind for the future? This issue is based on a thread [~jpountz] started: [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E] was: In LUCENE-9322 we decided that the new vectors API shouldn’t assume a particular nearest-neighbor search data structure and algorithm. This flexibility is important since NN search is a developing area and we'd like to be able to experiment and evolve the algorithm. Right now we only have one algorithm (HNSW), but we want to maintain the ability to use another. Currently the algorithm to use is specified through {{SearchStrategy}}, for example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is expected to handle multiple algorithms. Instead we could have one format implementation per algorithm. Our current implementation would be HNSW-specific like {{HnswVectorFormat}}, and to experiment with another algorithm you could create a new implementation like {{ClusterVectorFormat}}. This would be better aligned with the codec framework, and help avoid exposing algorithm details in the API. A concrete proposal (note many of these names will change when LUCENE-9855 is addressed): 1. Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}. 2. Remove references to HNSW in {{SearchStrategy}}, so there is just {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like {{SimilarityFunction}}. 3. Remove {{FieldType}} attributes related to HNSW parameters (maxConn and beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}. 4. 
Introduce {{PerFieldVectorFormat}} to allow a different NN approach or parameters to be configured per-field (?) One note: the current HNSW-based format includes logic for storing a numeric vector per document, as well as constructing + storing a HNSW graph. When adding another implementation, it’d be nice to be able to reuse logic for reading/ writing numeric vectors. I don’t think we need to design for this right now, but we can keep it in mind for the future? This issue is based on a thread [~jpountz] started: [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E] > Revise approach to specifying NN algorithm > -- > > Key: LUCENE-9905 > URL: https://issues.apache.org/jira/browse/LUCENE-9905 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > In LUCENE-9322 we decided that the new vectors API shouldn’t assume a > par
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315184#comment-17315184 ] Julie Tibshirani commented on LUCENE-9855: -- Quick note: I spun off LUCENE-9905 to discuss the approach for specifying the NN algorithm (question #2 in my comment above). Maybe we could hold off on a PR for LUCENE-9905 until we settle on names here, to avoid unnecessary conflicts. > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9906) TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error
[ https://issues.apache.org/jira/browse/LUCENE-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-9906: - Issue Type: Test (was: Improvement) > TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error > > > Key: LUCENE-9906 > URL: https://issues.apache.org/jira/browse/LUCENE-9906 > Project: Lucene - Core > Issue Type: Test >Reporter: Julie Tibshirani >Priority: Major > > This reproduced for me on {branch_8x}, but I couldn't reproduce on {main} > using the same seed. > {code} > ant test -Dtestcase=TestIndexSorting > -Dtests.method=testAddIndexesWithDeletionsAndDirectory > -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true > -Dtests.file.encoding=US-ASCII > [junit4] ERROR 0.83s | > TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<< > [junit4] > Throwable #1: java.lang.RuntimeException: index sort changed > from , to > [junit4] > at > __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0) > [junit4] > at > org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650) > [junit4] > at > org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301) > [junit4] > at > org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867) > [junit4] > at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) > [junit4] > at org.apache.lucene.util.IOUtils.close(IOUtils.java:77) > [junit4] > at > org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125) > [junit4] > at > org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141) > [junit4] > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > [junit4] > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) > [junit4] > at > 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit4] > at > java.base/java.lang.reflect.Method.invoke(Method.java:564) > [junit4] > at java.base/java.lang.Thread.run(Thread.java:832) > [junit4] 2> NOTE: test params are: codec=Asserting(Lucene87): > {id=FST50}, docValues:{bar=DocValuesFormat(name=Lucene80), > foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, > maxMBSortInHeap=5.567472541013199, > sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), > locale=tk-TM, timezone=Asia/Bishkek {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9906) TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error
Julie Tibshirani created LUCENE-9906:

Summary: TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error
Key: LUCENE-9906
URL: https://issues.apache.org/jira/browse/LUCENE-9906
Project: Lucene - Core
Issue Type: Improvement
Reporter: Julie Tibshirani

This reproduced for me on {branch_8x}, but I couldn't reproduce on {main} using the same seed.

{code}
ant test -Dtestcase=TestIndexSorting -Dtests.method=testAddIndexesWithDeletionsAndDirectory -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[junit4] ERROR 0.83s | TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<<
[junit4]    > Throwable #1: java.lang.RuntimeException: index sort changed from , to
[junit4]    >    at __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0)
[junit4]    >    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650)
[junit4]    >    at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
[junit4]    >    at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867)
[junit4]    >    at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
[junit4]    >    at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
[junit4]    >    at org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125)
[junit4]    >    at org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141)
[junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
[junit4]    >    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit4]    >    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[junit4]    >    at java.base/java.lang.Thread.run(Thread.java:832)
[junit4]   2> NOTE: test params are: codec=Asserting(Lucene87): {id=FST50}, docValues:{bar=DocValuesFormat(name=Lucene80), foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, maxMBSortInHeap=5.567472541013199, sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), locale=tk-TM, timezone=Asia/Bishkek
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9906) TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error
[ https://issues.apache.org/jira/browse/LUCENE-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-9906: - Description: This reproduced for me on {{branch_8x}} , but I couldn't reproduce on {{main}} using the same seed. {code} ant test -Dtestcase=TestIndexSorting -Dtests.method=testAddIndexesWithDeletionsAndDirectory -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 0.83s | TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<< [junit4] > Throwable #1: java.lang.RuntimeException: index sort changed from , to [junit4] >at __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0) [junit4] >at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650) [junit4] >at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301) [junit4] >at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867) [junit4] >at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) [junit4] >at org.apache.lucene.util.IOUtils.close(IOUtils.java:77) [junit4] >at org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125) [junit4] >at org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141) [junit4] >at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit4] >at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) [junit4] >at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4] >at java.base/java.lang.reflect.Method.invoke(Method.java:564) [junit4] >at java.base/java.lang.Thread.run(Thread.java:832) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene87): {id=FST50}, docValues:{bar=DocValuesFormat(name=Lucene80), 
foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, maxMBSortInHeap=5.567472541013199, sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), locale=tk-TM, timezone=Asia/Bishkek {code} was: This reproduced for me on {branch_8x}, but I couldn't reproduce on {main} using the same seed. {code} ant test -Dtestcase=TestIndexSorting -Dtests.method=testAddIndexesWithDeletionsAndDirectory -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 0.83s | TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<< [junit4] > Throwable #1: java.lang.RuntimeException: index sort changed from , to [junit4] >at __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0) [junit4] >at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650) [junit4] >at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301) [junit4] >at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867) [junit4] >at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) [junit4] >at org.apache.lucene.util.IOUtils.close(IOUtils.java:77) [junit4] >at org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125) [junit4] >at org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141) [junit4] >at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit4] >at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) [junit4] >at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4] >at java.base/java.lang.reflect.Method.invoke(Method.java:564) [junit4] >at java.base/java.lang.Thread.run(Thread.java:832) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene87): {id=FST50}, 
docValues:{bar=DocValuesFormat(name=Lucene80), foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, maxMBSortInHeap=5.567472541013199, sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), locale=tk-TM, timezone=Asia/Bishkek {code} > TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error > > > Key: LUCENE-9906 > URL: https://issues.apache.org/jira/browse/LUCENE-9906 > Project: Lucene - Core
[jira] [Updated] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9855: -- Fix Version/s: main (9.0) > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Blocker > Fix For: main (9.0) > > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org