[jira] [Updated] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrik Hertel updated LUCENE-10562:
-----------------------------------
    Description:

I use Solr and have a large system with 1TB in one core and about 5 million documents. The textual content of large PDF files is indexed there. My query is extremely slow (more than 30 seconds) as soon as I use wildcards, e.g.
{code:java}
*searchvalue*
{code}
even though I put a filter query in front of it that reduces the result set to fewer than 20 documents.

searchvalue -> less than 1 second
searchvalue* -> less than 1 second

My query:
{code:java}
select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
{code}
I've tried everything imaginable. It doesn't make sense to me that a search over a small subset should take so long. If I omit the filter query metadataitemids_is:20950 and instead search the entire inventory, it takes the same amount of time. Therefore, I suspect that despite the filter query, the main query runs over the entire index.

    was: the same description, except that the example query also contained a second filter, fq=renditions_ss%3A.
> Large system: Wildcard search leads to full index scan despite filter query
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10562
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10562
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.11.1
>            Reporter: Henrik Hertel
>            Priority: Major
>              Labels: performance

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533735#comment-17533735 ]

Tomoko Uchida commented on LUCENE-10562:
----------------------------------------

Infix or suffix wildcard queries are extremely slow by nature and are not recommended - see the documentation:
https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/search/WildcardQuery.html
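The asymmetry Tomoko describes can be illustrated with a toy sorted term dictionary: a prefix pattern ("searchvalue*") can seek straight to its range, while an infix pattern ("*searchvalue*") gives the sort order nothing to work with, so every term must be visited. This is only a sketch under made-up names, not Lucene's actual FST-backed terms index:

```java
import java.util.*;

// Toy sorted term dictionary illustrating the cost asymmetry between
// "prefix*" and "*infix*" patterns. Not Lucene's real terms index.
public class WildcardCost {
    static int inspected; // number of terms visited by the last call

    // "*needle*": the sort order is useless, every term must be checked.
    static List<String> infixMatch(NavigableSet<String> terms, String needle) {
        inspected = 0;
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            inspected++;
            if (t.contains(needle)) hits.add(t);
        }
        return hits;
    }

    // "prefix*": seek to the prefix, stop at the first non-matching term.
    static List<String> prefixMatch(NavigableSet<String> terms, String prefix) {
        inspected = 0;
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(prefix)) {
            inspected++;
            if (!t.startsWith(prefix)) break;
            hits.add(t);
        }
        return hits;
    }
}
```

With a 1TB core of PDF text, the term dictionary is huge, so the full visit in `infixMatch` dominates query time regardless of any filter.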
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533739#comment-17533739 ]

Tomoko Uchida commented on LUCENE-10562:
----------------------------------------

As for "despite the filter query" - sorry, but why do you assume that filters are executed before wildcard queries?
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533743#comment-17533743 ]

Henrik Hertel commented on LUCENE-10562:
----------------------------------------

Thanks for your answer. Well, from my naive point of view, I would expect the textual search to be performed only over the subset. But this is probably not possible, is it? I have read various sources, and there have been conflicting statements about this.
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533747#comment-17533747 ]

Tomoko Uchida commented on LUCENE-10562:
----------------------------------------

I'm not fully sure how filters are implemented in Solr, but at least in recent Lucene there is no substantial difference between filters and queries in implementation (a filter is just a normal query that skips score calculation), and there is no way to control query execution order (as far as I know).
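The practical consequence for this issue: even when the filter and the wildcard clause are intersected document by document, the expensive part of a wildcard query is expanding the pattern against the term dictionary, and that cost does not shrink with the filter's result set. A toy sketch of the two phases (hypothetical names, not Lucene's actual scorer machinery):

```java
import java.util.*;

// Why the filter doesn't help: before any per-document intersection can
// happen, the wildcard clause must enumerate every matching term in the
// whole term dictionary to collect its postings. That enumeration cost
// is independent of how many documents the filter keeps. Illustrative only.
public class FilterVsWildcard {
    // Phase 1 (expensive): expand "*needle*" against the full term
    // dictionary, unioning the postings of every matching term.
    static SortedSet<Integer> expandWildcard(Map<String, int[]> termDict, String needle) {
        SortedSet<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, int[]> e : termDict.entrySet()) { // full scan
            if (e.getKey().contains(needle)) {
                for (int d : e.getValue()) docs.add(d);
            }
        }
        return docs;
    }

    // Phase 2 (cheap): intersect with the filter's document set.
    static SortedSet<Integer> intersect(SortedSet<Integer> wildcardDocs, Set<Integer> filterDocs) {
        SortedSet<Integer> out = new TreeSet<>(wildcardDocs);
        out.retainAll(filterDocs);
        return out;
    }
}
```

Even if the filter keeps only 20 documents, phase 1 still walks the entire dictionary, which matches the reporter's observation that the query takes the same time with or without the filter.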
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533751#comment-17533751 ]

Tomoko Uchida commented on LUCENE-10562:
----------------------------------------

One thing I can recommend: instead of using regex queries for suffix match, you can "reverse" each term and turn a suffix match into a prefix match. ReverseStringFilter does the trick:
https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html
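The reversing trick can be sketched outside Lucene in a few lines: index each term reversed, and a suffix query becomes a prefix seek over the reversed terms. This is only a toy model of what ReverseStringFilter enables inside an analysis chain (the class and method names below are made up):

```java
import java.util.*;

// Toy model of the "reversed terms" strategy: a second, reversed copy of
// each term turns "*suffix" into a cheap "reverse(suffix)*" prefix seek.
public class ReverseTrick {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Index time: store each term reversed (what ReverseStringFilter
    // produces when placed in the indexing analyzer).
    static NavigableSet<String> reversedIndex(Collection<String> terms) {
        NavigableSet<String> idx = new TreeSet<>();
        for (String t : terms) idx.add(reverse(t));
        return idx;
    }

    // Query time: "*suffix" on the original terms is equivalent to a
    // prefix seek for reverse(suffix) on the reversed terms.
    static List<String> suffixMatch(NavigableSet<String> reversedIdx, String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        for (String t : reversedIdx.tailSet(prefix)) {
            if (!t.startsWith(prefix)) break;
            hits.add(reverse(t)); // un-reverse for display
        }
        return hits;
    }
}
```

The trade-off, discussed below, is that the index now carries a second copy of every term.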
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533751#comment-17533751 ]

Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 10:57 AM:
-----------------------------------------------------------------

One thing I can recommend: instead of using wildcard queries for suffix match, you can "reverse" each term and turn a suffix match into a prefix match. ReverseStringFilter does the trick:
https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html

was (Author: tomoko uchida): the same comment, with "regex queries" instead of "wildcard queries".
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533762#comment-17533762 ]

Henrik Hertel commented on LUCENE-10562:
----------------------------------------

Sure, that could help, but I guess it would increase my index size by some factor. I might evaluate that option - thanks a lot.
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533765#comment-17533765 ]

Tomoko Uchida commented on LUCENE-10562:
----------------------------------------

bq. I guess that would increase my index size by some factor

You're right - this is the pain point of the "reversed terms" strategy.
[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1121144832

@jtibshirani Thanks for the comment. I've rerun the benchmarks as you suggested, and here are the new results:

```txt
k     Approach                                            Recall        QPS
10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.571   1874.073
50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.801    752.443
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.865    463.214
500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.959    129.944
800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.974     87.815
1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.980     73.514
10    hnswlib ({'M': 32, 'efConstruction': 100})           0.552  16745.433
50    hnswlib ({'M': 32, 'efConstruction': 100})           0.794   5738.468
100   hnswlib ({'M': 32, 'efConstruction': 100})           0.860   3336.386
500   hnswlib ({'M': 32, 'efConstruction': 100})           0.956    832.982
800   hnswlib ({'M': 32, 'efConstruction': 100})           0.973    541.097
1000  hnswlib ({'M': 32, 'efConstruction': 100})           0.979    442.163
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [lucene] mayya-sharipova opened a new pull request, #874: LUCENE-10471 Increase max dims for vectors to 2048
mayya-sharipova opened a new pull request, #874:
URL: https://github.com/apache/lucene/pull/874

Increase the maximum number of dims for KNN vectors to 2048.

The current maximum allowed number of dimensions is 1024, but in practice we see a number of models that produce vectors with more than 1024 dimensions, especially for image encoding (e.g. mobilenet_v2 uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing max dims to `2048` will satisfy these use cases. We do not recommend increasing vector dims any further.
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856 ]

Uwe Schindler commented on LUCENE-10562:
----------------------------------------

Hi, I think these questions do not relate to Lucene and are not issues at all. They should be asked on the Solr mailing list: us...@solr.apache.org. This is not a bug, and there is no way to improve this situation inside Lucene.

Some additional hints:
- Consider using the reverse wildcard filter in Solr (there is documentation about this). But this won't help if you need a wildcard on both sides of the star.
- Consider disabling wildcards for end users in your case (the flexible or dismax query parser in Solr can do this).

In general, using wildcards in a full-text search engine is a sign that text analysis is working incorrectly. Based on your name and profile, this looks like a typical "German language problem". In German, compounds are common ("Donaudampfschiffahrtskapitän", the captain of a steam-powered ship on the German river Donau), and users reaching for wildcards is usually a sign of missing decompounding. This can be done with the hyphenation-compound token filter in combination with dictionaries. An example, with minimal data files for the German language, is here: https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.
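The decompounding idea Uwe describes - splitting compounds into dictionary words at index time so the parts are searchable without wildcards - can be sketched naively. This is not Lucene's hyphenation-compound token filter, just the concept; the dictionary and all names below are made up:

```java
import java.util.*;

// Naive dictionary-driven decompounding: split a compound into two known
// dictionary words so each part becomes its own searchable token.
// Illustrative only; real decompounding uses hyphenation patterns and
// handles linking morphemes, multiple splits, etc.
public class Decompounder {
    static List<String> decompound(String word, Set<String> dict) {
        for (int i = 1; i < word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (dict.contains(head) && dict.contains(tail)) {
                return Arrays.asList(head, tail); // emit both parts as tokens
            }
        }
        return Collections.singletonList(word); // no split found
    }
}
```

Once "dampfschiff" is indexed as the tokens "dampf" and "schiff", a user can find it with the plain term query "schiff" - no "*schiff" wildcard required.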
[jira] [Resolved] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-10562.
------------------------------------
    Resolution: Won't Fix
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856 ]

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:30 PM:
----------------------------------------------------------------

The comment is unchanged except for typo fixes and a third hint added to the list: "- Decompounding may be needed (see below)".
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856 ] Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:31 PM: Hi, I think these questions do not relate to Lucene and are not issues at all. They should be asked on the Solr mailing list: us...@solr.apache.org. This is not a bug, and there is no way to improve this situation inside Lucene. Some additional hints: - Consider using the reverse wildcard filter in Solr (there's documentation about this). But this won't help if you need a wildcard on both sides of the term. - Consider disabling wildcards for end users in your case (the flexible or dismax query parser in Solr can do this). - Decompounding may be needed (see below). In general, needing wildcards in a full-text search engine is a sign that text analysis is working incorrectly. Based on your name and profile, this looks like a typical "German language problem". In German, compounds are common ("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on the German river Donau), and users reaching for wildcards is usually a sign of missing decompounding. This can be done with the hyphenation-compound token filter in combination with dictionaries. An example and minimized data files for the German language are here: https://github.com/uschindler/german-decompounder When you do decompounding, wildcards should not be needed.
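The decompounding suggestion can be illustrated with a toy greedy dictionary splitter. This is a hypothetical sketch, not Lucene's hyphenation-compound token filter (the real filter combines a hyphenation grammar with a dictionary and handles German linking elements such as the Fugen-s, which this toy ignores):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy decompounder: greedily splits a compound into dictionary words,
// longest match first. Indexing the resulting subwords as extra tokens is
// what makes leading/trailing wildcards unnecessary for compound languages.
class ToyDecompounder {
  private final Set<String> dict;
  private final int maxLen;

  ToyDecompounder(Set<String> dict) {
    this.dict = dict;
    this.maxLen = dict.stream().mapToInt(String::length).max().orElse(0);
  }

  List<String> split(String compound) {
    List<String> parts = new ArrayList<>();
    int pos = 0;
    while (pos < compound.length()) {
      String match = null;
      // longest-match-first so we do not cut a long dictionary word short
      for (int end = Math.min(compound.length(), pos + maxLen); end > pos; end--) {
        String cand = compound.substring(pos, end);
        if (dict.contains(cand)) {
          match = cand;
          break;
        }
      }
      if (match == null) { // no dictionary word here: keep the input unsplit
        return List.of(compound);
      }
      parts.add(match);
      pos += match.length();
    }
    return parts;
  }

  public static void main(String[] args) {
    ToyDecompounder d =
        new ToyDecompounder(Set.of("donau", "dampf", "schiff", "fahrt", "kapitän"));
    System.out.println(d.split("donaudampfschiff")); // [donau, dampf, schiff]
  }
}
```

Once "schiff" is indexed as its own token, a plain term query for it matches documents containing "donaudampfschiff", with no wildcard expansion at all.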
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533857#comment-17533857 ] Uwe Schindler commented on LUCENE-10562: As an explanation of why this is slow: it has nothing to do with filters or which query runs first. The problem occurs before that: wildcard queries are expanded to filter bitsets / large OR queries during query preprocessing (rewrite mode). This happens before the actual query is executed. So as soon as you have a wildcard with many matching terms, the preprocessing takes a significant amount of time. The actual query execution is fast and can be optimized. Due to the way an inverted index is built, there's no way to use another query to limit the amount of preprocessing work.
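The rewrite cost described above can be made concrete with a toy model. This is not Lucene's actual rewrite code (real Lucene walks a block-tree terms dictionary with automata); it only shows why an infix wildcard must visit every term in the field while a trailing wildcard only visits one prefix range:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model of wildcard rewrite: before any documents are matched, an
// *infix* wildcard forces a scan of the ENTIRE sorted term dictionary,
// because no prefix is available to seek with. A trailing wildcard
// ("value*") only needs the terms in one contiguous prefix range.
class WildcardRewrite {

  // *inner* -> scan all terms, keep those containing "inner"
  static List<String> rewriteInfix(NavigableSet<String> termDict, String inner) {
    List<String> expanded = new ArrayList<>();
    for (String term : termDict) { // O(total number of terms in the field)
      if (term.contains(inner)) {
        expanded.add(term);
      }
    }
    return expanded;
  }

  // prefix* -> seek to the prefix, stop at the first term past the range
  static List<String> rewritePrefix(NavigableSet<String> termDict, String prefix) {
    List<String> expanded = new ArrayList<>();
    for (String term : termDict.tailSet(prefix)) { // seek, then short scan
      if (!term.startsWith(prefix)) {
        break;
      }
      expanded.add(term);
    }
    return expanded;
  }

  public static void main(String[] args) {
    NavigableSet<String> dict =
        new TreeSet<>(List.of("rechnung", "searchvalue", "teilsearchvalue", "zahl"));
    System.out.println(rewriteInfix(dict, "searchvalue"));  // visits all 4 terms
    System.out.println(rewritePrefix(dict, "searchvalue")); // visits only 2 terms
  }
}
```

This also matches the reported timings: the filter query cannot help because it applies after rewrite, and the rewrite cost depends only on the term dictionary, not on how many documents the filter keeps.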
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533857#comment-17533857 ] Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:38 PM: As an explanation of why this is slow: it has nothing to do with filters or which query runs first. The problem occurs before that: wildcard queries are expanded to filter bitsets / large OR queries during query preprocessing (rewrite mode). This happens before the actual query is executed. So as soon as you have a wildcard with many matching terms, the preprocessing takes a significant amount of time. The actual query execution is fast and can be optimized. Due to the way an inverted index is built, there's no way to use another query to limit the amount of preprocessing work. The preprocessing time is linear in the total number of terms in the field, not the size of the index or the number of documents.
[GitHub] [lucene] jpountz opened a new pull request, #875: LUCENE-10560: Speed up OrdinalMap construction a bit.
jpountz opened a new pull request, #875: URL: https://github.com/apache/lucene/pull/875 I benchmarked OrdinalMap construction over high-cardinality fields, and lots of time gets spent in `PriorityQueue#downHeap` due to entry comparisons. I added a small hack that speeds up these comparisons a bit by extracting the first 8 bytes of the terms as a comparable unsigned long, and using this long whenever possible for comparisons. On a dataset that consists of 100M documents and 10M unique values, each consisting of 16 random bytes, OrdinalMap construction went from 9.4s to 6.0s. With the same number of docs/values, where values consist of the same 8-byte prefix followed by 8 random bytes to simulate a worst-case scenario for this change, OrdinalMap construction went from 9.6s to 10.1s. So this looks like it can yield a significant speedup in some scenarios, while the slowdown is contained in the worst-case scenario? Unfortunately, this worst-case scenario is not exactly unlikely, e.g. this is what you would get with a dataset of IPv4-mapped IPv6 addresses, where all values share the same 12-byte prefix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
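The trick described in the PR can be sketched in plain Java. This is a hypothetical simplification, not the actual patch (which operates on Lucene's BytesRef terms): pack the first 8 bytes big-endian into a long, so that unsigned long order matches lexicographic byte order and most comparisons become a single primitive compare.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the 8-byte-prefix comparison trick: big-endian packing makes
// Long.compareUnsigned agree with unsigned lexicographic byte order, so the
// byte-wise loop is only needed when the first 8 bytes are identical.
class PrefixCompare {

  // Pack the first 8 bytes big-endian; pad short terms with zero bytes.
  static long prefix8(byte[] term) {
    long p = 0;
    for (int i = 0; i < 8; i++) {
      int b = i < term.length ? (term[i] & 0xFF) : 0;
      p = (p << 8) | b;
    }
    return p;
  }

  static int compare(byte[] a, byte[] b) {
    int c = Long.compareUnsigned(prefix8(a), prefix8(b));
    if (c != 0) {
      return c; // fast path: decided by the first 8 bytes
    }
    // Slow path: shared 8-byte prefix, fall back to full byte-wise compare.
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    byte[] x = "apple".getBytes(StandardCharsets.UTF_8);
    byte[] y = "banana".getBytes(StandardCharsets.UTF_8);
    System.out.println(compare(x, y) < 0); // decided on the fast path
  }
}
```

The worst case mentioned in the PR is exactly when `prefix8` collides for every pair (all terms share the first 8 bytes), so every comparison pays for both the packing and the byte-wise loop.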
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533874#comment-17533874 ] Henrik Hertel commented on LUCENE-10562: [~uschindler] you are correct, it is indeed a German problem :) Thank you for the explanations and the tips; I will definitely take a look at everything!
[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533911#comment-17533911 ] Balmukund Mandal commented on LUCENE-9625: -- I was trying to run the benchmark and have a couple of questions. Indexing takes a long time, so is there a way to configure the benchmark to use an already existing index for search? Also, is there a way to configure the benchmark to use multiple threads for indexing (it looks to me like indexing is single-threaded)? > Benchmark KNN search with ann-benchmarks > > > Key: LUCENE-9625 > URL: https://issues.apache.org/jira/browse/LUCENE-9625 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In addition to benchmarking with luceneutil, it would be good to be able to > make use of ann-benchmarks, which is publishing results from many approximate > knn algorithms, including the hnsw implementation from its authors. We don't > expect to challenge the performance of these native code libraries, however > it would be good to know just how far off we are. > I started looking into this and posted a fork of ann-benchmarks that uses > KnnGraphTester class to run these: > https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to > manually copy jars and the KnnGraphTester.class to the test host machine > rather than downloading from a distribution. KnnGraphTester needs some > modifications in order to support this process - this issue is mostly about > that. > One thing I noticed is that some of the index builds with higher fanout > (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is > concerning and I'll open a separate issue for trying to improve that.
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1121351525 @Yuti-G would it make sense to close out this PR since I don't think we plan to merge this as it is?
[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533912#comment-17533912 ] Greg Miller commented on LUCENE-10538: -- So I think the order of operations here is: 1. Deliver [LUCENE-10550|https://issues.apache.org/jira/browse/LUCENE-10550], which would effectively _copy_ the current "top children" functionality of range faceting to a new API method for getting all children (which is what it's really doing). 2. Fix the existing "top children" functionality of range faceting to actually return top children (and honor the top-n parameter). I think this issue now effectively captures #2, and is blocked until LUCENE-10550 is delivered. Does that sound right [~yutinggan]? > TopN is not being used in getTopChildren() > -- > > Key: LUCENE-10538 > URL: https://issues.apache.org/jira/browse/LUCENE-10538 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > When looking at the overridden implementation getTopChildren(int topN, String > dim, String... path) in RangeFacetCounts, I found that the topN parameter is > not being used in the code, and the unit tests did not test this function > properly. I will create a PR to fix this, and will look into other overridden > implementations and see if they have the same issue. Please let me know if > there is any question. Thanks!
[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format
jtibshirani commented on code in PR #870: URL: https://github.com/apache/lucene/pull/870#discussion_r868240961 ## lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.lucene92; + +import java.io.IOException; +import org.apache.lucene.codecs.KnnVectorsFormat; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.util.hnsw.HnswGraph; + +/** + * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph + * connecting the documents having values. The graph is used to power HNSW search. The format + * consists of three files: + * + * .vec (vector data) file Review Comment: Thanks! Could you please list them in order like we do for the `.vex` file below? I think it makes it more precise/ easier to read. 
[GitHub] [lucene] jtibshirani commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW
jtibshirani commented on PR #872: URL: https://github.com/apache/lucene/pull/872#issuecomment-1121379649 Thanks, this looks the same as what I was seeing now! It's good motivation to add Lucene to ann-benchmarks so we can stop using a custom local benchmark set-up!
[GitHub] [lucene] jtibshirani commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW
jtibshirani commented on code in PR #872: URL: https://github.com/apache/lucene/pull/872#discussion_r868255093 ## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java: ## @@ -53,12 +53,14 @@ public final class Lucene91HnswVectorsWriter extends KnnVectorsWriter { private final int maxDoc; private final int maxConn; + private final int maxConn0; private final int beamWidth; private boolean finished; - Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int beamWidth) + Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int maxConn0, int beamWidth) Review Comment: I was thinking we could just keep a single configuration parameter here, and internally calculate `maxConn0 = 2 * M`. If we allow it to be passed as a parameter, it seems like it's important to be able to configure it, but that's not the case (it is not something users will change and should always be set to `2 * M`). Like that we could also avoid writing a new value `maxConn0` into the format, which doesn't seem necessary? If we are worried about naming, we could rename `maxConn` to `M`. From my perspective, it's okay to use single-letter variable names (with a clear comment!) when it directly corresponds to a paper's algorithm.
[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format
mayya-sharipova commented on PR #870: URL: https://github.com/apache/lucene/pull/870#issuecomment-1121489041 @msokolov Thanks for your feedback on this PR. I am wondering if you have any further feedback on this work. It would be nice to get it merged for the Lucene 9.2 release.
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982 ] Tomoko Uchida commented on LUCENE-10562: Yes, it's not about "query execution (retrieving the inverted index)" but "what postings you are traversing"; there's no way to optimize when you have so many postings that match the wildcard term. Just as a practical tip, I usually enforce two or three leading characters for wildcard queries when it's needed to reduce the search space. bq. Consider using the reverse wildcard filter in Solr (there's documentation about this). But this won't help if you need a wildcard on both sides of the term I think you could run conjunction queries over two fields (one for the normal text field, one for the reversed field) to support the infix wildcard query - not sure it is worth adding another text field when the index is already large. Anyway, dictionary-based decomposition looks promising to me in the case of German (though I have little knowledge of it beyond the very basics I learned in university lectures).
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982 ] Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 7:43 PM: Yes, it's not about "query execution (retrieving the inverted index)" but "what postings you are traversing"; there's no way to optimize when you have so many postings that match the wildcard term. Just as a practical tip, I usually enforce two or three leading characters for wildcard queries when it's needed to reduce the search space. {quote}Consider using the reverse wildcard filter in Solr (there's documentation about this). But this won't help if you need a wildcard on both sides of the term {quote} -I think you could run conjunction queries over two fields (one for the normal text field, one for the reversed field) to support the infix wildcard query- - not sure it is worth adding another text field when the index is already large. Anyway, dictionary-based decomposition looks promising to me in the case of German (though I have little knowledge of it beyond the very basics I learned in university lectures). Correction: a conjunction query does not work in this situation - sorry. nGram or more sophisticated term decomposition will be needed.
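The nGram tip can be sketched as follows. This is a toy illustration, not Lucene's NGramTokenFilter, and it only checks gram containment; a real implementation would use a phrase query so the grams must also be consecutive:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Toy illustration of the n-gram workaround for infix wildcards: if every
// term is indexed as its character 3-grams, an infix query like *chval*
// becomes exact lookups of the grams "chv", "hva", "val" (plus a phrase
// check), with no term-dictionary scan at all.
class NGrams {
  static List<String> grams(String term, int n) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i + n <= term.length(); i++) {
      out.add(term.substring(i, i + n));
    }
    return out;
  }

  // Approximation: a term can match "*infix*" only if it contains every
  // gram of the infix (a phrase query would also enforce their order).
  static boolean matchesInfix(String indexedTerm, String infix, int n) {
    return new HashSet<>(grams(indexedTerm, n)).containsAll(grams(infix, n));
  }

  public static void main(String[] args) {
    System.out.println(grams("chval", 3));                       // [chv, hva, val]
    System.out.println(matchesInfix("searchvalue", "chval", 3)); // true
    System.out.println(matchesInfix("rechnung", "chval", 3));    // false
  }
}
```

The trade-off is index size: every term is stored many times over as grams, which matters for an index that is already 1TB.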
[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress
[ https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533993#comment-17533993 ] Peixin Li commented on LUCENE-10551: We have identified that the issue is not related to Lucene code. We are using GraalVM, which does some optimization on this part of the code, and that caused the issue. We changed [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/compress/LowercaseAsciiCompression.java#L55] from {code:java} while (i - previousExceptionIndex > 0xFF) { ++numExceptions; previousExceptionIndex += 0xFF; } {code} to {code:java} while (i - previousExceptionIndex > 0xFF) { log.trace("{}", previousExceptionIndex); ++numExceptions; previousExceptionIndex += 0xFF; } {code} and the IllegalStateException no longer shows up. > LowercaseAsciiCompression should return false when it's unable to compress > -- > > Key: LUCENE-10551 > URL: https://issues.apache.org/jira/browse/LUCENE-10551 > Project: Lucene - Core > Issue Type: Bug > Environment: Lucene version 8.11.1 >Reporter: Peixin Li >Priority: Major > Attachments: LUCENE-10551-test.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > {code:java} > Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 > cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion > cloud gen2tion instance - dev1tion instance - > testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o > at > org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) > at > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) > at > org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) > at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267) > at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) > at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) > at > org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) > at > org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) > at > org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) > at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) > {code} > {code:java} > key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow, > 
resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, > domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1}) > java.lang.IllegalStateException: 29 <> 16 > analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982 ] Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 8:08 PM: Yes, it's not about "query execution (retrieving the inverted index)" but about "what postings you are traversing"; there's no way to optimize when so many postings match the wildcard term. As a practical tip, I usually enforce two or three leading characters for wildcard queries when the search space needs to be reduced. {quote}Consider using the reverse wildcard filter in Solr (there's documentation about this). But this won't help if you need a wildcard on both sides of the term {quote} -I think you could run conjunction queries of two fields (one for the normal text field, one for the reversed field) to support the infix wildcard query- - not sure it is worth adding another text field when the index is already large. Anyway, dictionary-based decomposition looks promising to me in the case of German (though I have little knowledge of it beyond the basics I learned in university lectures). Correction: a conjunction query does not work in this situation - sorry. nGram or more sophisticated term decomposition will be needed. For example, in my language (Japanese - which does not even have spaces between terms), the combination of an off-the-shelf nGram filter and phrase search often works well. > Large system: Wildcard search leads to full index scan despite filter query > --- > > Key: LUCENE-10562 > URL: https://issues.apache.org/jira/browse/LUCENE-10562 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 8.11.1 >Reporter: Henrik Hertel >Priority: Major > Labels: performance > > I use Solr and have a large system with 1TB in one core and about 5 million > documents. The textual content of large PDF files is indexed there. My query > is extremely slow (more than 30 seconds) as soon as I use wildcards e.g. > {code:java} > *searchvalue* > {code} > , even though I put a filter query in front of it that reduces to less than > 20 documents. > searchvalue -> less than 1 second > searchvalue* -> less than 1 second > My query: > {code:java} > select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0 > {code} > I've tried everything imaginable. It doesn't make sense to me why a search > over a small subset should take so long. If I omit the filter query > metadataitemids_is:20950, so search the entire inventory, then it also takes > the same amount of time. Therefore, I suspect that despite the filter query, > the main query runs over the entire index.
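As a concrete illustration of the nGram-plus-phrase-search idea above, here is a toy sketch in plain Java (the class and method names are hypothetical, and this is not Lucene's actual NGramTokenFilter): indexing character n-grams turns an infix wildcard like *searchvalue* into exact-term lookups, so the term dictionary no longer has to be scanned end to end.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NGramSketch {
    // Emit all n-grams of length n, the way a character n-gram token filter would.
    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // A document can match *value* only if it contains every n-gram of the value;
    // candidates found this way are then verified, e.g. by a phrase query over
    // consecutive n-grams.
    static boolean candidateMatch(String doc, String value, int n) {
        return ngrams(doc, n).containsAll(ngrams(value, n));
    }

    public static void main(String[] args) {
        System.out.println(candidateMatch("xxsearchvaluexx", "searchvalue", 3)); // true
        System.out.println(candidateMatch("unrelated text", "searchvalue", 3));  // false
    }
}
```

This recovers the term-dictionary lookup cost of a prefix query even for infix patterns, at the price of a larger index - the trade-off discussed in the comment.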
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048
[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534018#comment-17534018 ] Julie Tibshirani commented on LUCENE-10471: --- I also don't have an objection to increasing it a bit. But along the same lines as Robert's point, it'd be good to think about our decision making process -- otherwise we'd be tempted to continuously increase it. I've already heard users requesting 12288 dims (to handle OpenAI DaVinci embeddings). Two possible approaches I could see: 1. We do more research on the literature and decide on a reasonable max dimension. If a user wants to go beyond that, they should reconsider the model or perform dimensionality reduction. This would encourage users to think through their embedding strategy to optimize for performance. The improvements can be significant, since search time scales with vector dimensionality. 2. Or we take a flexible approach where we bump the limit to a high upper bound. This upper bound would be based on how much memory usage is reasonable for one vector (similar to the max term size?) I feel a bit better about approach 2 because I'm not confident I could come up with a statement about a "reasonable max dimension", especially given the fast-moving research. > Increase the number of dims for KNN vectors to 2048 > --- > > Key: LUCENE-10471 > URL: https://issues.apache.org/jira/browse/LUCENE-10471 > Project: Lucene - Core > Issue Type: Wish >Reporter: Mayya Sharipova >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > The current maximum allowed number of dimensions is equal to 1024. But we see > in practice a couple well-known models that produce vectors with > 1024 > dimensions (e.g > [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1] > uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing > max dims to `2048` will satisfy these use cases. 
> I am wondering if anybody has strong objections against this.
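On approach 1's "perform dimensionality reduction": a minimal sketch of one standard technique - Gaussian random projection (Johnson-Lindenstrauss) - in plain Java, with hypothetical class and method names (this is not a Lucene API):

```java
import java.util.Random;

public class RandomProjection {
    // Project a high-dimensional embedding down to outDims dimensions by
    // multiplying with an implicit random Gaussian matrix. The same seed must
    // be reused for every vector in the corpus so distances stay comparable.
    static float[] project(float[] v, int outDims, long seed) {
        Random r = new Random(seed);
        float[] out = new float[outDims];
        for (int i = 0; i < outDims; i++) {
            double sum = 0;
            for (float x : v) {
                sum += x * r.nextGaussian(); // one random matrix entry per (row, column)
            }
            out[i] = (float) (sum / Math.sqrt(outDims)); // scale to roughly preserve norms
        }
        return out;
    }

    public static void main(String[] args) {
        float[] embedding = new float[12288]; // e.g. a DaVinci-sized vector
        float[] reduced = project(embedding, 1024, 42L);
        System.out.println(reduced.length);
    }
}
```

Since search time scales with vector dimensionality, a reduction like this directly shrinks per-query cost, which is the performance argument made above.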
[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534019#comment-17534019 ] Yuting Gan commented on LUCENE-10538: - Yes, thanks [~gsmiller]! I am working on LUCENE-10550 and should have a PR out soon, and then I will re-visit this issue and return real top-n children. > TopN is not being used in getTopChildren() > -- > > Key: LUCENE-10538 > URL: https://issues.apache.org/jira/browse/LUCENE-10538 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > When looking at the overridden implementation getTopChildren(int topN, String > dim, String... path) in RangeFacetCounts, I found that the topN parameter is > not being used in the code, and the unit tests did not test this function > properly. I will create a PR to fix this, and will look into other overridden > implementations and see if they have the same issue. Please let me know if > there is any question. Thanks!
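For reference, the usual way a facet implementation honors topN is a size-bounded min-heap over child counts - sketched here in plain Java with hypothetical names (this is not the actual RangeFacetCounts code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopNChildren {
    // Keep only the topN highest-count children using a bounded min-heap,
    // instead of returning every child the way an unbounded code path does.
    static List<String> topChildren(Map<String, Integer> counts, int topN) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > topN) {
                heap.poll(); // evict the current smallest count
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // highest count first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("a", 5, "b", 1, "c", 3);
        System.out.println(topChildren(counts, 2)); // [a, c]
    }
}
```

The heap keeps memory and sort cost at O(children * log topN) rather than sorting all children, which is why honoring the parameter matters beyond API correctness.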
[GitHub] [lucene] Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts URL: https://github.com/apache/lucene/pull/843 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts
gsmiller commented on PR #777: URL: https://github.com/apache/lucene/pull/777#issuecomment-1121602253 This looks great! Thanks @Yuti-G! It would be nice if we could create a common abstract class to hold some of the common logic between this and the non-concurrent implementation. Seems like a lot of copy/paste going on. This isn't a new problem though, so let's not do that work as part of this PR. What do you think of opening a separate issue to see if we can consolidate some common logic? You probably have a better idea of how feasible this is after working on these changes, so I'm curious what you think. Thanks again for taking this on!
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW
mayya-sharipova commented on code in PR #872: URL: https://github.com/apache/lucene/pull/872#discussion_r868594873 ## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java: ## @@ -53,12 +53,14 @@ public final class Lucene91HnswVectorsWriter extends KnnVectorsWriter { private final int maxDoc; private final int maxConn; + private final int maxConn0; private final int beamWidth; private boolean finished; - Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int beamWidth) + Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int maxConn0, int beamWidth) Review Comment: @jtibshirani Thanks for the comment, this is a great suggestion! Addressed in b1a6394402d8d535b092de73b227fb5a40015c65
[GitHub] [lucene] rmuir commented on pull request #875: LUCENE-10560: Speed up OrdinalMap construction a bit.
rmuir commented on PR #875: URL: https://github.com/apache/lucene/pull/875#issuecomment-1121772167 I kinda feel like in this case we are trying to outsmart the JIT compiler with optimizations it has for `Arrays.equals()`. I understand the idea that we could be smarter based on the data ... maybe ... but I don't think we "know enough". I'd rather us know exactly what that threshold is and write appropriate metadata (e.g. min/maxTermLength) if we can figure it out, better than causing regressions for some use-cases. Bonus if any additional metadata can be utilized by checkindex.
[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress
[ https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534074#comment-17534074 ] Robert Muir commented on LUCENE-10551: -- Thanks [~irislpx] for solving the mystery. Maybe there is some cheaper tweak/workaround we can apply to dodge the GraalVM bug? If you happen to find one, can you please let us know? I'd rather not spam users with logs, but if we can just prevent the problem by re-arranging the code without a performance hit, I think it is compelling? > LowercaseAsciiCompression should return false when it's unable to compress > -- > > Key: LUCENE-10551 > URL: https://issues.apache.org/jira/browse/LUCENE-10551 > Project: Lucene - Core > Issue Type: Bug > Environment: Lucene version 8.11.1 >Reporter: Peixin Li >Priority: Major > Attachments: LUCENE-10551-test.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > {code:java} > Failed to commit.. > java.lang.IllegalStateException: 10 <> 5 > cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion > cloud gen2tion instance - dev1tion instance - > testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o > at > org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947) > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912) > at > 
org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) > at > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) > at > org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) > at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267) > at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) > at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) > at > org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) > at > org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) > at > org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) > at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) > {code} > {code:java} > key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow, > resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, > domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1}) > java.lang.IllegalStateException: 29 <> 16 > 
analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7db9fc5d28fb7b798ca69ae95a47c5fecf46327720df4ff352doke-public-coredns@sha256:86f8cfc74497f04e181ab2e1d26d2fd8bd46c4b33ce24b55620efcdfcb214670oke-public-coredns@sha256:8cd974302f1f6108f6f31312f8181ae723b514e2022089cdcc3db10666c49228oke-public-etcd@sha256:b751e459bc2a8f079f6730dd8462671b253c7c8b0d0eb47c67888d5091c6bb77oke-public-etcd@sha256:d6a76200a6e9103681bc2cf7fefbcada0dd9372d52cf8964178d846b89959d14oke-public-etcd@sha256:f
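The contract the issue title asks for - return false instead of failing when the input is not compressible - can be sketched with a toy bit-packer (hypothetical names; not the real LowercaseAsciiCompression, which is more elaborate):

```java
import java.io.ByteArrayOutputStream;

public class FallibleCompression {
    // Pack lowercase-ASCII bytes at 5 bits each. If any byte falls outside
    // 'a'..'z', report failure so the caller can fall back to writing the
    // raw bytes instead of aborting the whole flush/commit.
    static boolean compress(byte[] in, ByteArrayOutputStream out) {
        int acc = 0, bits = 0;
        for (byte b : in) {
            if (b < 'a' || b > 'z') {
                return false; // not compressible: caller stores the term as-is
            }
            acc = (acc << 5) | (b - 'a');
            bits += 5;
            if (bits >= 8) {
                bits -= 8;
                out.write((acc >>> bits) & 0xFF);
            }
        }
        if (bits > 0) {
            out.write((acc << (8 - bits)) & 0xFF); // flush trailing bits
        }
        return true;
    }
}
```

The key design point is the boolean return: an incompressible term is an expected condition handled by the caller, not an IllegalStateException that fails the commit.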
[GitHub] [lucene] rmuir commented on pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts
rmuir commented on PR #777: URL: https://github.com/apache/lucene/pull/777#issuecomment-1121805275 Just an observation: this is a large amount of code change for a performance improvement that may be in the noise? I'm a bit confused.
[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048
[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534086#comment-17534086 ] Robert Muir commented on LUCENE-10471: -- I think the major problem is still no Vector API in the java APIs. It changes this entire conversation completely when we think about this limit. if openjdk would release this low level vector API, or barring that, maybe some way to MR-JAR for it, or barring that, maybe some intrinsics such as SloppyMath.dotProduct and SloppyMath.matrixMultiply, maybe java wouldn't become the next COBOL. > Increase the number of dims for KNN vectors to 2048 > --- > > Key: LUCENE-10471 > URL: https://issues.apache.org/jira/browse/LUCENE-10471 > Project: Lucene - Core > Issue Type: Wish >Reporter: Mayya Sharipova >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > The current maximum allowed number of dimensions is equal to 1024. But we see > in practice a couple well-known models that produce vectors with > 1024 > dimensions (e.g > [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1] > uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing > max dims to `2048` will satisfy these use cases. > I am wondering if anybody has strong objections against this.
[GitHub] [lucene] mocobeta commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048
mocobeta commented on PR #874: URL: https://github.com/apache/lucene/pull/874#issuecomment-1121814970 I'm curious how common such (to me) large models are in practice, or will be in the near future, in the IR area. I don't have enough expertise to agree or disagree - it's just a general (and maybe naive) question.
[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format
LuXugang commented on code in PR #870: URL: https://github.com/apache/lucene/pull/870#discussion_r868760034 ## lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.lucene92; + +import java.io.IOException; +import org.apache.lucene.codecs.KnnVectorsFormat; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.util.hnsw.HnswGraph; + +/** + * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph + * connecting the documents having values. The graph is used to power HNSW search. 
The format + * consists of three files: + * + * .vec (vector data) file Review Comment: @jtibshirani addressed in https://github.com/apache/lucene/pull/870/commits/d1a26e55a6f65277e875e77fd096aa962988ef49
[GitHub] [lucene] rmuir commented on pull request #832: LUCENE-10532: remove @Slow annotation
rmuir commented on PR #832: URL: https://github.com/apache/lucene/pull/832#issuecomment-1121826968 Thanks @cpoerschke for correcting my dyslexia :)
[GitHub] [lucene] rmuir commented on pull request #832: LUCENE-10532: remove @Slow annotation
rmuir commented on PR #832: URL: https://github.com/apache/lucene/pull/832#issuecomment-1121832123 > I'm fine with this. Reasons for Slow (and other test groups) are various. I use Slow in projects where certain tests are indeed slow by nature - have to unpack the distribution/ fork processes or start networking layers. These are typically integration tests. They run on the CI but they're not mandatory for local developer runs. I don't think this is needed in Lucene either. I like the idea of the test groups actually, it came up on another PR: https://github.com/apache/lucene/pull/633 . I just think lucene would do better with different test groups to better test what we need (e.g. something like the suggested `@Concurrent`). The `@Slow` doesn't help us IMO.
[GitHub] [lucene] rmuir merged pull request #832: LUCENE-10532: remove @Slow annotation
rmuir merged PR #832: URL: https://github.com/apache/lucene/pull/832
[jira] [Commented] (LUCENE-10532) Remove @Slow annotation
[ https://issues.apache.org/jira/browse/LUCENE-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534102#comment-17534102 ] ASF subversion and git services commented on LUCENE-10532: -- Commit 3edfeb5eb224344e35f3454f5d51288ab05452c1 in lucene's branch refs/heads/main from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3edfeb5eb22 ] LUCENE-10532: remove @Slow annotation (#832) Remove `@Slow` annotation, for more consistency with CI and local jobs. All tests can be fast! > Remove @Slow annotation > --- > > Key: LUCENE-10532 > URL: https://issues.apache.org/jira/browse/LUCENE-10532 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > This annotation is useless, people have gotten so lazy about using it, that > now there are proposals to mark tests that are not actually slow, with the > @Slow annotation. > Let's remove the annotation. I can't imagine a situation where we mark a test > @Slow and i don't veto it. we can keep tests clean.
[jira] [Resolved] (LUCENE-10532) Remove @Slow annotation
[ https://issues.apache.org/jira/browse/LUCENE-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-10532. -- Fix Version/s: 9.2 Resolution: Fixed > Remove @Slow annotation > --- > > Key: LUCENE-10532 > URL: https://issues.apache.org/jira/browse/LUCENE-10532 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: 9.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This annotation is useless, people have gotten so lazy about using it, that > now there are proposals to mark tests that are not actually slow, with the > @Slow annotation. > Let's remove the annotation. I can't imagine a situation where we mark a test > @Slow and i don't veto it. we can keep tests clean.
[jira] [Commented] (LUCENE-10532) Remove @Slow annotation
[ https://issues.apache.org/jira/browse/LUCENE-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534112#comment-17534112 ] ASF subversion and git services commented on LUCENE-10532: -- Commit 87df4aa8511efaa6e21cac1d52c259b67ff248df in lucene's branch refs/heads/branch_9x from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=87df4aa8511 ] LUCENE-10532: remove @Slow annotation (#832) Remove `@Slow` annotation, for more consistency with CI and local jobs. All tests can be fast! > Remove @Slow annotation > --- > > Key: LUCENE-10532 > URL: https://issues.apache.org/jira/browse/LUCENE-10532 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: 9.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This annotation is useless, people have gotten so lazy about using it, that > now there are proposals to mark tests that are not actually slow, with the > @Slow annotation. > Let's remove the annotation. I can't imagine a situation where we mark a test > @Slow and i don't veto it. we can keep tests clean.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534116#comment-17534116 ] Robert Muir commented on LUCENE-9356: - The test seems wrong to me; for example, it does not consider CRC-32 collisions. I think we should just remove the test for this reason. > Add tests for corruptions caused by byte flips > -- > > Key: LUCENE-9356 > URL: https://issues.apache.org/jira/browse/LUCENE-9356 > Project: Lucene - Core > Issue Type: Test >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We already have tests verifying that file truncation and modification of the index > headers are caught correctly. I'd like to add another test that flipping a > byte in a way that modifies the checksum of the file is always caught > gracefully by Lucene.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534117#comment-17534117 ] Robert Muir commented on LUCENE-9356: - By the way, if we just want to improve the exception path and test the real corruption path, it is probably more practical for the test to modify the {{checksum}} of the file rather than flip a random byte. You still have the issue of collisions, but maybe it is more straightforward to test it this way? > Add tests for corruptions caused by byte flips > -- > > Key: LUCENE-9356 > URL: https://issues.apache.org/jira/browse/LUCENE-9356 > Project: Lucene - Core > Issue Type: Test >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We already have tests verifying that file truncation and modification of the index > headers are caught correctly. I'd like to add another test that flipping a > byte in a way that modifies the checksum of the file is always caught > gracefully by Lucene.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534118#comment-17534118 ] Robert Muir commented on LUCENE-9356: - Mulling on it more, this seems like the way to go: let's rewrite the test to explicitly change a byte in the expected checksum. Random collisions are no longer an issue - it should fail every time, right? This is easier to maintain and reason about, and it should still flush out any bugs in codecs that aren't doing the right thing: either their checksum is buggy, or their checkIntegrity() method is buggy and not validating checksums. > Add tests for corruptions caused by byte flips > -- > > Key: LUCENE-9356 > URL: https://issues.apache.org/jira/browse/LUCENE-9356 > Project: Lucene - Core > Issue Type: Test >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We already have tests verifying that file truncation and modification of the index > headers are caught correctly. I'd like to add another test that flipping a > byte in a way that modifies the checksum of the file is always caught > gracefully by Lucene.
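The "change a byte in the expected checksum" approach is deterministic because the stored and recomputed values can no longer agree - here is a self-contained sketch using plain CRC-32 over a byte array (hypothetical names; not Lucene's actual CodecUtil footer layout):

```java
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumFlip {
    // Append a big-endian 8-byte CRC-32 footer to the body.
    static byte[] withFooter(byte[] body) {
        CRC32 crc = new CRC32();
        crc.update(body);
        long v = crc.getValue();
        byte[] file = Arrays.copyOf(body, body.length + 8);
        for (int i = 0; i < 8; i++) {
            file[body.length + i] = (byte) (v >>> (56 - 8 * i));
        }
        return file;
    }

    // Recompute the body checksum and compare it against the stored footer.
    static boolean verify(byte[] file) {
        int bodyLen = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLen);
        long stored = 0;
        for (int i = 0; i < 8; i++) {
            stored = (stored << 8) | (file[bodyLen + i] & 0xFFL);
        }
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = withFooter("segment data".getBytes());
        System.out.println(verify(file));  // true
        file[file.length - 1] ^= 1;        // corrupt the stored checksum itself
        System.out.println(verify(file));  // false: mismatch every time, no collision possible
    }
}
```

Corrupting the stored checksum rather than a body byte sidesteps the collision problem entirely: the body's recomputed CRC is unchanged, so the comparison is guaranteed to fail.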
[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format
jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868790940

lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java
(new file; the quoted hunk adds the standard Apache license header, the package declaration and imports, and this class javadoc:)

/**
 * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
 * connecting the documents having values. The graph is used to power HNSW search. The format
 * consists of three files:
 *
 * .vec (vector data) file
 */

Review Comment:
Sorry to make so many small notes, but why is OrdToDoc in its own sublist instead of the top-level list? Also, the note "only in sparse case" applies to both DocIds and OrdToDoc, right? The same comments apply to the javadoc about `.vem` below.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868794312

lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java
(quoting the same hunk and class javadoc as the review comment above)

Review Comment:
Thanks @jtibshirani, you are right. I like these spotless things; addressed in https://github.com/apache/lucene/pull/870/commits/e3ee29f24541d9b7e9b947ca183d665583200dfb
[jira] [Created] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting
Julie Tibshirani created LUCENE-10564:

Summary: SparseFixedBitSet#or doesn't update memory accounting
Key: LUCENE-10564
URL: https://issues.apache.org/jira/browse/LUCENE-10564
Project: Lucene - Core
Issue Type: Bug
Reporter: Julie Tibshirani

While debugging why a cache was using way more memory than expected, one of my colleagues noticed that {{SparseFixedBitSet#or}} doesn't update {{ramBytesUsed}}. Here's a unit test that demonstrates this:

{code:java}
public void testRamBytesUsed() throws IOException {
  BitSet bitSet = new SparseFixedBitSet(1000);
  long initialBytesUsed = bitSet.ramBytesUsed();
  DocIdSetIterator disi = DocIdSetIterator.all(1000);
  bitSet.or(disi);
  assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
}
{code}

It also looks like we don't have any tests for {{SparseFixedBitSet}} memory accounting (unless I've missed them!). It'd be nice to add more coverage there too.
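The failure mode reported above is a common accounting pattern: a bulk operation allocates storage through a code path that bypasses the per-element accounting. A toy sketch of that pattern (the class is hypothetical, not the real SparseFixedBitSet, and the byte estimates are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class RamAccountingDemo {

  /** Toy sparse bit set; NOT the real Lucene SparseFixedBitSet. */
  static class SparseBits {
    private final Map<Integer, long[]> blocks = new HashMap<>();
    private long ramBytesUsed = 16; // rough base-object estimate (made up)

    void set(int bit) {
      long[] block = blocks.get(bit >>> 6);
      if (block == null) {
        block = new long[1];
        blocks.put(bit >>> 6, block);
        ramBytesUsed += 24; // rough per-block cost (made up), tracked here
      }
      block[0] |= 1L << (bit & 63);
    }

    /** Buggy bulk-or: allocates new blocks without touching the accounting. */
    void orBuggy(SparseBits other) {
      for (Map.Entry<Integer, long[]> e : other.blocks.entrySet()) {
        long[] block = blocks.get(e.getKey());
        if (block == null) {
          blocks.put(e.getKey(), e.getValue().clone()); // BUG: ramBytesUsed not updated
        } else {
          block[0] |= e.getValue()[0];
        }
      }
    }

    long ramBytesUsed() { return ramBytesUsed; }
  }

  public static void main(String[] args) {
    SparseBits a = new SparseBits();
    SparseBits b = new SparseBits();
    b.set(100);
    b.set(999);

    long before = a.ramBytesUsed();
    a.orBuggy(b);
    // a now holds two blocks copied from b, yet reports unchanged memory:
    System.out.println(a.ramBytesUsed() == before); // true, i.e. the bug
  }
}
```

A unit test along the lines of the one in the report catches exactly this: assert that a bulk `or` which allocates storage also increases `ramBytesUsed()`.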
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534124#comment-17534124 ]

Robert Muir commented on LUCENE-9356:

I'd disable the compound file format as a first pass on any such test, too. This one tries to take that on, but let's make progress :) especially considering how checksums work in compound files, which is likely the cause of some of the crazy failures on this one.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534126#comment-17534126 ]

Robert Muir commented on LUCENE-9356:

I think the "problem" in the current test is the inherent double-checksumming that happens inside CFE: it defeats the test's guess that a flip "didn't change the checksum". It is definitely too difficult to think about the problem this way, which is why I encourage changing the "expected" value instead.
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534164#comment-17534164 ]

Henrik Hertel commented on LUCENE-10562:

[~tomoko] thanks for the additional tips.

> Large system: Wildcard search leads to full index scan despite filter query
> ---------------------------------------------------------------------------
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 8.11.1
> Reporter: Henrik Hertel
> Priority: Major
> Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million documents. The textual content of large PDF files is indexed there. My query becomes extremely slow (more than 30 seconds) as soon as I use wildcards on both sides, e.g.
> {code:java}
> *searchvalue*
> {code}
> even though I put a filter query in front of it that reduces the result set to fewer than 20 documents.
>
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
>
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
> {code}
>
> I've tried everything imaginable. It doesn't make sense to me why a search over a small subset should take so long. If I omit the filter query metadataitemids_is:20950 and search the entire inventory instead, it takes the same amount of time. Therefore, I suspect that despite the filter query, the main query runs over the entire index.
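The suspicion in the report is consistent with how wildcard queries expand terms: a trailing wildcard like {{searchvalue*}} can seek to the prefix in the sorted term dictionary and walk only the matching range, while a leading wildcard like {{*searchvalue*}} has no usable lower bound and must enumerate every term, independently of how few documents the filter query keeps. A minimal sketch with a {{TreeMap}} standing in for the term dictionary (illustrative only, not the actual Solr/Lucene API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class WildcardScanDemo {
  public static void main(String[] args) {
    // A sorted "term dictionary" with hypothetical terms:
    NavigableMap<String, Integer> terms = new TreeMap<>();
    terms.put("search", 1);
    terms.put("searchvalue", 2);
    terms.put("searchvalues", 3);
    terms.put("zebra", 4);

    // Prefix query "searchvalue*": seek to the prefix, visit only the matching range.
    SortedMap<String, Integer> prefixHits =
        terms.subMap("searchvalue", "searchvalue\uffff");
    System.out.println(prefixHits.keySet()); // [searchvalue, searchvalues]

    // Infix query "*searchvalue*": no lower bound to seek to, so every term
    // in the dictionary must be inspected, however selective the filter is.
    List<String> infixHits = new ArrayList<>();
    for (String t : terms.keySet()) {        // full dictionary scan
      if (t.contains("searchvalue")) {
        infixHits.add(t);
      }
    }
    System.out.println(infixHits); // [searchvalue, searchvalues]
  }
}
```

In a 1TB core of PDF text the term dictionary is enormous, so the enumeration cost dominates regardless of the filter, which matches the observation that dropping the {{fq}} does not change the query time.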