[GitHub] [lucene-solr] jpountz commented on a change in pull request #1912: LUCENE-9535: Try to do larger flushes.
jpountz commented on a change in pull request #1912: URL: https://github.com/apache/lucene-solr/pull/1912#discussion_r526737612

## File path: lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java

@@ -112,19 +110,12 @@ private synchronized DocumentsWriterPerThread newWriter()
   DocumentsWriterPerThread getAndLock() {
     synchronized (this) {
       ensureOpen();
-      // Important that we are LIFO here! This way if number of concurrent indexing threads was once high,
-      // but has now reduced, we only use a limited number of DWPTs. This also guarantees that if we have suddenly
-      // a single thread indexing
-      final Iterator<DocumentsWriterPerThread> descendingIterator = freeList.descendingIterator();
-      while (descendingIterator.hasNext()) {
-        DocumentsWriterPerThread perThread = descendingIterator.next();
-        if (perThread.tryLock()) {
-          descendingIterator.remove();
-          return perThread;
-        }
+      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+      if (dwpt == null) {
+        // DWPT is already locked before return by this method:

Review comment:
> making me think the "allocate a new DWPT" case has something to do with the locking semantics.

Hmm, this is exactly what my understanding is. :) To me the comment was about highlighting that `newWriter()` implicitly takes the lock on the DWPT it creates?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
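The change above replaces the hand-rolled descending iterator with a `freeList.poll(predicate)` that removes and returns the first element the predicate accepts (here, `DocumentsWriterPerThread::tryLock`), returning null so the caller can fall back to `newWriter()`. A minimal, JDK-only sketch of that poll-with-predicate contract (the `FreeList` class and its method names here are illustrative stand-ins, not Lucene's actual free-list implementation):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Illustrative sketch of a free list supporting poll(predicate): removes and
// returns the first element the predicate accepts (e.g. a successful tryLock),
// or null if no element is accepted. Names are hypothetical, not Lucene's.
final class FreeList<T> {
  private final Deque<T> elements = new ArrayDeque<>();

  synchronized void addFirst(T t) {
    elements.addFirst(t); // most recently freed element is tried first (LIFO)
  }

  synchronized T poll(Predicate<T> accept) {
    for (var it = elements.iterator(); it.hasNext(); ) {
      T t = it.next();
      if (accept.test(t)) {
        it.remove(); // element is handed to the caller in "accepted" state
        return t;
      }
    }
    return null; // caller falls back to allocating (and locking) a new DWPT
  }
}
```

The predicate doubles as the lock acquisition, so the element is already locked by the time it is returned, which is the invariant the review comment is discussing.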
[GitHub] [lucene-solr] jpountz commented on pull request #1912: LUCENE-9535: Try to do larger flushes.
jpountz commented on pull request #1912: URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730272727

I'm planning to merge this change to see how it plays with nightly benchmarks, especially now that it moved to a ThreadRipper 3990X. I'll revert if it makes things slower.
[GitHub] [lucene-solr] dweiss commented on pull request #1912: LUCENE-9535: Try to do larger flushes.
dweiss commented on pull request #1912: URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730275557

bq. especially now that it moved to a ThreadRipper 3990X. I'll revert if it makes things slower.

Who's 'it'? :) I've been playing with a TR 3970X and I can cause internal JVM warnings about GC not being able to catch up while all the threads are busy... it's fun to watch.
[GitHub] [lucene-solr] jpountz edited a comment on pull request #1912: LUCENE-9535: Try to do larger flushes.
jpountz edited a comment on pull request #1912: URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730272727

I'm planning to merge this change to see how it plays with nightly benchmarks, especially now that ~~it~~ they moved to a ThreadRipper 3990X. I'll revert if it makes things slower.
[GitHub] [lucene-solr] jpountz commented on pull request #1912: LUCENE-9535: Try to do larger flushes.
jpountz commented on pull request #1912: URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730276978

Whoops, I meant the nightly benchmarks; I edited my above message.
[jira] [Created] (SOLR-15008) Avoid building OrdinalMap for each facet
Radu Gheorghe created SOLR-15008:

Summary: Avoid building OrdinalMap for each facet
Key: SOLR-15008
URL: https://issues.apache.org/jira/browse/SOLR-15008
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Components: Facet Module
Affects Versions: 8.7
Reporter: Radu Gheorghe
Attachments: Screenshot 2020-11-19 at 12.01.55.png

I'm running against the following scenario:
* [JSON] faceting on a high-cardinality field
* few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 4s for ~300 documents and unique values (edited a bit):

{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
    \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", \"unique_id:49866\"]
    \"facet\": {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
},
"facets":{
  "count":333,
  "keywords":{
    "buckets":[{
      "val":"value1",
      "count":124},
...
{code}

I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency better, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired from Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
# Keep the OrdinalMap cached until the next softCommit, so that only the first query takes the penalty
# Allow faceting on actual values (a Map) rather than ordinals, for situations like the one above where we have few matching documents. We could potentially auto-detect this scenario (e.g. by configuring a threshold) and use a Map when there are few documents

I'm curious about what you're thinking:
* would a PR/patch be welcome for either of the two ideas above?
* do you see better options? am I missing something?

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
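Idea 1 above (cache the OrdinalMap until the next softCommit) amounts to keying an expensive per-reader structure on the reader's identity, so it is rebuilt only when a commit opens a new reader. A hedged, Lucene-free sketch of that caching shape (the `PerReaderCache` class, the `builds` counter, and the use of an opaque reader key are all hypothetical stand-ins for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: cache an expensive structure per reader generation.
// A softCommit opens a new reader (a new cache key), so only the first query
// after the commit pays the build cost; later queries reuse the cached value.
final class PerReaderCache<V> {
  private final Map<Object, V> cache = new ConcurrentHashMap<>();
  final AtomicInteger builds = new AtomicInteger(); // instrumentation only

  V get(Object readerKey, Supplier<V> build) {
    return cache.computeIfAbsent(readerKey, k -> {
      builds.incrementAndGet();
      return build.get();
    });
  }
}
```

In a real implementation, entries for closed readers would also need to be evicted so the cache does not pin old segment data.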
[jira] [Commented] (LUCENE-9431) UnifiedHighlighter: Make WEIGHT_MATCHES the default
[ https://issues.apache.org/jira/browse/LUCENE-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235360#comment-17235360 ] Yury Hohin commented on LUCENE-9431:

Hi, I want to help solve this issue. Could you please assign this task to me?

> UnifiedHighlighter: Make WEIGHT_MATCHES the default
> Key: LUCENE-9431
> URL: https://issues.apache.org/jira/browse/LUCENE-9431
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Priority: Blocker
> Fix For: master (9.0)
>
> This mode uses Lucene's modern mechanism of exposing information that previously required complicated highlighting machinery. It's also likely to generally work better out-of-the-box and with custom queries.
[jira] [Commented] (LUCENE-9431) UnifiedHighlighter: Make WEIGHT_MATCHES the default
[ https://issues.apache.org/jira/browse/LUCENE-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235385#comment-17235385 ] Erick Erickson commented on LUCENE-9431:

Yury: The Jira system only allows assigning to committers; telling us that you're working on it is enough. When you're ready, create a pull request (preferred) or attach a patch, whichever you're more comfortable with. Then, assuming all is well, a committer can pick it up and push it to the repo. You may have to nudge us a bit if it languishes... And thanks!
[jira] [Commented] (LUCENE-9614) Implement KNN Query
[ https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235437#comment-17235437 ] Adrien Grand commented on LUCENE-9614:

I wonder if we should use the Query API at all for nearest-neighbor search. Today the Query API assumes that you can figure out whether a document matches in isolation, regardless of other matches in the index/segment. Maybe we should have a new top-level API on IndexSearcher, something like `IndexSearcher#nearestNeighbors(String field, float[] target)`, which we could later expand into `IndexSearcher#nearestNeighbors(String field, float[] target, Query filter)` to add support for filtering?

> Implement KNN Query
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael Sokolov
> Priority: Major
>
> Now we have a vector index format, and one vector indexing/KNN search implementation, but the interface is low-level: you can search across a single segment only. We would like to expose a Query implementation.
> Initially, we want to support a usage where the KnnVectorQuery selects the k-nearest neighbors without regard to any other constraints, and these can then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing vector search, or a re-entrant search process that can yield further results.
> Because of the nature of knn search (all documents having any vector value match), it is more like a ranking than a filtering operation, and it doesn't really make sense to provide an iterator interface that can be merged in the usual way, in docid order, skipping ahead. It's not yet clear how to satisfy a query that is "k nearest neighbors satisfying some arbitrary Query", at least not without realizing a complete bitset for the Query.
> But this is for a later issue; *this* issue is just about performing the knn search in isolation, computing a set of (some given) K nearest neighbors, and providing an iterator over those.
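The issue's point that knn is "more like a ranking than a filtering operation" can be made concrete: every document with a vector value "matches", and the query merely orders them by distance and keeps the top k. This brute-force sketch (Euclidean distance, hypothetical class and method names) shows what a `nearestNeighbors(field, target)` call would conceptually compute; the real implementation would of course use the vector index rather than a linear scan:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Brute-force k-nearest-neighbors over in-memory vectors, for intuition only.
final class BruteForceKnn {
  // Returns the ids (indices) of the k vectors closest to `target`.
  static int[] nearest(float[][] vectors, float[] target, int k) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < vectors.length; i++) {
      ids.add(i); // every document with a vector "matches": knn is a ranking
    }
    ids.sort(Comparator.comparingDouble(i -> squaredDistance(vectors[i], target)));
    return ids.subList(0, Math.min(k, ids.size()))
        .stream().mapToInt(Integer::intValue).toArray();
  }

  static double squaredDistance(float[] a, float[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      d += diff * diff;
    }
    return d;
  }
}
```

Note there is no docid-order iteration to speak of here, which is exactly why a conventional merged-iterator Query implementation fits awkwardly.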
[GitHub] [lucene-solr] msfroh commented on pull request #2088: LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear()
msfroh commented on pull request #2088: URL: https://github.com/apache/lucene-solr/pull/2088#issuecomment-730361695

> I'm suspicious that this is safe to do. What if another thread is calling addDocument at the same time?

As long as `FieldNumbers.clear()` is only called from `IndexWriter.deleteAll()`, my understanding is that the safety is provided by the `try (Closeable finalizer = docWriter.lockAndAbortAll()) {` block, which (I think) guarantees that any concurrent indexing is blocked until the lock is released.
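The safety argument above rests on the try-with-resources shape of `lockAndAbortAll()`: it returns a `Closeable` that holds an exclusive lock for the duration of the block, so concurrent indexing cannot interleave with the reset. A simplified, Lucene-free sketch of that pattern (the `WriterSketch` class and its members are hypothetical; the real locking in IndexWriter is considerably more involved):

```java
import java.io.Closeable;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the lockAndAbortAll() pattern: the returned Closeable holds an
// exclusive lock until the try-with-resources block exits, so operations that
// also take the lock (e.g. assigning field numbers while adding documents)
// cannot run concurrently with the reset.
final class WriterSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private int nextFieldNumber = 0;

  int addField() { // stands in for the indexing path assigning field numbers
    lock.lock();
    try {
      return nextFieldNumber++;
    } finally {
      lock.unlock();
    }
  }

  void deleteAll() {
    try (Closeable finalizer = lockAndAbortAll()) {
      nextFieldNumber = 0; // safe: no concurrent addField can run here
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  private Closeable lockAndAbortAll() {
    lock.lock();
    return lock::unlock; // close() releases the lock
  }
}
```

The invariant being defended in the thread is exactly this: the reset happens only inside the exclusive region, so no thread can observe (or allocate against) a half-reset field-number table.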
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #2088: LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear()
dsmiley commented on a change in pull request #2088: URL: https://github.com/apache/lucene-solr/pull/2088#discussion_r526893867 ## File path: lucene/CHANGES.txt ## @@ -184,6 +184,9 @@ Bug fixes * LUCENE-9365: FuzzyQuery was missing matches when prefix length was equal to the term length (Mark Harwood, Mike Drob) +* LUCENE-9617: Reset lowestUnassignedFieldNumber on FieldNumbers.clear(), to avoid leaking Review comment: Can you rewrite in terms of what a user might understand? e.g. `IndexWriter.deleteAll now resets internal field numbers; prevents ever-increasing numbers in unusual use-cases` The latter part of what you wrote isn't bad but the first part is technical mumbo-jumbo that only Lucene deep divers would even recognize. ## File path: lucene/core/src/test/org/apache/lucene/index/TestFieldInfos.java ## @@ -187,4 +187,23 @@ public void testMergedFieldInfos_singleLeaf() throws IOException { writer.close(); dir.close(); } + + public void testFieldNumbersAutoIncrement() { +FieldInfos.FieldNumbers fieldNumbers = new FieldInfos.FieldNumbers("softDeletes"); +for (int i = 0; i < 10; i++) { + fieldNumbers.addOrGet("field" + i, -1, IndexOptions.NONE, DocValuesType.NONE, + 0, 0, 0, 0, + VectorValues.SearchStrategy.NONE, false); +} +int idx = fieldNumbers.addOrGet("EleventhField", -1, IndexOptions.NONE, DocValuesType.NONE, +0, 0, 0, 0, +VectorValues.SearchStrategy.NONE, false); +assertEquals("Field numbers 0 through 9 were allocated", 10, idx); + +fieldNumbers.clear(); Review comment: My only problem with unit tests like this is that it doesn't test what we _really_ want to know -- that when IW.deleteAll() is called (a user level thing, fieldNumbers.clear() is not), that the field numbers get reset. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] rmuir commented on pull request #2088: LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear()
rmuir commented on pull request #2088: URL: https://github.com/apache/lucene-solr/pull/2088#issuecomment-730387967

> As long as FieldNumbers.clear() is only called from IndexWriter.deleteAll(), my understanding is that the safety is provided by the try (Closeable finalizer = docWriter.lockAndAbortAll()) { block, which (I think) guarantees that any concurrent indexing is blocked until the lock is released.

Thanks, maybe there is a way to improve the testing of `deleteAll` to better enforce this? Lucene is not testing this method much, but I know some users (e.g. Solr) are using it often. My concern is some race condition that ultimately creates segments with unaligned field numbers. This would be a disaster and definitely result in corruption (think of stored-fields merging, etc., which copies binary/compressed data directly).
[jira] [Commented] (SOLR-15008) Avoid building OrdinalMap for each facet
[ https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235549#comment-17235549 ] Michael Gibney commented on SOLR-15008: --- Interesting; I'm surprised that profiling indicated {{OrdinalMap}} building, since I'm pretty sure the {{OrdinalMap}} instances (as accessed via {{FacetFieldProcessorByArrayDV}} are already cached in the way you're suggesting: # in [FacetFieldProcessorByArrayDV.findStartAndEndOrds(...)|https://github.com/apache/lucene-solr/blob/40e2122b5a5b89f446e51692ef0d72e48c7b71e5/solr/core/src/java/org/apache/solr/search/facet/FacetFieldProcessorByArrayDV.java#L60] # in [FieldUtil.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/search/facet/FieldUtil.java#L55] # in [SlowCompositeReaderWrapper.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/c02f07f2d5db5c983c2eedf71febf9516189595d/solr/core/src/java/org/apache/solr/index/SlowCompositeReaderWrapper.java#L197-L211] Do you have more information about the total numbers involved (high-cardinality field -- specifically how high per core? how many documents overall per core? how many cores? does the latency manifest even across a single indexSearcher -- i.e., no intervening updates?). A couple of things that might be worth doing in the meantime, just as a sanity check: # disable refinement for the facet field ({{"refinement":"none"}}) -- among other things, this would take the {{filterCache}} out of the equation # if possible, try optimizing each replica to a single segment, which should take {{OrdinalMap}} out of the equation (this of course strictly diagnostic, not a "workaround" suggestion). 
{quote}Allow faceting on actual values (a Map) rather than ordinals {quote} Interesting -- even if {{OrdinalMap}} is already getting cached (as I think it is?), this would be one way to avoid the overhead of allocating a {{CountSlotArrAcc}} backed by an int array of a size matching the field cardinality (this is why I asked more specifically about the cardinality of the field involved). I'm not sure how big a problem this is in practice, but I imagine a value-Map-based faceting implementation would probably perform better for this type of use case ... not 100% sure though, and not sure how _much_ better ... (I think {{FacetFieldProcessorByHashDV}} was designed to meet a similar use case, but it only works for single-valued fields). > Avoid building OrdinalMap for each facet > > > Key: SOLR-15008 > URL: https://issues.apache.org/jira/browse/SOLR-15008 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Facet Module >Affects Versions: 8.7 >Reporter: Radu Gheorghe >Priority: Major > Labels: performance > Attachments: Screenshot 2020-11-19 at 12.01.55.png > > > I'm running against the following scenario: > * [JSON] faceting on a high cardinality field > * few matching documents => few unique values > Yet the query almost always takes a long time. Here's an example taking > almost 4s for ~300 documents and unique values (edited a bit): > > {code:java} > "QTime":3869, > "params":{ > "json":"{\"query\": \"*:*\", > \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", > \"unique_id:49866\"] > \"facet\": > {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}", > "rows":"0"}}, > > "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[] > }, > "facets":{ > "count":333, > "keywords":{ > "buckets":[{ > "val":"value1", > "count":124}, > ... 
> {code} > I did some [profiling with our Sematext > Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it > points me to OrdinalMap building (see attached screenshot). If I read the > code right, an OrdinalMap is built with every facet. And it's expensive since > there are many unique values in the shard (previously, there were more, smaller > shards, making latency better, but this approach doesn't scale for this > particular use-case). > If I'm right up to this point, I see a couple of potential improvements, > [inspired by > Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]: > # *Keep the OrdinalMap cached until the next softCommit*, so that only the > first query takes the penalty > # *Allow faceting on actual values (a Map) rather than ordinals*, for > situations like the one above where we have few matching documents. We could > potentially auto-detect this scenario (e.g. by configuring a threshold) and > use a M
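Proposal 1 above ("keep the OrdinalMap cached until the next softCommit") amounts to scoping a cache to the current searcher. A minimal sketch, with entirely hypothetical names (nothing here is an existing Solr class; the `builder` callback stands in for the expensive OrdinalMap construction, and the "generation" stands in for a searcher identity that changes on softCommit):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

// Hypothetical model of proposal 1: cache an expensive per-field structure
// (standing in for Lucene's OrdinalMap) keyed by field name, and discard the
// whole cache whenever a new searcher generation appears (i.e. on softCommit).
public class GenerationScopedCache {
  private long cachedGeneration = -1;
  private final Map<String, Object> perField = new ConcurrentHashMap<>();

  public Object get(long searcherGeneration, String field, LongFunction<Object> builder) {
    synchronized (this) {
      if (searcherGeneration != cachedGeneration) {
        perField.clear();                    // new searcher => old ordinal maps are stale
        cachedGeneration = searcherGeneration;
      }
    }
    // only the first query per (generation, field) pays the build cost
    return perField.computeIfAbsent(field, f -> builder.apply(searcherGeneration));
  }
}
```

This mirrors what `SlowCompositeReaderWrapper` already does per reader instance; the open question in the thread is whether the existing cache is actually being hit or rebuilt per request.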
[jira] [Commented] (SOLR-14560) Learning To Rank Interleaving
[ https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235554#comment-17235554 ] ASF subversion and git services commented on SOLR-14560: Commit 85297846419c626585dd26efe70d6eb031a4b3c9 in lucene-solr's branch refs/heads/branch_8x from Christine Poerschke [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8529784 ] SOLR-14560: javadocs link tweak in Solr Ref Guide (branch_8x only) > Learning To Rank Interleaving > - > > Key: SOLR-14560 > URL: https://issues.apache.org/jira/browse/SOLR-14560 > Project: Solr > Issue Type: New Feature > Components: contrib - LTR >Affects Versions: 8.5.2 >Reporter: Alessandro Benedetti >Priority: Minor > Fix For: master (9.0), 8.8 > > Time Spent: 10h 10m > Remaining Estimate: 0h > > Interleaving is an approach to Online Search Quality evaluation that can be > very useful for Learning To Rank models: > [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html|https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html] > Scope of this issue is to introduce the ability to the LTR query parser of > accepting multiple models (2 to start with). > If one model is passed, normal reranking happens. > If two models are passed, reranking happens for both models and the final > reranked list is the interleaved sequence of results coming from the two > models lists. > As a first step it is going to be implemented through: > TeamDraft Interleaving with two models in input. > In the future, we can expand the functionality adding the interleaving > algorithm as a parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235576#comment-17235576 ] Mark Robert Miller commented on SOLR-14788: --- {quote}Mark, I have something of extremely narrow focus that I would like to bring up. {quote} Hey Shawn, sorry, I only recently saw this. So the real deal with the Overseer is that it ended up a disaster. I won't get into it again, but when I agreed we needed a node(s) with these types of capabilities, it was to fulfill a part of the design that I intended to work on reaching. Essentially, I changed jobs as SolrCloud was released and my new job was hdfs and related, not solrcloud design and impl and finishing, etc. Sure, I took time to try and make the thing float, but it was not my work load or direct task(s). So the Overseer just doesn't work at all like I envisioned, it doesn't solve the problem I envisioned, in fact, it made things worse in pretty much every regard compared to what we had. It was intended to be an optimization and coordination point that enhanced the system vs the naive path. That panned out pretty much 0. So when you talk about all these state updates and ZooKeeper queues, and slow restarts, and lost overseers, and scalability and all that, it really hardly applies. We hired the Overseer to be a farmer and instead he was a tractor. Trying to solve for those silly looping threads and crazy number of state updates and blocking/locking/slow behavior has been 100% the wrong approach. Instead, we hire a farmer and this time make sure he is a farmer first. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > > h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back. I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. Which is why once stage one is completed, I will flip The Policeman > to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude commented on pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream
thelabdude commented on pull request #2067: URL: https://github.com/apache/lucene-solr/pull/2067#issuecomment-730490848 Hi @joel-bernstein, I wasn't able to assign you as a reviewer on this, but I would love for you to take a look when convenient.
[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream
thelabdude commented on a change in pull request #2067: URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527028199

## File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java

## @@ -334,88 +330,76 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
-    Map collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    // which is something already supported in other parts of Solr
-
     // check for alias or collection
     List allCollections = new ArrayList<>();
     String[] collectionNames = collectionName.split(",");
+    Aliases aliases = checkAlias ? zkStateReader.getAliases() : null;
+
     for (String col : collectionNames) {
-      List collections = checkAlias
-          ? zkStateReader.getAliases().resolveAliases(col) // if not an alias, returns collectionName
+      List collections = (aliases != null)
+          ? aliases.resolveAliases(col) // if not an alias, returns collectionName
           : Collections.singletonList(collectionName);
       allCollections.addAll(collections);
     }

     // Lookup all actives slices for these collections
     List slices = allCollections.stream()
-        .map(collectionsMap::get)
+        .map(c -> clusterState.getCollectionOrNull(c, true))
         .filter(Objects::nonNull)
         .flatMap(docCol -> Arrays.stream(docCol.getActiveSlicesArr()))
         .collect(Collectors.toList());

     if (!slices.isEmpty()) {
-      return slices.toArray(new Slice[slices.size()]);
-    }
-
-    // Check collection case insensitive
-    for (Entry entry : collectionsMap.entrySet()) {
-      if (entry.getKey().equalsIgnoreCase(collectionName)) {
-        return entry.getValue().getActiveSlicesArr();
-      }
+      return slices.toArray(new Slice[0]);
     }

     throw new IOException("Slices not found for " + collectionName);
   }

   protected void constructStreams() throws IOException {
+    final ModifiableSolrParams mParams = adjustParams(new ModifiableSolrParams(params));
+    mParams.set(DISTRIB, "false"); // We are the aggregator.
     try {
+      final Stream streamOfSolrStream;
+      if (streamContext != null && streamContext.get("shards") != null) {
+        // stream of shard url with core
+        streamOfSolrStream = getShards(this.zkHost, this.collection, this.streamContext, mParams).stream()
+            .map(s -> new SolrStream(s, mParams));
+      } else {
+        // stream of replicas to reuse the same SolrHttpClient per baseUrl
+        // avoids re-parsing data we already have in the replicas
+        streamOfSolrStream = getReplicas(this.zkHost, this.collection, this.streamContext, mParams).stream()

Review comment: Here we're keeping the Replica so we have direct access to its baseUrl and core name instead of parsing those out of the shardUrl.
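The resolve-aliases-then-collect-slices flow in the hunk above can be condensed into a self-contained sketch. Plain `Map`s stand in for `ZkStateReader`/`ClusterState`, and the method names mimic (but are not) the real SolrJ API — the point is resolving aliases once, outside the per-collection loop, then collecting active slices with a single stream:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Self-contained model of the refactored getSlices(): aliases are looked up
// once, each comma-separated name resolves to itself when it is not an alias,
// and null (unknown) collections are filtered out before flattening slices.
public class GetSlicesDemo {
  public static List<String> getSlices(String collectionName,
                                       Map<String, List<String>> aliases,      // alias -> collections
                                       Map<String, List<String>> activeSlices, // collection -> slices
                                       boolean checkAlias) {
    List<String> allCollections = new ArrayList<>();
    for (String col : collectionName.split(",")) {
      List<String> resolved = checkAlias
          ? aliases.getOrDefault(col, Collections.singletonList(col)) // not an alias => itself
          : Collections.singletonList(col);
      allCollections.addAll(resolved);
    }
    List<String> slices = allCollections.stream()
        .map(activeSlices::get)
        .filter(Objects::nonNull)
        .flatMap(List::stream)
        .collect(Collectors.toList());
    if (slices.isEmpty()) {
      throw new IllegalStateException("Slices not found for " + collectionName);
    }
    return slices;
  }
}
```

Hoisting the alias lookup out of the loop avoids one ZooKeeper-state read per collection name, which is the change the review comments below discuss.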
[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream
thelabdude commented on a change in pull request #2067: URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527029201

## File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/SolrStream.java

## @@ -268,8 +275,7 @@ private Map mapFields(Map fields, Map mappings) {
     return fields;
   }

-  // temporary...
-  public TupleStreamParser constructParser(SolrClient server, SolrParams requestParams) throws IOException, SolrServerException {
+  private TupleStreamParser constructParser(SolrParams requestParams) throws IOException, SolrServerException {

Review comment: Didn't seem like this method needed to be public, and we already get a SolrClient in the open method, so no need to pass it. However, this breaks a public method signature, so it is only for Solr 9.x and shouldn't be back-ported to 8.x.
[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream
thelabdude commented on a change in pull request #2067: URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527027234

## File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java

## @@ -334,88 +330,76 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
-    Map collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    // which is something already supported in other parts of Solr
-
     // check for alias or collection
     List allCollections = new ArrayList<>();
     String[] collectionNames = collectionName.split(",");
+    Aliases aliases = checkAlias ? zkStateReader.getAliases() : null;
+
     for (String col : collectionNames) {
-      List collections = checkAlias
-          ? zkStateReader.getAliases().resolveAliases(col) // if not an alias, returns collectionName
+      List collections = (aliases != null)
+          ? aliases.resolveAliases(col) // if not an alias, returns collectionName
          : Collections.singletonList(collectionName);
       allCollections.addAll(collections);
     }

     // Lookup all actives slices for these collections
     List slices = allCollections.stream()
-        .map(collectionsMap::get)
+        .map(c -> clusterState.getCollectionOrNull(c, true))
         .filter(Objects::nonNull)
         .flatMap(docCol -> Arrays.stream(docCol.getActiveSlicesArr()))
         .collect(Collectors.toList());

     if (!slices.isEmpty()) {
-      return slices.toArray(new Slice[slices.size()]);
-    }
-
-    // Check collection case insensitive
-    for (Entry entry : collectionsMap.entrySet()) {

Review comment: I removed this b/c I don't think we should try to accommodate improperly cased collection names. No tests broke, but let me know if we need this for some reason I don't understand.
[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream
thelabdude commented on a change in pull request #2067: URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527027557

## File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java

## @@ -334,11 +334,6 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
-    Map collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    // which is something already supported in other parts of Solr
-
     // check for alias or collection

Review comment: Moved the call to getAliases out of the for loop.
[jira] [Commented] (SOLR-14035) remove deprecated preferLocalShards references
[ https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235629#comment-17235629 ] Christine Poerschke commented on SOLR-14035: Hello [~Alexey Bulygin], welcome! Thank you for the attached patch, it looks good to me and I'll proceed to commit it to the repo shortly. > remove deprecated preferLocalShards references > -- > > Key: SOLR-14035 > URL: https://issues.apache.org/jira/browse/SOLR-14035 > Project: Solr > Issue Type: Task >Reporter: Christine Poerschke >Priority: Blocker > Fix For: master (9.0) > > Attachments: SOLR-14035.patch > > > {{preferLocalShards}} support was added under SOLR-6832 in version 5.1 > (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.1.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L223-L226) > and deprecated under SOLR-11982 in version 7.4 > (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.4.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L265-L269) > This ticket is to fully remove {{preferLocalShards}} references in code, > tests and documentation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-14035) remove deprecated preferLocalShards references
[ https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christine Poerschke reassigned SOLR-14035: -- Assignee: Christine Poerschke
[jira] [Commented] (SOLR-14035) remove deprecated preferLocalShards references
[ https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235638#comment-17235638 ] ASF subversion and git services commented on SOLR-14035: Commit c4d4767bca196ad358b72156889effd27fdfcc9b in lucene-solr's branch refs/heads/master from Christine Poerschke [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4d4767 ] SOLR-14035: Remove deprecated preferLocalShards=true support in favour of the shards.preference=replica.location:local alternative. (Alex Bulygin via Christine Poerschke)
[jira] [Resolved] (SOLR-14035) remove deprecated preferLocalShards references
[ https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christine Poerschke resolved SOLR-14035. Resolution: Fixed
[jira] [Commented] (LUCENE-9616) Improve test coverage for internal format versions
[ https://issues.apache.org/jira/browse/LUCENE-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235643#comment-17235643 ] Adrien Grand commented on LUCENE-9616: -- I had not thought about naming the class Lucene88DocValuesConsumer. I like the simpler naming scheme but worry that it might be confusing to make the name of the consumer diverge from the name of the format (the string that is passed to "super" in the DocValuesFormat constructor). And absolutely, these should be in backward codecs. For this particular change I'm even considering creating a new Lucene88DocValuesFormat given that it changes the file format quite significantly. > Improve test coverage for internal format versions > -- > > Key: LUCENE-9616 > URL: https://issues.apache.org/jira/browse/LUCENE-9616 > Project: Lucene - Core > Issue Type: Test >Reporter: Julie Tibshirani >Priority: Minor > > Some formats use an internal versioning system -- for example > {{CompressingStoredFieldsFormat}} maintains older logic for reading an > on-heap fields index. Because we always allow reading segments from the > current + previous major version, some users still rely on the read-side > logic of older internal versions. > Although the older version logic is covered by > {{TestBackwardsCompatibility}}, it looks like it's not exercised in unit > tests. Older versions aren't "in rotation" when choosing a random codec for > tests. They also don't have dedicated unit tests as we have for separate > older formats, for example {{TestLucene60PointsFormat}}. > It could be good to improve unit test coverage for the older versions, since > they're in active use. A downside is that it's not straightforward to add > unit tests, since we tend to just change/delete the old write-side logic as > we bump internal versions.
[jira] [Created] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
Haoyu Zhai created LUCENE-9618: -- Summary: Improve IntervalIterator.nextInterval's behavior/documentation/test Key: LUCENE-9618 URL: https://issues.apache.org/jira/browse/LUCENE-9618 Project: Lucene - Core Issue Type: Improvement Components: modules/query Reporter: Haoyu Zhai I'm trying to play around with my own {{IntervalSource}} and found that the {{nextInterval}} method of IntervalIterator will sometimes be called even after the {{nextDoc}}/{{docID}}/{{advance}} method returns NO_MORE_DOCS. After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is calling an inner iterator's {{nextInterval}} regardless of the result of {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}} implementations do consider the case where {{nextInterval}} is called after {{nextDoc}} returns NO_MORE_DOCS. We should probably update the javadoc and tests if the behavior is necessary, or change the current implementation to avoid this behavior. Original email discussion thread: https://markmail.org/message/7itbwk6ts3bo3gdh
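The contract problem described above can be reduced to a tiny, self-contained model (the interface here is a simplified stand-in, not Lucene's real `IntervalIterator`): a checking wrapper that makes a `nextInterval()` call after exhaustion fail loudly, which is one way a unit test could enforce whichever behavior the javadoc ends up documenting.

```java
// Minimal model of the LUCENE-9618 contract question: an iterator that, like
// Lucene's IntervalIterator, is arguably undefined after exhaustion. The
// wrapper rejects nextInterval() after NO_MORE_DOCS so bugs like the one
// described in FilteringIntervalIterator.reset would surface in tests.
public class ExhaustionCheckDemo {
  public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  interface MiniIntervalIterator {
    int nextDoc();      // NO_MORE_DOCS when exhausted
    int nextInterval(); // undefined once nextDoc has returned NO_MORE_DOCS
  }

  /** Wraps an iterator and fails fast on nextInterval() after exhaustion. */
  static MiniIntervalIterator checking(MiniIntervalIterator in) {
    return new MiniIntervalIterator() {
      boolean exhausted = false;

      public int nextDoc() {
        int doc = in.nextDoc();
        exhausted = (doc == NO_MORE_DOCS);
        return doc;
      }

      public int nextInterval() {
        if (exhausted) {
          throw new IllegalStateException("nextInterval() after NO_MORE_DOCS");
        }
        return in.nextInterval();
      }
    };
  }

  /** True iff the wrapper rejects nextInterval() once the iterator is exhausted. */
  public static boolean callAfterExhaustionFails() {
    MiniIntervalIterator it = checking(new MiniIntervalIterator() {
      public int nextDoc() { return NO_MORE_DOCS; } // already exhausted
      public int nextInterval() { return 0; }
    });
    it.nextDoc();
    try {
      it.nextInterval();
      return false; // should not be reached
    } catch (IllegalStateException expected) {
      return true;
    }
  }
}
```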
[GitHub] [lucene-solr] zhaih opened a new pull request #2090: LUCENE-9618: demo unit test
zhaih opened a new pull request #2090: URL: https://github.com/apache/lucene-solr/pull/2090 # Description This PR is not intended to be merged. It's just for demonstration of issues mentioned in [LUCENE-9618](https://issues.apache.org/jira/browse/LUCENE-9618) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235652#comment-17235652 ] Haoyu Zhai commented on LUCENE-9618: I created a [PR|https://github.com/apache/lucene-solr/pull/2090] with a simple test case to demonstrate the issue mentioned.
[jira] [Commented] (SOLR-13671) Remove check for bare "var" declarations in validate-source-patterns in before releasing Solr 9.0
[ https://issues.apache.org/jira/browse/SOLR-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235656#comment-17235656 ] Christine Poerschke commented on SOLR-13671: bq. ... lucene/tools/src/groovy/check-source-patterns.groovy ... I see the file got renamed in the https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=def82ab commit on {{master}} branch but remains present on {{branch_8x}} branch. [~erickerickson], do you perhaps recall if the changes in this JIRA task here were intended for {{master}} branch or {{branch_8x}} or both? Usually -- https://cwiki.apache.org/confluence/display/SOLR/HowToContribute -- our development is on master branch and then gets backported, but perhaps this scenario here is different (I haven't looked yet at the discussion in the linked JIRA) and I note that [~Alexey Bulygin]'s patch can be {{cd lucene ; git apply}} applied to branch_8x, hence asking. Hope that helps. > Remove check for bare "var" declarations in validate-source-patterns in > before releasing Solr 9.0 > - > > Key: SOLR-13671 > URL: https://issues.apache.org/jira/browse/SOLR-13671 > Project: Solr > Issue Type: Improvement >Reporter: Erick Erickson >Priority: Blocker > Fix For: master (9.0) > > Attachments: SOLR-13671.patch > > > See the discussion in the linked JIRA. > Remove the line: > (~$/\n\s*var\s+/$) : 'var is not allowed in until we stop development on the > 8x code line' > in > invalidJavaOnlyPatterns > from lucene/tools/src/groovy/check-source-patterns.groovy -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235675#comment-17235675 ] Michael McCandless commented on LUCENE-9618: Hmm it is weird that these queries call {{nextInterval}} even after {{nextDoc}} returned {{NO_MORE_DOCS}}? Normally for Lucene DISI iterators, once {{NO_MORE_DOCS}} is returned, the iterator is done (in an undefined state) and you cannot call further methods on it.
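The DISI contract Michael describes — once `nextDoc()` returns `NO_MORE_DOCS` the iterator must not be touched again — corresponds to the canonical consumption loop, sketched here with a toy array-backed iterator (an illustrative stand-in, not a real Lucene DocIdSetIterator):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the usual Lucene-style DISI consumption pattern: iterate until
// NO_MORE_DOCS and never call the iterator again afterwards.
public class DisiLoopDemo {
  public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Toy doc-id iterator backed by a sorted int array. */
  public static class IntArrayDisi {
    private final int[] docs;
    private int i = -1;

    public IntArrayDisi(int... docs) { this.docs = docs; }

    public int nextDoc() {
      return ++i < docs.length ? docs[i] : NO_MORE_DOCS;
    }
  }

  public static List<Integer> collect(IntArrayDisi it) {
    List<Integer> seen = new ArrayList<>();
    // canonical loop: stop at NO_MORE_DOCS; the iterator is then in an
    // undefined state and must not be used again
    for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
      seen.add(doc);
    }
    return seen;
  }
}
```

Under this contract, a caller like `FilteringIntervalIterator.reset` that invokes `nextInterval()` after the loop has terminated is relying on undefined behavior, which is exactly what the issue proposes to either document or forbid.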
[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235677#comment-17235677 ] Michael McCandless commented on LUCENE-9618: And thank you [~zhai7631] for the PR showing the issue!
[jira] [Updated] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-9618: --- Description: updated only the discussion-thread link, from https://markmail.org/message/7itbwk6ts3bo3gdh to the full thread https://markmail.org/thread/aytal77bgzl2zafm; the rest of the description is unchanged.
[jira] [Commented] (LUCENE-9617) FieldNumbers.clear() should reset lowestUnassignedFieldNumber
[ https://issues.apache.org/jira/browse/LUCENE-9617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235680#comment-17235680 ] Michael McCandless commented on LUCENE-9617: Whoa, good catch [~msfroh]! I'll try to review your PR, thank you. > FieldNumbers.clear() should reset lowestUnassignedFieldNumber > - > > Key: LUCENE-9617 > URL: https://issues.apache.org/jira/browse/LUCENE-9617 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 8.7 >Reporter: Michael Froh >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > A call to IndexWriter.deleteAll() should completely reset the state of the > index. Part of that is a call to globalFieldNumbersMap.clear(), which purges > all knowledge of fields by clearing the name -> number and number -> name maps. > However, it does not reset lowestUnassignedFieldNumber. > If we have a loop that adds some documents, calls deleteAll(), adds documents, > etc., lowestUnassignedFieldNumber keeps counting up. Since FieldInfos > allocates an array for number -> FieldInfo, this array will get larger and > larger, effectively leaking memory. > We can fix this by resetting lowestUnassignedFieldNumber to -1 in > FieldNumbers.clear(). > I'll write a unit test and attach a patch.
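The leak pattern in this report can be sketched generically. The class below mirrors the report's field names but is illustrative only, not Lucene's actual FieldNumbers: a registry that clears its maps on clear() but forgets to reset its counter keeps handing out ever-larger numbers, so any array indexed by those numbers grows without bound across clear() calls.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldNumbersSketch {
  private final Map<String, Integer> nameToNumber = new HashMap<>();
  private int lowestUnassignedFieldNumber = -1;
  private final boolean resetCounterOnClear;

  FieldNumbersSketch(boolean resetCounterOnClear) {
    this.resetCounterOnClear = resetCounterOnClear;
  }

  // Assign the next number to an unseen field name, or return the known one.
  int addOrGet(String name) {
    return nameToNumber.computeIfAbsent(name, n -> ++lowestUnassignedFieldNumber);
  }

  void clear() {
    nameToNumber.clear();
    if (resetCounterOnClear) {
      lowestUnassignedFieldNumber = -1;  // the proposed fix
    }
  }

  public static void main(String[] args) {
    FieldNumbersSketch buggy = new FieldNumbersSketch(false);
    FieldNumbersSketch fixed = new FieldNumbersSketch(true);
    for (int i = 0; i < 3; i++) {
      buggy.addOrGet("title");
      fixed.addOrGet("title");
      buggy.clear();
      fixed.clear();
    }
    // After three add/clear cycles the buggy version keeps counting up,
    // while the fixed version reuses field number 0 each time.
    System.out.println(buggy.addOrGet("title") + " " + fixed.addOrGet("title"));
  }
}
```

Running this prints `3 0`: the buggy registry is on its fourth field number for the same single field, which is exactly the growth the report describes for the number -> FieldInfo array.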
[jira] [Commented] (SOLR-6733) Umbrella issue - Solr as a standalone application
[ https://issues.apache.org/jira/browse/SOLR-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235711#comment-17235711 ] Houston Putman commented on SOLR-6733: -- Is there still interest in this idea? If so, I'd volunteer to help take it forward. > Umbrella issue - Solr as a standalone application > - > > Key: SOLR-6733 > URL: https://issues.apache.org/jira/browse/SOLR-6733 > Project: Solr > Issue Type: New Feature >Reporter: Shawn Heisey >Priority: Major > > Umbrella issue. > Solr should be a standalone application, where the main method is provided by > Solr source code. > Here are the major tasks I envision, if we choose to embed Jetty: > * Create org.apache.solr.start.Main (and possibly other classes in the same > package), to be placed in solr-start.jar. The Main class will contain the > main method that starts the embedded Jetty and Solr. I do not know how to > adjust the build system to do this successfully. > * Handle central configurations in code -- TCP port, SSL, and things like > web.xml. > * For each of these steps, clean up any test fallout. > * Handle cloud-related configurations in code -- port, hostname, protocol, > etc. Use the same information as the central configurations. > * Consider whether things like authentication need changes. > * Handle any remaining container configurations. > I am currently imagining this work happening in a new branch and ultimately > being applied only to master, not the stable branch.
[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235738#comment-17235738 ] Alan Woodward commented on LUCENE-9618: --- Thank you for opening this [~zhai7631]. As you said on the mailing list, I misunderstood what you were saying. Calling `nextInterval()` after `nextDoc()` has returned NO_MORE_DOCS is definitely an error and we should fix that in FilteringIntervalIterator.
[jira] [Commented] (SOLR-6733) Umbrella issue - Solr as a standalone application
[ https://issues.apache.org/jira/browse/SOLR-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235777#comment-17235777 ] David Smiley commented on SOLR-6733: In my mind, this is a controversial topic. We'd give up easy configuration of CORS and the like, and I'm dubious about what benefit we'd gain. A strength of Solr is customizability. One might argue too much? But it's a trade-off that distinguishes Solr, giving it differentiating advantages.
[jira] [Commented] (SOLR-6733) Umbrella issue - Solr as a standalone application
[ https://issues.apache.org/jira/browse/SOLR-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235779#comment-17235779 ] David Smiley commented on SOLR-6733: For example, my colleagues added some specialized mTLS stuff without having to hack Solr. That was possible because of Jetty's configurability, which we leave exposed.
[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235783#comment-17235783 ] Ilan Ginzburg commented on SOLR-14788: -- I believe with a central {{/clusterstate.json}}, having a central server batch updates made sense. Each separate node (or thread) trying to do its own direct update to a shared (ZooKeeper) file likely creates too much contention. I believe that in its original context, Overseer does make sense. Now that state is distributed per collection ({{/state.json}}), it's possible to start thinking about distributing these updates without the help of a central server. The high number of replica state updates when a node goes up or down (which I suspect is the main reason the Overseer cluster state change ZooKeeper queue saturates) can likely be greatly reduced by considering a replica as down if its node is down, regardless of the state the replica last broadcast. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > > h3. > [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back. I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. 
Which is why once stage one is completed, I will flip The Policeman > to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues.
[jira] [Commented] (LUCENE-9616) Improve test coverage for internal format versions
[ https://issues.apache.org/jira/browse/LUCENE-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235801#comment-17235801 ] Robert Muir commented on LUCENE-9616: - +1 to more aggressively copy-on-write the format classes when changing the underlying file format, and to try to only use internal versioning for truly minor/bugfix changes. I think internal versions only had that use-case in mind, and old versions should not all be tested this way, because they are buggy. It should be possible to fix some bad bugs in the codec (in a backwards compatible way), yet not be annoyed by backwards tests for the rest of a major release. > Improve test coverage for internal format versions > -- > > Key: LUCENE-9616 > URL: https://issues.apache.org/jira/browse/LUCENE-9616 > Project: Lucene - Core > Issue Type: Test >Reporter: Julie Tibshirani >Priority: Minor > > Some formats use an internal versioning system -- for example > {{CompressingStoredFieldsFormat}} maintains older logic for reading an > on-heap fields index. Because we always allow reading segments from the > current + previous major version, some users still rely on the read-side > logic of older internal versions. > Although the older version logic is covered by > {{TestBackwardsCompatibility}}, it looks like it's not exercised in unit > tests. Older versions aren't "in rotation" when choosing a random codec for > tests. They also don't have dedicated unit tests as we have for separate > older formats, for example {{TestLucene60PointsFormat}}. > It could be good to improve unit test coverage for the older versions, since > they're in active use. A downside is that it's not straightforward to add > unit tests, since we tend to just change/ delete the old write-side logic as > we bump internal versions.
[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235852#comment-17235852 ] Mark Robert Miller commented on SOLR-14788: --- My initial implementation only really focused on a single collection - even that was far, far from completed. Now I was not involved in the Overseer implementation, but it was not introduced to batch updates, to one state.json or several - at least nothing like what it was doing. If that was even the drive for it (it wasn't, from my memory and knowledge), it would have been very silly to try and handle our absurd state.json update load via an Overseer node before making all the other nodes try and behave even remotely sanely. That Overseer gained nothing by batching between collections - that was like adding a bucket of water to a fire truck that is out of water. That is mostly what has happened, unfortunately - bandaids and workarounds. I never implemented the SolrCloud Yonik and I worked out. I started it. We had the design. I put a foot in that direction. Since then, things have mostly gone down that foot hole instead of forward. Likely, as was the case for me, for many, it was not their job to finish implementing SolrCloud; it was a huge task, few understood what the actual design was, and you could do quite well riding on what was there for little effort vs a lot of effort and who knows where you end up. The Overseer as implemented was not in line with the design. This is an event driven design. A light weight, low cost, simple design. Building it on an existing and non Cloud oriented design made it very difficult to decipher what the plan actually was, or even how/if you could get there on these building blocks while keeping them stable and active in non cloud mode, etc. So when I talk about the benefits the Overseer type nodes can bring, they hardly apply to master. It's a common problem I've run into. I'll talk about how slow something is, or how much better things can be if we do X, and someone might take a little look and come back with, meh, didn't seem like what you were saying to me. And often, there are so many layers that you can't see much benefit, or any, when you play around with some isolated change in the current world. 10 other things will eat you first. Anyway, the system started by distributing updates without the help of a central server :) The Overseer was not created to deal with clusterstate.json, because we did not have state.json, that would be crazy :) It literally serves no practical purpose at this point, other than a huge amount of problems and slowness and bad behavior. Now, I'm excited for any competition on what direction to go here. Don't take any of this negatively. If your CAS system can run the gauntlet, I'll congratulate you and be thankful. But your responses and the details in the remove-the-Overseer issue seem (as is common enough) overly caught up in the current nonsensical SolrCloud world. I wish you the best of luck making this system and what it does and supports hum without a central server(s). It was what I tried to keep in the design at the start. But it loses when you run the mind simulations and ignore the current SolrCloud baggage, and it almost certainly loses when you implement it. You will have to shoot for the moon though, not the current Overseer implementation, because my challenger is almost to the ring and is in a different weight class / league / world than what you have evaluated in 8x/master.
[GitHub] [lucene-solr] jtibshirani merged pull request #2084: LUCENE-9592: Loosen equality checks in TestVectorUtil.
jtibshirani merged pull request #2084: URL: https://github.com/apache/lucene-solr/pull/2084 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9592) TestVectorUtil can fail with assertion error
[ https://issues.apache.org/jira/browse/LUCENE-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235858#comment-17235858 ] ASF subversion and git services commented on LUCENE-9592: - Commit 8c7b709c08662d396bd12b1e352db99bb489a7da in lucene-solr's branch refs/heads/master from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8c7b709 ] LUCENE-9592: Loosen equality checks in TestVectorUtil. (#2084) TestVectorUtil occasionally fails because of floating point errors. This change slightly increases the epsilon in equality checks -- testing shows that this will greatly decrease the chance of failure. > TestVectorUtil can fail with assertion error > > > Key: LUCENE-9592 > URL: https://issues.apache.org/jira/browse/LUCENE-9592 > Project: Lucene - Core > Issue Type: Test >Reporter: Julie Tibshirani >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > Example failure: > {code:java} > java.lang.AssertionError: expected:<35.699527740478516> but > was:<35.69953918457031>java.lang.AssertionError: > expected:<35.699527740478516> but was:<35.69953918457031> at > __randomizedtesting.SeedInfo.seed([305701410F76FAD0:4797D77886281D68]:0) at > org.junit.Assert.fail(Assert.java:89) at > org.junit.Assert.failNotEquals(Assert.java:835) at > org.junit.Assert.assertEquals(Assert.java:555) at > org.junit.Assert.assertEquals(Assert.java:685) at > org.apache.lucene.util.TestVectorUtil.testSelfDotProduct(TestVectorUtil.java:28) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:567){code} > Reproduce line: > {code:java} > gradlew test --tests TestVectorUtil.testSelfDotProduct > -Dtests.seed=305701410F76FAD0 
-Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ar-AE -Dtests.timezone=SystemV/MST7 -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > Perhaps the vector utility methods should work with doubles instead of floats > to avoid loss of precision.
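The failure mode behind this fix can be reproduced outside Lucene: accumulating a dot product in 32-bit floats drifts away from a double-precision reference, so tests need an epsilon rather than exact equality. A minimal sketch (illustrative, not Lucene's actual VectorUtil or its chosen epsilon):

```java
import java.util.Random;

public class FloatDotSketch {
  // Dot product accumulated in float, as a float[]-based implementation would.
  static float dotFloat(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // The same dot product accumulated in double, used as a reference value.
  static double dotDouble(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (double) a[i] * b[i];
    return sum;
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    float[] v = new float[1024];
    for (int i = 0; i < v.length; i++) v[i] = r.nextFloat();

    double reference = dotDouble(v, v);
    double error = Math.abs(dotFloat(v, v) - reference);

    // Exact equality between the two results would be fragile, because the
    // float accumulation rounds at every step; a small relative tolerance
    // absorbs that drift. The 1e-3 factor here is illustrative.
    double epsilon = 1e-3 * Math.abs(reference);
    System.out.println(error <= epsilon ? "within tolerance" : "outside tolerance");
  }
}
```

The accumulated rounding error grows with vector length, which is why a fixed tiny epsilon can still fail occasionally under random test vectors.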
[jira] [Resolved] (LUCENE-9592) TestVectorUtil can fail with assertion error
[ https://issues.apache.org/jira/browse/LUCENE-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-9592. -- Resolution: Fixed
[GitHub] [lucene-solr] zacharymorn commented on a change in pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory
zacharymorn commented on a change in pull request #2052: URL: https://github.com/apache/lucene-solr/pull/2052#discussion_r527373046 ## File path: lucene/misc/src/test/org/apache/lucene/misc/store/TestDirectIODirectory.java ## @@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import com.carrotsearch.randomizedtesting.LifecycleScope;
+import com.carrotsearch.randomizedtesting.RandomizedTest;
+import org.apache.lucene.store.*;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+
+import static org.apache.lucene.misc.store.DirectIODirectory.DEFAULT_MIN_BYTES_DIRECT;
+
+public class TestDirectIODirectory extends BaseDirectoryTestCase {
+  public void testWriteReadWithDirectIO() throws IOException {
+    try (Directory dir = getDirectory(RandomizedTest.newTempDir(LifecycleScope.TEST))) {
+      final long blockSize = Files.getFileStore(createTempFile()).getBlockSize();
+      final long minBytesDirect = Double.valueOf(Math.ceil(DEFAULT_MIN_BYTES_DIRECT / blockSize)).longValue() * blockSize;
+      // Need to worry about overflows here?
+      final int writtenByteLength = Math.toIntExact(minBytesDirect);
+
+      MergeInfo mergeInfo = new MergeInfo(1000, Integer.MAX_VALUE, true, 1);
+      final IOContext context = new IOContext(mergeInfo);
+
+      IndexOutput indexOutput = dir.createOutput("test", context);
+      indexOutput.writeBytes(new byte[writtenByteLength], 0, writtenByteLength);
+      IndexInput indexInput = dir.openInput("test", context);
+
+      assertEquals("The length of bytes read should equal to written", writtenByteLength, indexInput.length());
+
+      indexOutput.close();
+      indexInput.close();
+    }
+  }
+
+  @Override
+  protected Directory getDirectory(Path path) throws IOException {
+    Directory delegate = FSDirectory.open(path);
Review comment: I've figured it out. Looks like more methods in `DirectIODirectory` need to be delegated. Could you please take a look at the latest commit, and let me know if it looks good?
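The test above rounds the minimum direct-IO size up to a multiple of the filesystem block size via a Double round-trip (Math.ceil on a long division, which also truncates before the ceil). The same round-up can be done in pure integer arithmetic; a small sketch, with a helper name (alignUp) of my own:

```java
public class AlignSketch {
  // Round size up to the nearest multiple of blockSize (both non-negative,
  // blockSize > 0): equivalent to ceil(size / blockSize) * blockSize, but
  // without any floating-point conversion or truncation pitfalls.
  static long alignUp(long size, long blockSize) {
    return ((size + blockSize - 1) / blockSize) * blockSize;
  }

  public static void main(String[] args) {
    System.out.println(alignUp(1000, 512));  // rounds up to the next multiple
    System.out.println(alignUp(1024, 512));  // already aligned, unchanged
    System.out.println(alignUp(1, 4096));    // tiny sizes round up to one block
  }
}
```

Note that the integer form overflows only when size + blockSize - 1 exceeds Long.MAX_VALUE, which answers the "need to worry about overflows here?" comment for any realistic buffer size.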
[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235952#comment-17235952 ] Mark Robert Miller commented on SOLR-14788: --- And let me just say again, I don’t mean to offend you in anything in there. It looks to me like you came in and looked things over and also basically said “this overseer has no practical benefit, let’s rip it.” That’s intelligent, that’s outside agitation, +1. Our move from CAS to the Overseer was a huge loss in the position we were in, introducing an unnecessary layer completely for unrealized future pipe dreams. If you come in and look at that thing and say WTF, my hats off to you. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > > h3. > [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back. I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. Which is why once stage one is completed, I will flip The Policeman > to off duty.
When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-14788) Solr: The Next Big Thing
[ https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235952#comment-17235952 ] Mark Robert Miller edited comment on SOLR-14788 at 11/20/20, 7:26 AM: -- And let me just say again, I don’t mean to offend you in anything in there. It looks to me like you came in and looked things over and also basically said “this overseer has no practical benefit, let’s rip it.” That’s intelligent, that’s outside agitation, +1. Our move from CAS to the Overseer was a huge loss in the position we were in, introducing an unnecessary layer completely for unrealized future pipe dreams. If you come in and look at that thing and say WTF, my hats off to you. was (Author: markrmiller): And let me just say again, I don’t mean to offend me n anything in there. It’s looks to me like you came in and looked things over and also basically said “this overseer has no practical benefit, let’s rip it.” That’s intelligent, that’s outside agitation, +1. Our move from CAS to the Overseer was a huge loss in the position we were, given introducing an unnecessary layer completely for unrealized future pipe dreams. If you come in and look at that thing and say WTF, my hats off to you. > Solr: The Next Big Thing > > > Key: SOLR-14788 > URL: https://issues.apache.org/jira/browse/SOLR-14788 > Project: Solr > Issue Type: Task >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Critical > > h3. > [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The > Policeman is on duty!*{color} > {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and > have some fun. Try to make some progress. Don't stress too much about the > impact of your changes or maintaining stability and performance and > correctness so much. Until the end of phase 1, I've got your back.
I have a > variety of tools and contraptions I have been building over the years and I > will continue training them on this branch. I will review your changes and > peer out across the land and course correct where needed. As Mike D will be > thinking, "Sounds like a bottleneck Mark." And indeed it will be to some > extent. Which is why once stage one is completed, I will flip The Policeman > to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} > *down for some vigilante justice, but I won't be walking the beat, all that > stuff about sit back and relax goes out the window.*{color}_ > {quote} > > I have stolen this title from Ishan or Noble and Ishan. > This issue is meant to capture the work of a small team that is forming to > push Solr and SolrCloud to the next phase. > I have kicked off the work with an effort to create a very fast and solid > base. That work is not 100% done, but it's ready to join the fight. > Tim Potter has started giving me a tremendous hand in finishing up. Ishan and > Noble have already contributed support and testing and have plans for > additional work to shore up some of our current shortcomings. > Others have expressed an interest in helping and hopefully they will pop up > here as well. > Let's organize and discuss our efforts here and in various sub issues.
[jira] [Updated] (SOLR-15008) Avoid building OrdinalMap for each facet
[ https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radu Gheorghe updated SOLR-15008: - Attachment: writes_commits.png > Avoid building OrdinalMap for each facet > > > Key: SOLR-15008 > URL: https://issues.apache.org/jira/browse/SOLR-15008 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Facet Module >Affects Versions: 8.7 >Reporter: Radu Gheorghe >Priority: Major > Labels: performance > Attachments: Screenshot 2020-11-19 at 12.01.55.png, writes_commits.png > > > I'm running against the following scenario: > * [JSON] faceting on a high cardinality field > * few matching documents => few unique values > Yet the query almost always takes a long time. Here's an example taking > almost 4s for ~300 documents and unique values (edited a bit): > > {code:java} > "QTime":3869, > "params":{ > "json":"{\"query\": \"*:*\", > \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", > \"unique_id:49866\"] > \"facet\": > {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}", > "rows":"0"}}, > > "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[] > }, > "facets":{ > "count":333, > "keywords":{ > "buckets":[{ > "val":"value1", > "count":124}, > ... > {code} > I did some [profiling with our Sematext > Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it > points me to OrdinalMap building (see attached screenshot). If I read the > code right, an OrdinalMap is built with every facet. And it's expensive since > there are many unique values in the shard (previously, there were more, smaller > shards, making latency better, but this approach doesn't scale for this > particular use-case).
> If I'm right up to this point, I see a couple of potential improvements, > [inspired from > Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]: > # *Keep the OrdinalMap cached until the next softCommit*, so that only the > first query takes the penalty > # *Allow faceting on actual values (a Map) rather than ordinals*, for > situations like the one above where we have few matching documents. We could > potentially auto-detect this scenario (e.g. by configuring a threshold) and > use a Map when there are few documents > I'm curious about what you're thinking: > * would a PR/patch be welcome for any of the two ideas above? > * do you see better options? am I missing something?
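Idea #1 above (keep the OrdinalMap cached until the next softCommit) can be sketched generically: cache the expensive structure keyed by the searcher generation plus field, so only the first facet per field per generation pays the build cost. The sketch below is illustrative, with hypothetical names rather than Solr's actual API; a real implementation would derive the generation from the current searcher/reader and evict stale entries via reader close listeners.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hedged sketch: cache an expensive per-field structure (a stand-in for
// OrdinalMap) keyed by (generation, field). The generation is assumed to
// change on every softCommit/reopen, so entries from old generations are
// never returned for a new searcher.
public class OrdMapCache {
  private final Map<String, Object> cache = new ConcurrentHashMap<>();

  // builder represents the expensive OrdinalMap construction; it only runs
  // the first time a (generation, field) pair is requested.
  public Object getOrBuild(long generation, String field, Supplier<Object> builder) {
    return cache.computeIfAbsent(generation + ":" + field, k -> builder.get());
  }

  // After a softCommit, the previous generation's entries are garbage;
  // a real implementation would hook this into reader close listeners.
  public void evictGeneration(long generation) {
    cache.keySet().removeIf(k -> k.startsWith(generation + ":"));
  }

  public static void main(String[] args) {
    OrdMapCache cache = new OrdMapCache();
    int[] builds = {0};
    cache.getOrBuild(7L, "keywords", () -> { builds[0]++; return "ordinal-map"; });
    cache.getOrBuild(7L, "keywords", () -> { builds[0]++; return "ordinal-map"; });
    System.out.println(builds[0]); // prints 1: the second lookup hits the cache
  }
}
```

The design choice here mirrors the issue's suggestion: the penalty moves from every facet request to the first request after each softCommit, at the cost of holding the structure in memory for the searcher's lifetime.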