[jira] [Commented] (LUCENE-9286) FST construction explodes memory in BitTable
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077913#comment-17077913 ] Dawid Weiss commented on LUCENE-9286:

Thanks Bruno. I'll review and check with our code as well.

> FST construction explodes memory in BitTable
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 8.5
> Reporter: Dawid Weiss
> Assignee: Bruno Roustant
> Priority: Major
> Attachments: screen-[1].png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed
> for bit tables. I am pretty sure this didn't require so much memory before
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?
[GitHub] [lucene-solr] s1monw commented on a change in pull request #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
s1monw commented on a change in pull request #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly URL: https://github.com/apache/lucene-solr/pull/1397#discussion_r405325601 ## File path: lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java ## @@ -322,35 +316,27 @@ synchronized Closeable lockAndAbortAll() throws IOException { } /** Returns how many documents were aborted. */ - private int abortThreadState(final ThreadState perThread) throws IOException { + private int abortDocumentsWriterPerThread(final DocumentsWriterPerThread perThread) throws IOException { assert perThread.isHeldByCurrentThread(); -if (perThread.isInitialized()) { - try { -int abortedDocCount = perThread.dwpt.getNumDocsInRAM(); -subtractFlushedNumDocs(abortedDocCount); -perThread.dwpt.abort(); -return abortedDocCount; - } finally { -flushControl.doOnAbort(perThread); - } -} else { +try { + int abortedDocCount = perThread.getNumDocsInRAM(); + subtractFlushedNumDocs(abortedDocCount); + perThread.abort(); + return abortedDocCount; +} finally { flushControl.doOnAbort(perThread); - // This DWPT was never initialized so it has no indexed documents: - return 0; } } /** returns the maximum sequence number for all previously completed operations */ public long getMaxCompletedSequenceNumber() { -long value = lastSeqNo; -int limit = perThreadPool.getMaxThreadStates(); -for(int i = 0; i < limit; i++) { - ThreadState perThread = perThreadPool.getThreadState(i); - value = Math.max(value, perThread.lastSeqNo); -} -return value; +// NOCOMMIT: speak to mikemccandless about this change https://github.com/apache/lucene-solr/commit/5a03216/ +// Returning the last seqNum is as good as the way we had before IMO. I tried to figure out why this is better but +// failed. Review comment: cool thanks for clarifying This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610808420

> I'll look again -- in the meantime I beasted all Lucene (core + modules) tests and hit this failure. It does not reproduce, and I doubt it's related to / caused by this change, since it looks like it's a file ref counting issue on exception:

yeah it doesn't look related. I will still look into it unless @dnhatn beats me
[jira] [Commented] (SOLR-14394) Many request handlers should be startup=lazy
[ https://issues.apache.org/jira/browse/SOLR-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077933#comment-17077933 ] Andrzej Bialecki commented on SOLR-14394: - I think this is generally a good idea, just keep in mind that the lazy init needs access to the parent SolrMetricsContext and the handler's scope in order to properly construct metrics names. > Many request handlers should be startup=lazy > > > Key: SOLR-14394 > URL: https://issues.apache.org/jira/browse/SOLR-14394 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Priority: Minor > > Solr cores track a lot of granular metric information per request handler. > This is noise for a request handler that we aren't even using! The vast > majority of request handlers are not used in a typical app. That doesn't > mean they shouldn't be available for use, I'm only saying that they aren't > until perhaps for diagnostic purposes or something. *I think it would be > better to not track metrics on something until we actually use it.* For a > request handler, this is done by setting startup=lazy. Most implicitly > registered plugins should have this set in ImplicitPlugins.json. > CC [~ab] [~noble.paul] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] CaoManhDat commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (W
CaoManhDat commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405356167 ## File path: solr/core/src/java/org/apache/solr/util/numeric/IntFloatDynamicMap.java ## @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.numeric; + +import java.util.Arrays; + +import com.carrotsearch.hppc.IntFloatHashMap; +import com.carrotsearch.hppc.cursors.FloatCursor; +import com.carrotsearch.hppc.procedures.IntFloatProcedure; +import org.apache.lucene.util.ArrayUtil; + +import static org.apache.solr.util.numeric.DynamicMap.mapExpectedElements; +import static org.apache.solr.util.numeric.DynamicMap.threshold; +import static org.apache.solr.util.numeric.DynamicMap.useArrayBased; + +public class IntFloatDynamicMap { + private int maxSize; + private IntFloatHashMap hashMap; + private float[] keyValues; + private float emptyValue; + private int threshold; + + public IntFloatDynamicMap(int expectedMaxSize, float emptyValue) { +this.threshold = threshold(expectedMaxSize); +this.maxSize = expectedMaxSize; +this.emptyValue = emptyValue; +if (useArrayBased(expectedMaxSize)) { + upgradeToArray(); +} else { + this.hashMap = new IntFloatHashMap(mapExpectedElements(expectedMaxSize)); +} + } + + private void upgradeToArray() { +keyValues = new float[maxSize]; +if (emptyValue != 0.0f) { + Arrays.fill(keyValues, emptyValue); +} +if (hashMap != null) { + hashMap.forEach((IntFloatProcedure) (key, value) -> keyValues[key] = value); + hashMap = null; +} + } + + private void growBuffer(int minSize) { +assert keyValues != null; +int size = keyValues.length; +keyValues = ArrayUtil.grow(keyValues, minSize); +if (emptyValue != 0.0f) { + for (int i = size; i < keyValues.length; i++) { +keyValues[i] = emptyValue; + } +} + } + + + public void set(int key, float value) { +if (keyValues != null) { + if (key >= keyValues.length) { +growBuffer(key + 1); + } + keyValues[key] = value; +} else { + this.hashMap.put(key, value); + this.maxSize = Math.max(key, maxSize); + if (this.hashMap.size() > threshold) { Review comment: why it is safer? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] CaoManhDat commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (W
CaoManhDat commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405355937

## File path: solr/core/src/java/org/apache/solr/util/numeric/DynamicMap.java
## @@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.numeric;
+
+import com.carrotsearch.hppc.HashContainers;
+
+public interface DynamicMap {
+  static boolean useArrayBased(int expectedMaxSize) {
+    // for small size, prefer using array based
+    return expectedMaxSize < (1 << 12);
+  }
+
+  static int threshold(int expectedMaxSize) {
+    return expectedMaxSize >>> 6;
+  }
+
+  static int mapExpectedElements(int expectedMaxSize) {
+    return (int) (threshold(expectedMaxSize) / HashContainers.DEFAULT_LOAD_FACTOR);

Review comment:
why it is safer?
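As a quick illustration of the hybrid design under review, here is a usage sketch in plain Java. It assumes only what the diffs above show (the IntFloatDynamicMap constructor and set(int, float) from the PR); the demo class itself is hypothetical:

```java
import org.apache.solr.util.numeric.IntFloatDynamicMap;

public class DynamicMapDemo {
  public static void main(String[] args) {
    // expectedMaxSize < (1 << 12): backed by a plain float[] from the start
    IntFloatDynamicMap dense = new IntFloatDynamicMap(1000, 0f);
    dense.set(42, 1.5f); // direct array write, no hashing

    // larger expectedMaxSize: starts as an IntFloatHashMap while sparse
    IntFloatDynamicMap sparse = new IntFloatDynamicMap(1_000_000, 0f);
    sparse.set(999_999, 2.5f); // hash put
    // once more than (expectedMaxSize >>> 6) keys are set, the map upgrades
    // itself to a float[] indexed by the int key (upgradeToArray in the diff)
  }
}
```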
[jira] [Created] (LUCENE-9309) IW#addIndices(CodecReader) might delete files concurrently to IW#rollback
Simon Willnauer created LUCENE-9309: --- Summary: IW#addIndices(CodecReader) might delete files concurrently to IW#rollback Key: LUCENE-9309 URL: https://issues.apache.org/jira/browse/LUCENE-9309 Project: Lucene - Core Issue Type: Bug Reporter: Simon Willnauer During work on LUCENE-9304 [~mikemccand] ran into a failure: {noformat} org.apache.lucene.index.TestAddIndexes > test suite's output saved to /home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAddIndexes.txt, copied below: > java.nio.file.NoSuchFileException: _gt_Lucene85FieldsIndex-doc_ids_6u.tmp > at __randomizedtesting.SeedInfo.seed([4760FA81FBD4B2CE:A147156E5F7BD9B0]:0) > at org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148) > at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607) > at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) > at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696) > at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690) > at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:449) > at org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2334) > at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2275) > at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2268) > at org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:974) 2> NOTE: reproduce with: ant test -Dtestcase=TestAddIndexes -Dtests.method=testAddIndexesWithRollback -Dtests.seed=4760FA81FBD4B2CE -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fr-GP -Dtests.t\ imezone=Asia/Tbilisi -Dtests.asserts=true -Dtests.file.encoding=UTF-8 2> NOTE: test params are: codec=Asserting(Lucene84): {c=PostingsFormat(name=LuceneFixedGap), id=PostingsFormat(name=LuceneFixedGap), f1=PostingsFormat(name=LuceneFixedGap), f2=BlockTreeOrds(blocksize=128)\ , version=BlockTreeOrds(blocksize=128), content=FST50}, docValues:{dv=DocValuesFormat(name=Lucene80), soft_delete=DocValuesFormat(name=Lucene80), doc=DocValuesFormat(name=Lucene80), id=DocValuesFormat(name=\ Asserting), content=DocValuesFormat(name=Asserting), doc2d=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=982, maxMBSortInHeap=5.837219998050092, sim=Asserting(org.apache.lucene.search.similarities.As\ sertingSimilarity@6ce38471), locale=fr-GP, timezone=Asia/Tbilisi {noformat} While this unfortunately doesn't reproduce it's likely a bug that exists for quite some time but never showed up until LUCENE-9147 which uses a temporary output. That's fine but with IW#addIndices(CodecReader...) not registering the merge it does in the IW we never wait for the merge to finish while rollback and if that merge finishes concurrently it will also remove these .tmp files. There are many ways to fix this and I can work on a patch, but hey do we really need to be able to add indices while we index and do that on an open and live IW or can it be a tool on top of it? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610867265

I opened [LUCENE-9309](https://issues.apache.org/jira/browse/LUCENE-9309) for this
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405405465

## File path: lucene/core/src/java/org/apache/lucene/util/fst/BitTableUtil.java
## @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/**
+ * Static helper methods for {@link FST.Arc.BitTable}.
+ *
+ * @lucene.experimental
+ */
+class BitTableUtil {
+
+  /**
+   * Returns whether the bit at given zero-based index is set.
+   * Example: bitIndex 10 means the third bit on the right of the second byte.
+   *
+   * @param bitIndex The bit zero-based index. It must be greater than or equal to 0, and strictly less than
+   *                 {@code number of bit-table bytes * Byte.SIZE}.
+   * @param reader   The {@link FST.BytesReader} to read. It must be positioned at the beginning of the bit-table.
+   */
+  static boolean isBitSet(int bitIndex, FST.BytesReader reader) throws IOException {
+    assert bitIndex >= 0 : "bitIndex=" + bitIndex;
+    reader.skipBytes(bitIndex >> 3);
+    return (readByte(reader) & (1L << (bitIndex & (Byte.SIZE - 1)))) != 0;
+  }
+
+
+  /**
+   * Counts all bits set in the bit-table.
+   *
+   * @param bitTableBytes The number of bytes in the bit-table.
+   * @param reader        The {@link FST.BytesReader} to read. It must be positioned at the beginning of the bit-table.
+   */
+  static int countBits(int bitTableBytes, FST.BytesReader reader) throws IOException {
+    assert bitTableBytes >= 0 : "bitTableBytes=" + bitTableBytes;

Review comment:
Just wondering - since BytesReader is a DataInput wouldn't it be more efficient to read entire longs + the remainder? I also just noticed read8bytes - this is effectively the same as reader.readLong for bitcounts?
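For reference, the word-at-a-time variant Dawid is suggesting could look roughly like this. This is a sketch against Lucene's public DataInput rather than the package-private FST.BytesReader of the patch, and byte order is irrelevant to a popcount:

```java
import java.io.IOException;
import org.apache.lucene.store.DataInput;

final class WordWisePopCount {
  /** Counts set bits in a bit-table of numBytes bytes: whole longs first, then the tail. */
  static int countBits(int numBytes, DataInput in) throws IOException {
    int bitCount = 0;
    int i = 0;
    for (; i <= numBytes - Long.BYTES; i += Long.BYTES) {
      bitCount += Long.bitCount(in.readLong()); // 8 bytes per read
    }
    for (; i < numBytes; i++) {
      bitCount += Long.bitCount(in.readByte() & 0xFFL); // remaining 0-7 bytes
    }
    return bitCount;
  }
}
```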
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes. URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405422364 ## File path: lucene/core/src/java/org/apache/lucene/util/fst/FSTEnum.java ## @@ -178,7 +179,7 @@ protected void doSeekCeil() throws IOException { } else { if (targetIndex < 0) { targetIndex = -1; - } else if (arc.bitTable().isBitSet(targetIndex)) { Review comment: This is just a side note but it's code like this one that would really benefit from some kind of arc iterator abstraction. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes. URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405421327 ## File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java ## @@ -195,10 +209,9 @@ posArcsStart = other.posArcsStart(); arcIdx = other.arcIdx(); numArcs = other.numArcs(); -if (nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING) { - bitTable = other.bitTable() == null ? null : other.bitTable().copy(); - firstLabel = other.firstLabel(); -} +bitTableStart = other.bitTableStart; Review comment: I don't see the full code in the diff but I believe all fields should be copied (or cleared, depending on the condition) from other. It may be a bit of an overhead but it'd keep internal state of arcs consistent (and debugging easier). I recall this code was copying just the required fields depending on nodeFlags() - it should clear or copy everything, really. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-9286:

    Summary: FST arc.copyOf clones BitTables and this can lead to excessive memory use  (was: FST construction explodes memory in BitTable)

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 8.5
> Reporter: Dawid Weiss
> Assignee: Bruno Roustant
> Priority: Major
> Attachments: screen-[1].png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed
> for bit tables. I am pretty sure this didn't require so much memory before
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?
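The new title pins down the mechanism: FST.Arc.copyOf() was cloning the direct-addressing bit table on every arc copy, so hot traversal paths turned one table per node into one allocation per copy. A schematic contrast with hypothetical field names (the actual fix, per PR #1416 above, keeps a bitTableStart offset into the FST bytes instead of a cloned table):

{code}
// Before: every copy clones the table, O(tableBytes) allocation per copyOf.
class ArcBefore {
  byte[] bitTable; // hypothetical field
  ArcBefore copyOf(ArcBefore other) {
    this.bitTable = (other.bitTable == null) ? null : other.bitTable.clone();
    return this;
  }
}

// After (direction of the fix): remember only where the table lives in the FST bytes.
class ArcAfter {
  long bitTableStart; // offset into the FST's byte store
  ArcAfter copyOf(ArcAfter other) {
    this.bitTableStart = other.bitTableStart; // O(1), nothing cloned
    return this;
  }
}
{code}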
[jira] [Commented] (LUCENE-9266) ant nightly-smoke fails due to presence of build.gradle
[ https://issues.apache.org/jira/browse/LUCENE-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078061#comment-17078061 ] ASF subversion and git services commented on LUCENE-9266: - Commit 793a3becfbde09d1d48c6aa2d189ea677fca7ac2 in lucene-solr's branch refs/heads/master from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=793a3be ] LUCENE-9266: correct windows gradle wrapper download script - wrong placement of the quote. > ant nightly-smoke fails due to presence of build.gradle > --- > > Key: LUCENE-9266 > URL: https://issues.apache.org/jira/browse/LUCENE-9266 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Fix For: master (9.0) > > Time Spent: 6.5h > Remaining Estimate: 0h > > Seen on Jenkins - > [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-master/1617/console] > > Reproduced locally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9309) IW#addIndices(CodecReader) might delete files concurrently to IW#rollback
[ https://issues.apache.org/jira/browse/LUCENE-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078070#comment-17078070 ] Uwe Schindler commented on LUCENE-9309: --- I do this quite often with codec reader. I would be fine to somehow prevent concurrent indexing, but leave that api available. No tool, please. > IW#addIndices(CodecReader) might delete files concurrently to IW#rollback > - > > Key: LUCENE-9309 > URL: https://issues.apache.org/jira/browse/LUCENE-9309 > Project: Lucene - Core > Issue Type: Bug >Reporter: Simon Willnauer >Priority: Major > > During work on LUCENE-9304 [~mikemccand] ran into a failure: > {noformat} > org.apache.lucene.index.TestAddIndexes > test suite's output saved to > /home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAddIndexes.txt, > copied below: >> java.nio.file.NoSuchFileException: > _gt_Lucene85FieldsIndex-doc_ids_6u.tmp >> at > __randomizedtesting.SeedInfo.seed([4760FA81FBD4B2CE:A147156E5F7BD9B0]:0) >> at > org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148) >> at > org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607) >> at > org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) >> at > org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696) >> at > org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690) >> at > org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:449) >> at > org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2334) >> at > org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2275) >> at > org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2268) >> at > org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:974) > 2> NOTE: reproduce with: ant test -Dtestcase=TestAddIndexes > -Dtests.method=testAddIndexesWithRollback -Dtests.seed=4760FA81FBD4B2CE > -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fr-GP -Dtests.t\ > imezone=Asia/Tbilisi -Dtests.asserts=true -Dtests.file.encoding=UTF-8 > 2> NOTE: test params are: codec=Asserting(Lucene84): > {c=PostingsFormat(name=LuceneFixedGap), > id=PostingsFormat(name=LuceneFixedGap), > f1=PostingsFormat(name=LuceneFixedGap), f2=BlockTreeOrds(blocksize=128)\ > , version=BlockTreeOrds(blocksize=128), content=FST50}, > docValues:{dv=DocValuesFormat(name=Lucene80), > soft_delete=DocValuesFormat(name=Lucene80), > doc=DocValuesFormat(name=Lucene80), id=DocValuesFormat(name=\ > Asserting), content=DocValuesFormat(name=Asserting), > doc2d=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=982, > maxMBSortInHeap=5.837219998050092, > sim=Asserting(org.apache.lucene.search.similarities.As\ > sertingSimilarity@6ce38471), locale=fr-GP, timezone=Asia/Tbilisi > {noformat} > While this unfortunately doesn't reproduce it's likely a bug that exists for > quite some time but never showed up until LUCENE-9147 which uses a temporary > output. That's fine but with IW#addIndices(CodecReader...) not registering > the merge it does in the IW we never wait for the merge to finish while > rollback and if that merge finishes concurrently it will also remove these > .tmp files. 
> There are many ways to fix this and I can work on a patch, but hey do we > really need to be able to add indices while we index and do that on an open > and live IW or can it be a tool on top of it? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078075#comment-17078075 ] Dawid Weiss edited comment on LUCENE-9286 at 4/8/20, 11:08 AM:
---

I love the patch and the idea, Bruno. But the cost of construction and traversal has gone way up for me. On a slower-ish dell laptop I get the following on master (same code as in that tiny repo before):

{code}
FST construction (of=0.8)   2s 299ms
TermEnum scan (of=0.8)      391ms
FST construction (of=1.0)   5s
TermEnum scan (of=1.0)      3s 938ms
{code}

whereas after the patch compilation and enumeration go up to a whopping 35+ seconds!

{code}
FST construction (of=0.8)   2s 357ms
TermEnum scan (of=0.8)      457ms
FST construction (of=1.0)   37s
TermEnum scan (of=1.0)      35s
{code}

This is a bit strange, isn't it? I don't think I've changed anything regarding fst parameters:
https://github.com/dweiss/lucene9286/commit/1fb899e018712e9637984d95937c50b4bc9ffa97#diff-32e7e4311056421eef9f3a87a4ec51f7R56-R66

I'm not sure where the difference comes from and why it's so slow -- I'll try to get an execution profile later today.

was (Author: dweiss):
I love the patch and the idea, Bruno. But the cost of construction and traversal have gone high up for me. On a slower-ish dell laptop I get the following on master (same code as in that tiny repo before):

{code}
FST construction (of=0.8)   2s 299ms    9.4%    11s
TermEnum scan (of=0.8)      391ms       1.6%    14s
FST construction (of=1.0)   5s          24.0%   14s
TermEnum scan (of=1.0)      3s 938ms    16.1%   20s
{code}

whereas after the patch compilation and enumeration goes up to a whopping 35+ seconds!

{code}
FST construction (of=0.8)   2s 357ms    2.6%    13s
TermEnum scan (of=0.8)      457ms       0.5%    16s
FST construction (of=1.0)   37s         41.8%   16s
TermEnum scan (of=1.0)      35s         39.6%   54s
{code}

This is a bit strange, isn't it? I don't think I've changed anything regarding fst parameters:
https://github.com/dweiss/lucene9286/commit/1fb899e018712e9637984d95937c50b4bc9ffa97#diff-32e7e4311056421eef9f3a87a4ec51f7R56-R66

I'm not sure where the difference comes from and why it's so slow -- I'll try to get an execution profile later today.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 8.5
> Reporter: Dawid Weiss
> Assignee: Bruno Roustant
> Priority: Major
> Attachments: screen-[1].png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed
> for bit tables. I am pretty sure this didn't require so much memory before
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?
[jira] [Comment Edited] (SOLR-14394) Many request handlers should be startup=lazy
[ https://issues.apache.org/jira/browse/SOLR-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078080#comment-17078080 ] Noble Paul edited comment on SOLR-14394 at 4/8/20, 11:13 AM: - I wonder if it's worth it to make implicit plugins lazily register. They are extremely low cost request handlers and I don't see the value in doing so However, {{startup=lazy}} should work properly was (Author: noble.paul): I wonder if it's worth it to make them lazily register. They are extremely low cost request handlers and I don't see the value in doing so However, {{startup=lazy}} should work properly > Many request handlers should be startup=lazy > > > Key: SOLR-14394 > URL: https://issues.apache.org/jira/browse/SOLR-14394 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Priority: Minor > > Solr cores track a lot of granular metric information per request handler. > This is noise for a request handler that we aren't even using! The vast > majority of request handlers are not used in a typical app. That doesn't > mean they shouldn't be available for use, I'm only saying that they aren't > until perhaps for diagnostic purposes or something. *I think it would be > better to not track metrics on something until we actually use it.* For a > request handler, this is done by setting startup=lazy. Most implicitly > registered plugins should have this set in ImplicitPlugins.json. > CC [~ab] [~noble.paul] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique value
bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405406935 ## File path: solr/core/src/java/org/apache/solr/util/numeric/IntFloatDynamicMap.java ## @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.numeric; + +import java.util.Arrays; + +import com.carrotsearch.hppc.IntFloatHashMap; +import com.carrotsearch.hppc.cursors.FloatCursor; +import com.carrotsearch.hppc.procedures.IntFloatProcedure; +import org.apache.lucene.util.ArrayUtil; + +public class IntFloatDynamicMap implements DynamicMap { + private int maxSize; + private IntFloatHashMap hashMap; + private float[] keyValues; + private float emptyValue; + private int threshold; + + /** + * Create map with expected max value of key. + * Although the map will automatically do resizing to be able to hold key >= {@code expectedKeyMax}. + * But putting key much larger than {@code expectedKeyMax} is discourage since it can leads to use LOT OF memory. Review comment: We can be explicit and say: more than key x 4B, so for a max key 100M it may allocate more than 400 MB. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique value
bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405403028 ## File path: solr/core/src/java/org/apache/solr/util/numeric/FloatConsumer.java ## @@ -0,0 +1,23 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.numeric; + +@FunctionalInterface +public interface FloatConsumer { Review comment: Javadoc: similar to java.util.function.IntConsumer This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique value
bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405411173 ## File path: solr/core/src/java/org/apache/solr/util/numeric/IntFloatDynamicMap.java ## @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.numeric; + +import java.util.Arrays; + +import com.carrotsearch.hppc.IntFloatHashMap; +import com.carrotsearch.hppc.cursors.FloatCursor; +import com.carrotsearch.hppc.procedures.IntFloatProcedure; +import org.apache.lucene.util.ArrayUtil; + +import static org.apache.solr.util.numeric.DynamicMap.mapExpectedElements; +import static org.apache.solr.util.numeric.DynamicMap.threshold; +import static org.apache.solr.util.numeric.DynamicMap.useArrayBased; + +public class IntFloatDynamicMap { + private int maxSize; + private IntFloatHashMap hashMap; + private float[] keyValues; + private float emptyValue; + private int threshold; + + public IntFloatDynamicMap(int expectedMaxSize, float emptyValue) { +this.threshold = threshold(expectedMaxSize); +this.maxSize = expectedMaxSize; +this.emptyValue = emptyValue; +if (useArrayBased(expectedMaxSize)) { + upgradeToArray(); +} else { + this.hashMap = new IntFloatHashMap(mapExpectedElements(expectedMaxSize)); +} + } + + private void upgradeToArray() { +keyValues = new float[maxSize]; +if (emptyValue != 0.0f) { + Arrays.fill(keyValues, emptyValue); +} +if (hashMap != null) { + hashMap.forEach((IntFloatProcedure) (key, value) -> keyValues[key] = value); + hashMap = null; +} + } + + private void growBuffer(int minSize) { +assert keyValues != null; +int size = keyValues.length; +keyValues = ArrayUtil.grow(keyValues, minSize); +if (emptyValue != 0.0f) { + for (int i = size; i < keyValues.length; i++) { +keyValues[i] = emptyValue; + } +} + } + + + public void set(int key, float value) { +if (keyValues != null) { + if (key >= keyValues.length) { Review comment: The javadoc in the constructor is good, but I'm concerned that this util class may be used elsewhere without clearly reading/understanding the risk. Mainly because the class is named Map, and a Map generally can take any key value. Could we create a second constructor with a third param, something like "hardLimitKeyMax" which would be checked here otherwise throw an IllegalArgumentException? It could have a default limit of 10x expectedKeyMax in the first constructor for example. And it could be increased if someone knows what she is doing. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
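A minimal sketch of the guard being proposed here; the hardLimitKeyMax field, the 10x default, and the exception message are illustrative assumptions, not code from the PR:

```java
public class IntFloatDynamicMapSketch {
  private final int hardLimitKeyMax; // hypothetical guard field

  public IntFloatDynamicMapSketch(int expectedKeyMax, float emptyValue) {
    // default limit of 10x expectedKeyMax, clamped to avoid int overflow
    this(expectedKeyMax, emptyValue, (int) Math.min(Integer.MAX_VALUE, 10L * expectedKeyMax));
  }

  public IntFloatDynamicMapSketch(int expectedKeyMax, float emptyValue, int hardLimitKeyMax) {
    this.hardLimitKeyMax = hardLimitKeyMax;
    // ... the existing array/hash initialization would go here ...
  }

  public void set(int key, float value) {
    if (key >= hardLimitKeyMax) {
      throw new IllegalArgumentException(
          "key=" + key + " exceeds hardLimitKeyMax=" + hardLimitKeyMax);
    }
    // ... the existing array/hash set logic would go here ...
  }
}
```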
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique value
bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405478717

## File path: solr/core/src/java/org/apache/solr/util/numeric/IntFloatDynamicMap.java
## @@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.numeric;
+
+import java.util.Arrays;
+
+import com.carrotsearch.hppc.IntFloatHashMap;
+import com.carrotsearch.hppc.cursors.FloatCursor;
+import com.carrotsearch.hppc.procedures.IntFloatProcedure;
+import org.apache.lucene.util.ArrayUtil;
+
+import static org.apache.solr.util.numeric.DynamicMap.mapExpectedElements;
+import static org.apache.solr.util.numeric.DynamicMap.threshold;
+import static org.apache.solr.util.numeric.DynamicMap.useArrayBased;
+
+public class IntFloatDynamicMap {
+  private int maxSize;
+  private IntFloatHashMap hashMap;
+  private float[] keyValues;
+  private float emptyValue;
+  private int threshold;
+
+  public IntFloatDynamicMap(int expectedMaxSize, float emptyValue) {
+    this.threshold = threshold(expectedMaxSize);
+    this.maxSize = expectedMaxSize;
+    this.emptyValue = emptyValue;
+    if (useArrayBased(expectedMaxSize)) {
+      upgradeToArray();
+    } else {
+      this.hashMap = new IntFloatHashMap(mapExpectedElements(expectedMaxSize));
+    }
+  }
+
+  private void upgradeToArray() {
+    keyValues = new float[maxSize];
+    if (emptyValue != 0.0f) {
+      Arrays.fill(keyValues, emptyValue);
+    }
+    if (hashMap != null) {
+      hashMap.forEach((IntFloatProcedure) (key, value) -> keyValues[key] = value);
+      hashMap = null;
+    }
+  }
+
+  private void growBuffer(int minSize) {
+    assert keyValues != null;
+    int size = keyValues.length;
+    keyValues = ArrayUtil.grow(keyValues, minSize);
+    if (emptyValue != 0.0f) {
+      for (int i = size; i < keyValues.length; i++) {
+        keyValues[i] = emptyValue;
+      }
+    }
+  }
+
+
+  public void set(int key, float value) {
+    if (keyValues != null) {
+      if (key >= keyValues.length) {
+        growBuffer(key + 1);
+      }
+      keyValues[key] = value;
+    } else {
+      this.hashMap.put(key, value);
+      this.maxSize = Math.max(key, maxSize);
+      if (this.hashMap.size() > threshold) {

Review comment:
Did you test and debug when the DynamicMap upgrades from a map to an array internally? I mean in debug mode step by step here. I think the map first enlarges and rehashes just before the upgrade to an array. Let's take an example with expectedKeyMax = 500K.

this.threshold = threshold(expectedKeyMax) = 500K / 64 = 7812
IntFloatHashMap initial capacity = mapExpectedElements(expectedKeyMax) = (int) (threshold(expectedKeyMax) / 0.75f) = (int) (7812 / 0.75f) = 10416
IntFloatHashMap internal threshold = ceil(initial capacity * 0.75) = ceil(10416 * 0.75) = 7812

Internally the HPPC map enlarges during a put() when its size == 7812 *before* incrementing the size. Here the condition to upgrade to an array triggers when the map size *after* the put is > 7812, so at 7813. So I think the map first enlarges and rehashes just before we upgrade to an array, which would be wasteful.

Also, the map internal threshold is ceil(initial capacity * 0.75), but it could be without ceil() for other implementations. To be safe wrt the float rounding, I suggested to add +1 in DynamicMap.mapExpectedElements(int), but it is probably better to be safe here.
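The arithmetic in this comment can be checked mechanically. A small self-contained snippet, assuming HPPC's default load factor of 0.75:

```java
public class ThresholdCheck {
  public static void main(String[] args) {
    int expectedKeyMax = 500_000;
    int threshold = expectedKeyMax >>> 6;                        // 7812
    int expectedElements = (int) (threshold / 0.75f);            // 10416
    int hppcResizeAt = (int) Math.ceil(expectedElements * 0.75); // 7812
    // The upgrade-to-array condition fires only when size > threshold,
    // i.e. at size 7813: one put *after* HPPC has already grown and
    // rehashed at size 7812.
    System.out.println(threshold + " " + expectedElements + " " + hppcResizeAt);
  }
}
```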
[GitHub] [lucene-solr] mikemccand commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
mikemccand commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610930218 Hmm, here's a more exotic test failure, also likely not caused by the changes here. The fun things you learn when beasting on a 128-core box :) Though it is remotely possible my storage device or ECC RAM is flipping bits:
```
org.apache.lucene.index.TestIndexManyDocuments > test suite's output saved to /home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexManyDocuments.txt, copied below:
  1> CheckIndex failed
  1> 0.00% total deletions; 10976 documents; 0 deletions
  1> Segments file=segments_1 numSegments=4 version=9.0.0 id=cekay2d5izae12ssuqikoqgoc
  1>   1 of 4: name=_d maxDoc=8700
  1>     version=9.0.0
  1>     id=cekay2d5izae12ssuqikoqgob
  1>     codec=Asserting(Lucene84)
  1>     compound=false
  1>     numFiles=11
  1>     size (MB)=0.003
  1>     diagnostics = {os.version=5.5.6-arch1-1, java.vendor=Oracle Corporation, source=merge, os.arch=amd64, mergeFactor=10, java.runtime.version=11.0.6+8-LTS, os=Linux, timestamp=1586347074798, lucene.version=9.0.0, java.vm.version=11.0.6+8-LTS, java.version=11.0.6, mergeMaxNumSegments=-1}
  1>     no deletions
  1>     test: open reader.OK [took 0.001 sec]
  1>     test: check integrity.OK [took 0.000 sec]
  1>     test: check live docs.OK [took 0.000 sec]
  1>     test: field infos.OK [1 fields] [took 0.000 sec]
  1>     test: field norms.OK [1 fields] [took 0.001 sec]
  1>     test: terms, freq, prox...ERROR: java.lang.AssertionError: buffer=java.nio.HeapByteBuffer[pos=0 lim=0 cap=0] bufferSize=1024 buffer.length=0
  1> java.lang.AssertionError: buffer=java.nio.HeapByteBuffer[pos=0 lim=0 cap=0] bufferSize=1024 buffer.length=0
  1>    at org.apache.lucene.store.BufferedIndexInput.setBufferSize(BufferedIndexInput.java:78)
  1>    at org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:241)
  1>    at org.apache.lucene.codecs.MultiLevelSkipListReader.init(MultiLevelSkipListReader.java:208)
  1>    at org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init(Lucene84SkipReader.java:103)
  1>    at org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$EverythingEnum.advance(Lucene84PostingsReader.java:837)
  1>    at org.apache.lucene.index.FilterLeafReader$FilterPostingsEnum.advance(FilterLeafReader.java:271)
  1>    at org.apache.lucene.index.AssertingLeafReader$AssertingPostingsEnum.advance(AssertingLeafReader.java:377)
  1>    at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1426)
  1>    at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1867)
  1>    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:720)
  1>    at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
  1>    at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:286)
  1>    at org.apache.lucene.store.BaseDirectoryWrapper.close(BaseDirectoryWrapper.java:45)
  1>    at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
  1>    at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
  1>    at org.apache.lucene.index.TestIndexManyDocuments.test(TestIndexManyDocuments.java:69)
  1>    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  1>    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  1>    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  1>    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  1>    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
  1>    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
  1>    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
  1>    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
  1>    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  1>    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  1>    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  1>    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  1>    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  1>    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  1>    at com.carrotsearch.randomizedtesting.
```
[GitHub] [lucene-solr] uschindler commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
uschindler commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610937953 This could be related to Adrien's changes yesterday: https://issues.apache.org/jira/browse/LUCENE-9271
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes. URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405516244

## File path: lucene/core/src/java/org/apache/lucene/util/fst/BitTableUtil.java ##

@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/**
+ * Static helper methods for {@link FST.Arc.BitTable}.
+ *
+ * @lucene.experimental
+ */
+class BitTableUtil {
+
+  /**
+   * Returns whether the bit at given zero-based index is set.
+   * Example: bitIndex 10 means the third bit on the right of the second byte.
+   *
+   * @param bitIndex The bit zero-based index. It must be greater than or equal to 0, and strictly less than
+   *                 {@code number of bit-table bytes * Byte.SIZE}.
+   * @param reader   The {@link FST.BytesReader} to read. It must be positioned at the beginning of the bit-table.
+   */
+  static boolean isBitSet(int bitIndex, FST.BytesReader reader) throws IOException {
+    assert bitIndex >= 0 : "bitIndex=" + bitIndex;
+    reader.skipBytes(bitIndex >> 3);
+    return (readByte(reader) & (1L << (bitIndex & (Byte.SIZE - 1)))) != 0;
+  }
+
+  /**
+   * Counts all bits set in the bit-table.
+   *
+   * @param bitTableBytes The number of bytes in the bit-table.
+   * @param reader        The {@link FST.BytesReader} to read. It must be positioned at the beginning of the bit-table.
+   */
+  static int countBits(int bitTableBytes, FST.BytesReader reader) throws IOException {
+    assert bitTableBytes >= 0 : "bitTableBytes=" + bitTableBytes;

Review comment:

> wouldn't it be more efficient to read entire longs + the remainder?

Yes, good point. I don't know why I didn't do it the same way as countBitsUpTo() below; that effectively avoids many conditions. I'll change that.

> read8bytes - this is effectively the same as reader.readLong for bitcounts?

Not the same: there is a difference in the byte order. reader.readLong() reads 2 ints, high bytes first; read8Bytes() reads 8 bytes with the low byte first. This matters when shifting by the bit index. I agree that it does not matter here for the bit count, but this way read8Bytes() stays compatible with future use of a bit index. In addition it uses fewer operations, since it requires one less bit shift and bit mask.
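To make the shape under discussion concrete, here is a minimal sketch of the long-at-a-time counting approach: the method names mirror the patch being reviewed, but the bodies are illustrative, not the committed Lucene code, and `FST.BytesReader` is assumed available as in the package above.
```java
import java.io.IOException;

class BitCountSketch {

  // Count set bits by consuming whole longs first, then the remaining tail bytes.
  // The bulk path pays one loop condition per 8 bytes instead of one per byte.
  static int countBits(int bitTableBytes, FST.BytesReader reader) throws IOException {
    int bitCount = 0;
    for (int i = bitTableBytes >> 3; i > 0; i--) {
      bitCount += Long.bitCount(read8Bytes(reader));
    }
    for (int i = bitTableBytes & (8 - 1); i > 0; i--) {
      bitCount += Long.bitCount(readByte(reader));
    }
    return bitCount;
  }

  // Little-endian composition: the low byte comes first, so bit i of the table
  // lands at word (i >> 6), bit (i & 63). That is what matters when shifting by
  // a bit index, and why this is not the same as reader.readLong().
  static long read8Bytes(FST.BytesReader reader) throws IOException {
    return readByte(reader)
        | readByte(reader) << 8
        | readByte(reader) << 16
        | readByte(reader) << 24
        | readByte(reader) << 32
        | readByte(reader) << 40
        | readByte(reader) << 48
        | readByte(reader) << 56;
  }

  // Widen a single byte to an unsigned long so the shifts above are well-defined.
  static long readByte(FST.BytesReader reader) throws IOException {
    return reader.readByte() & 0xFFL;
  }
}
```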
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes. URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405518985

## File path: lucene/core/src/java/org/apache/lucene/util/fst/FSTEnum.java ##

@@ -178,7 +179,7 @@ protected void doSeekCeil() throws IOException {
       } else {
         if (targetIndex < 0) {
           targetIndex = -1;
-        } else if (arc.bitTable().isBitSet(targetIndex)) {

Review comment: I agree. I'd like to review FSTEnum soon: first to refactor and share common code like this, and second because I think there is still room for some perf improvement in seek floor/ceil.
[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes. URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405530113

## File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java ##

@@ -195,10 +209,9 @@
       posArcsStart = other.posArcsStart();
       arcIdx = other.arcIdx();
       numArcs = other.numArcs();
-      if (nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING) {
-        bitTable = other.bitTable() == null ? null : other.bitTable().copy();
-        firstLabel = other.firstLabel();
-      }
+      bitTableStart = other.bitTableStart;

Review comment: Ok. You've debugged this code for a long time so now you know :) I'll change that and I'll put a comment to explain.
[GitHub] [lucene-solr] dsmiley commented on issue #1412: Add MinimalSolrTest for scale testing
dsmiley commented on issue #1412: Add MinimalSolrTest for scale testing URL: https://github.com/apache/lucene-solr/pull/1412#issuecomment-610972348 We *could* have hard timeouts if they are run by a specific CI machine, perhaps @sarowe's real hardware? Before this gets committed, we need to ensure it is not run _yet_ by default, because it isn't asserting anything.
[jira] [Commented] (SOLR-14210) Add replica state option for HealthCheckHandler
[ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078326#comment-17078326 ] Jan Høydahl commented on SOLR-14210:

bq. Why wouldn't you add the additional param that's in Javadocs to the Ref Guide?

I could not find other examples of http params on that page, and wanted to stay DRY and just link to the Javadocs.

bq. If it is better to have users review Javadocs instead of adding to the Ref Guide, perhaps you could make it a link instead of making them go find the Javadocs and then find the class? The link syntax would be like this:

The "Class & Javadocs" column of the table already provides a link to the Javadocs of that class. I could of course repeat the same link inline in that new paragraph for clarity.

> Add replica state option for HealthCheckHandler
> ------------------------------------------------
>
>                 Key: SOLR-14210
>                 URL: https://issues.apache.org/jira/browse/SOLR-14210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 8.5
>            Reporter: Houston Putman
>            Assignee: Jan Høydahl
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: docs.patch
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveliness' and 'readiness' probes, explained in
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/], to
> determine if a node is live and ready to serve live traffic.
> {quote}
> However there are issues around kubernetes managing its own rolling restarts. With the current healthcheck setup, it's easy to envision a scenario in which Solr reports itself as "healthy" when all of its replicas are actually recovering. Therefore kubernetes, seeing a healthy pod, would then go and restart the next Solr node. This can happen until all replicas are "recovering" and none are healthy. (maybe the last one restarted will be "down", but still there are no "active" replicas)
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all replicas hosted by that Solr node are healthy and "active". That way we will be able to use the [default kubernetes rolling restart logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies] with Solr.
> To add on to [Jan's point here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559], this handler should be more friendly for other Content-Types and should use better HTTP response statuses.
[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910

> The javadoc in the constructor is good, but I'm concerned that this util class may be used elsewhere without clearly reading/understanding the risk

I'm not very concerned about this point. If we were going to put this class in Guava or some place like that, I think it would be worth spending more time on documentation or getting the API right. But these classes will get used in Solr, and whoever uses `DynamicMap` must have a clear idea of what they are using (why they want `DynamicMap` instead of HPPC maps or Java maps). I just want to avoid introducing more logic into these classes, since one change needs to be propagated and maintained in the others.

> Did you test and debug when the DynamicMap upgrades from a map to an array internally? I mean in debug mode step by step here. I think the map first enlarges and rehashes just before the upgrade to an array.

It seems that your calculation missed the part that arraySize must be a power of two and `initialCapacity` is not equal to `expectedElements`. So, working backward: `arraySize=1024` -> `mapExpectedElements=768` (arraySize = expectedElements / 0.75) -> `threshold = 768 * 0.75 = 576` -> `resizeAt=768` (arraySize * loadFactor). I realized that `threshold < mapExpectedElements <= resizeAt`, so we can actually compute maxExpectedElements = threshold - 2, right?
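A small worked sketch of the sizing arithmetic being debated (illustrative numbers only; `DynamicMap` is the patch's class and is not shown here, and the 0.75 load factor mirrors the HPPC default assumed in the thread):
```java
public class DynamicMapSizingExample {
  public static void main(String[] args) {
    double loadFactor = 0.75;
    int mapExpectedElements = 768;

    // HPPC-style maps round the backing array up to a power of two large
    // enough to hold expectedElements at the load factor: 768 / 0.75 = 1024.
    int arraySize = 1024;

    // The map only enlarges + rehashes once it holds more than arraySize * loadFactor entries.
    int resizeAt = (int) (arraySize * loadFactor); // 768

    // The upgrade-to-array point from the discussion: expectedElements * loadFactor.
    int threshold = (int) (mapExpectedElements * loadFactor); // 576

    // threshold (576) < resizeAt (768), so the upgrade to a plain array
    // happens before the map would waste work on a rehash.
    System.out.println("arraySize=" + arraySize + " resizeAt=" + resizeAt + " threshold=" + threshold);
  }
}
```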
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects
mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405603524

## File path: gradle/render-javadoc.gradle ##

@@ -15,93 +15,105 @@
  * limitations under the License.
  */

-// generate javadocs by using Ant javadoc task
+// generate javadocs by calling javadoc tool
+// see https://docs.oracle.com/en/java/javase/11/tools/javadoc.html
+
+// utility function to convert project path to document output dir
+// e.g.: ':lucene:analysis:common' => 'analysis/common'
+def pathToDocdir = { path -> path.split(':').drop(2).join('/') }

 allprojects {
   plugins.withType(JavaPlugin) {
-    ext {
-      javadocRoot = project.path.startsWith(':lucene') ? project(':lucene').file("build/docs") : project(':solr').file("build/docs")
-      javadocDestDir = "${javadocRoot}/${project.name}"
-    }
-
     task renderJavadoc {
-      description "Generates Javadoc API documentation for the main source code. This invokes Ant Javadoc Task."
+      description "Generates Javadoc API documentation for the main source code. This directly invokes javadoc tool."
       group "documentation"

       ext {
-        linksource = "no"
+        linksource = false
         linkJUnit = false
-        linkHref = []
+        linkLuceneProjects = []
+        linkSorlProjects = []
       }

       dependsOn sourceSets.main.compileClasspath

       inputs.files { sourceSets.main.java.asFileTree }
-      outputs.dir project.javadocRoot
+      outputs.dir project.javadoc.destinationDir

       def libName = project.path.startsWith(":lucene") ? "Lucene" : "Solr"
       def title = "${libName} ${project.version} ${project.name} API".toString()

+      // absolute urls for "-linkoffline" option
+      def javaSEDocUrl = "https://docs.oracle.com/en/java/javase/11/docs/api/"
+      def junitDocUrl = "https://junit.org/junit4/javadoc/4.12/"
+      def luceneDocUrl = "https://lucene.apache.org/core/${project.version.replace(".", "_")}".toString()
+      def solrDocUrl = "https://lucene.apache.org/solr/${project.version.replace(".", "_")}".toString()
+
+      def javadocCmd = org.gradle.internal.jvm.Jvm.current().getJavadocExecutable()

Review comment: I did a few more experiments with `org.gradle.internal.jvm.Jvm.current()`, which is used for both compilation and test execution; its search path is:
1. org.gradle.java.home
2. $JAVA_HOME
3. user's default java (on $PATH)

It's consistent with their documentation. Elasticsearch's custom build plugin takes a completely different search strategy from Gradle's:
1. "compiler.java" system property
2. $JAVA_HOME
3. org.gradle.java.home
4. user's default java (on $PATH)

(I didn't run it but just interpreted this method: https://github.com/elastic/elasticsearch/blob/master/buildSrc/src/main/java/org/elasticsearch/gradle/info/GlobalBuildInfoPlugin.java#L209)
[GitHub] [lucene-solr] bruno-roustant commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
bruno-roustant commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611020236 Yes I missed the power of 2. So I'll just let you double check this works without wasteful map resize.
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects
mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405606070

## File path: gradle/render-javadoc.gradle ##

Review comment: I am going to merge it to the master branch, since I think I understand what I did here with `org.gradle.internal.jvm.Jvm.current()`.
[GitHub] [lucene-solr] sigram opened a new pull request #1417: SOLR-12847: Auto-create a policy rule that corresponds to maxShardsPerNode
sigram opened a new pull request #1417: SOLR-12847: Auto-create a policy rule that corresponds to maxShardsPerNode URL: https://github.com/apache/lucene-solr/pull/1417

# Description

Please provide a short description of the changes you're making with this pull request.

# Solution

Please provide a short description of the approach taken to implement your solution.

# Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.

# Checklist

Please review the following and check all that apply:

- [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `master` branch.
- [ ] I have run `ant precommit` and the appropriate test suite.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078431#comment-17078431 ] Bruno Roustant commented on LUCENE-9286:

That's strange. In the PR I integrated your code to recompile and walk the FST. See TestFSTDirectAddressing.main() with "-recompileAndWalk" as the first arg and the path to an FST as the second. I used the FST zip you provided, "fst-17291407798783309064.fst.gz". Before the patch, I got roughly the same perf as you got on your side and shared previously. Then with the patch, I can verify that the perf is fixed:
{code:java}
Reading FST time = 402 ms
FST construction (oversizingFactor=0.0) time = 1302 ms
FST RAM = 56055936 B
FST enum time = 322 ms
FST construction (oversizingFactor=1.0) time = 1235 ms
FST RAM = 54945816 B
FST enum time = 239 ms
{code}
Can you run this TestFSTDirectAddressing.main()? I ran it on the master branch. Should I run it on branch 8x to reproduce your env?

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-9286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9286
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.5
>            Reporter: Dawid Weiss
>            Assignee: Bruno Roustant
>            Priority: Major
>         Attachments: screen-[1].png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction of (arguably large) automata. It currently OOMs with 8GB of memory consumed for bit tables. I am pretty sure this didn't require so much memory before (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?
[GitHub] [lucene-solr] bringyou commented on issue #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates
bringyou commented on issue #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates URL: https://github.com/apache/lucene-solr/pull/1389#issuecomment-611049924

> Change looks good to me. Would you mind adding a small test for this issue? Thanks @bringyou!

Sorry for the delay~ I added a test for `BufferedUpdates` and changed a bit more code; please take another look @dnhatn
[jira] [Commented] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests
[ https://issues.apache.org/jira/browse/SOLR-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078488#comment-17078488 ] David Smiley commented on SOLR-13979:

bq. These metrics are added to all handlers that inherit from RequestHandlerBase. This may be probably a bit too many

Yes, it's too many; distrib is implemented by SearchHandler. I feel that defaulting to tracking distrib metrics on the vast majority of request handlers that are not SearchHandlers pollutes Solr metrics with junk.

> Expose separate metrics for distributed and non-distributed requests
> ---------------------------------------------------------------------
>
>                 Key: SOLR-13979
>                 URL: https://issues.apache.org/jira/browse/SOLR-13979
>             Project: Solr
>          Issue Type: Bug
>          Components: metrics
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.4
>
>         Attachments: SOLR-13979.patch
>
> Currently we expose metrics such as count, rate and latency on a per handler level, however for search requests there is no distinction made for distrib vs non-distrib requests. This means that there is no way to find the count, rate or latency of only user-sent queries.
> I propose that we expose distrib vs non-distrib metrics separately.
[jira] [Commented] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests
[ https://issues.apache.org/jira/browse/SOLR-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078489#comment-17078489 ] David Smiley commented on SOLR-13979: Also, distributed search long predated SolrCloud, so why should this be a SolrCloud-dependent switch?
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078491#comment-17078491 ] Dawid Weiss commented on LUCENE-9286: Let me double check and get back to you. Sorry for the delays, lots of things going on at home when you're locked up with three kids.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.
dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes. URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405681455

## File path: lucene/core/src/java/org/apache/lucene/util/fst/BitTableUtil.java ##

Review comment: I noticed the byte order difference (note the "for bitcounts" bit). :) My gut feeling is that pushing reads so that they're aggregated first, followed by a bitcount, will still give you a performance improvement. A bit shift and a bit mask are probably dwarfed when hotspot compiles and inlines all this, but single-byte get() methods with conditionals inside will typically perform worse than a bulk get. This is a scholarly discussion, as things will very likely vary from machine to machine and even between hotspot runs, depending on the calling code layout.
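For contrast with the bulk sketch shown earlier in the thread, the byte-at-a-time shape being argued against would look roughly like this (illustrative only, not code from the patch):
```java
// One loop condition and one single-byte read per byte of the table; hotspot
// typically compiles this less favorably than an 8-bytes-per-iteration bulk loop.
static int countBitsPerByte(int bitTableBytes, FST.BytesReader reader) throws IOException {
  int bitCount = 0;
  for (int i = 0; i < bitTableBytes; i++) {
    bitCount += Integer.bitCount(reader.readByte() & 0xFF);
  }
  return bitCount;
}
```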
[jira] [Commented] (SOLR-14210) Add replica state option for HealthCheckHandler
[ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078500#comment-17078500 ] Cassandra Targett commented on SOLR-14210:

bq. The "Class & Javadocs" column of the table already provides a link to the Javadocs of that class.

Ah OK, I didn't look at it in the context of the whole page - no need to duplicate it then IMO.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects
dweiss commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405684724

## File path: gradle/render-javadoc.gradle ##

Review comment: Thanks for looking into this, Tomoko. We may have to do something similar to what ES does since we want to be able to run javac, javadocs and tests against new JVMs (which gradle itself may not support yet). It's a different issue though and it can certainly wait.
[jira] [Updated] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness
[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-13132: -- Attachment: SOLR-13132_testSweep.patch Status: Open (was: Open) bq. ... In fact, otherAccs and resort, being likely to generate more DocSet lookups than refinement, make it all the more important that SKGSlotAcc respect cacheDf to control filterCache usage, no? Off the top of my head, I'm not certain that they will involve _more_ lookups than refinement, but it certainly seems that if it's useful for refinement, it would be useful for those cases as well. bq. ... I plan to work through them in the next day or two and address any questions as they come up. Sweet. I went ahead and started working on an "equivalence testing" patch to try and help definitively prove that using {{sweep: true}} or {{sweep: false}} produces the same results on otherwise equivalent (randomly generated) facet requests. I'm attaching that as {{SOLR-13132_testSweep.patch}}. The big missing piece here is a stubbed out "whitebox" test (see nocommits) to use the debug output to "prove" that sweep collection is actually being used when/if expected based on the {{sweep}} param (and effective processor). * As-is on master this test passes (because nothing looks for a {{sweep}} param, so it's just comparing queries with themselves). * When modifying this patch to use {{disable_sweep_collection}} it passed reliably from what I could tell. ...once your major changes to the impl are done, we'll probably want more changes to this test to help tickle "edge code paths" once we have a better handle on what they are. For instance: right now there is only one sweep-based {{relatedness()}} function per facet, but I'm pretty sure testing multiple sweep aggs in a single query, and mixing in some non-sweep functions, will be important for code coverage. > Improve JSON "terms" facet performance when sorted by relatedness > -- > > Key: SOLR-13132 > URL: https://issues.apache.org/jira/browse/SOLR-13132 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 7.4, master (9.0) >Reporter: Michael Gibney >Priority: Major > Attachments: SOLR-13132-with-cache-01.patch, > SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate > {{relatedness}} for every term. > The current implementation uses a standard uninverted approach (either > {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain > base docSet, and then uses that initial pass as a pre-filter for a > second-pass, inverted approach of fetching docSets for each relevant term > (i.e., {{count > minCount}}?) and calculating intersection size of those sets > with the domain base docSet. > Over high-cardinality fields, the overhead of per-term docSet creation and > set intersection operations increases request latency to the point where > relatedness sort may not be usable in practice (for my use case, even after > applying the patch for SOLR-13108, for a field with ~220k unique terms per > core, QTime for high-cardinality domain docSets were, e.g.: cardinality > 1816684=9000ms, cardinality 5032902=18000ms). > The attached patch brings the above example QTimes down to a manageable > ~300ms and ~250ms respectively. The approach calculates uninverted facet > counts over domain base, foreground, and background docSets in parallel in a > single pass. This allows us to take advantage of the efficiencies built into > the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids > the per-term docSet creation and set intersection overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
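For context, the kind of request the equivalence test randomizes looks roughly like this via SolrJ (the {{sweep}} flag is the parameter under discussion in this patch, not a released option; the field, queries, and collection name are invented):
{code}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// A terms facet sorted by relatedness(), with the patch's sweep flag toggled,
// so responses with sweep on and off can be compared for equivalence.
final class SkgFacetExample {
  static QueryResponse skgFacet(SolrClient client, boolean sweep) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.set("fore", "cat_s:electronics"); // foreground query
    q.set("back", "*:*");               // background query
    q.set("json.facet", "{ cats: { type: terms, field: cat_s, sort: { r: desc }, "
        + "sweep: " + sweep + ", facet: { r: \"relatedness($fore,$back)\" } } }");
    return client.query("collection1", q);
  }
}
{code}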
[jira] [Commented] (SOLR-14385) Add shard name and collection name to split histogram logs
[ https://issues.apache.org/jira/browse/SOLR-14385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078516#comment-17078516 ] Saatchi Bhalla commented on SOLR-14385: --- Hadn't realized that access to MDC variables is actually defined in the log4j2.xml files, which already include both collection and shard. Thanks for pointing that out Yonik, and for your help David. Megan and I synced up offline and realized that the missing data is specific to our Solr fork, so we can close out this JIRA. > Add shard name and collection name to split histogram logs > -- > > Key: SOLR-14385 > URL: https://issues.apache.org/jira/browse/SOLR-14385 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Saatchi Bhalla >Priority: Trivial > Time Spent: 1h > Remaining Estimate: 0h > > Using shard name from MDC to include in split histogram logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
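For anyone tracing the same thing: the values land in the log line through the {{%X{...}}} pattern keys in log4j2.xml, so the split code only needs the MDC populated. A hedged sketch (the key names are assumed to match the pattern; this is not Solr's actual MDCLoggingContext code):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Illustrative only: with "collection" and "shard" in the MDC, a plain
// log.info(...) picks them up via %X{collection} %X{shard} in the log pattern.
final class SplitLogging {
  private static final Logger log = LoggerFactory.getLogger(SplitLogging.class);

  static void logSplitHistogram(String collection, String shard) {
    MDC.put("collection", collection);
    MDC.put("shard", shard);
    try {
      log.info("recording shard split histogram"); // pattern appends MDC values
    } finally {
      MDC.remove("collection");
      MDC.remove("shard");
    }
  }
}
{code}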
[jira] [Updated] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import
[ https://issues.apache.org/jira/browse/LUCENE-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-9310: Summary: IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import (was: IntelliJ attempts to resolve provider property in jar manifest configuration and fails) > IntelliJ attempts to resolve provider property in jar manifest configuration > and fails during project import > > > Key: LUCENE-9310 > URL: https://issues.apache.org/jira/browse/LUCENE-9310 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > > It shouldn't be the case but it is. I don't know why. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails
Dawid Weiss created LUCENE-9310: --- Summary: IntelliJ attempts to resolve provider property in jar manifest configuration and fails Key: LUCENE-9310 URL: https://issues.apache.org/jira/browse/LUCENE-9310 Project: Lucene - Core Issue Type: Task Reporter: Dawid Weiss Assignee: Dawid Weiss It shouldn't be the case but it is. I don't know why. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection
Trey Grainger created SOLR-14396: Summary: TaggerRequestHandler Should Not Error on Empty Collection Key: SOLR-14396 URL: https://issues.apache.org/jira/browse/SOLR-14396 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Trey Grainger The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad Request) if used on a collection with no terms in the index. This probably made sense for the use cases for which it was originally written (in the OpenSextant project, before it was contributed to Solr) that focused on stand-alone document tagging, where the calling application expected there to always be an index. More and more use cases are emerging for using the TaggerRequestHandler in real-time for user queries, however. For example, real-time phrase matching and entity resolution in queries. In these cases, the data in the tagger collection may be dynamically updated, and at times, the collection may even be empty. While it's certainly possible for the 400 error to be handled client-side for empty collections, the incoming requests aren't really "bad" requests in my opinion, the index just doesn't have any data yet. Sending the same request subsequently once some documents are indexed would result in a success. I'm proposing we remove the exception for empty indexes and simply return no matched tags instead. If it's important for anyone to preserve the current behavior, we could add a parameter "errorOnEmptyCollection". Does anyone think preserving the error here is needed? What say you [~dsmiley]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
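A minimal sketch of the proposed behavior, assuming the handler bails out where it currently throws (class and method names here are invented; this is not a patch):
{code}
import java.util.Collections;
import org.apache.lucene.index.Terms;
import org.apache.solr.response.SolrQueryResponse;

// If the tag field has no indexed terms yet, answer with zero tags
// instead of a 400 Bad Request.
final class EmptyIndexGuard {
  static boolean respondEmptyIfNoTerms(Terms terms, SolrQueryResponse rsp) {
    if (terms != null) {
      return false; // field has data; continue with normal tagging
    }
    rsp.add("tagsCount", 0);
    rsp.add("tags", Collections.emptyList());
    return true;
  }
}
{code}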
[jira] [Commented] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import
[ https://issues.apache.org/jira/browse/LUCENE-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078527#comment-17078527 ] ASF subversion and git services commented on LUCENE-9310: - Commit dbb4be1ca93607c2555fe8b2b2cb3318be582edb in lucene-solr's branch refs/heads/master from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=dbb4be1 ] LUCENE-9310: workaround for IntelliJ gradle import > IntelliJ attempts to resolve provider property in jar manifest configuration > and fails during project import > > > Key: LUCENE-9310 > URL: https://issues.apache.org/jira/browse/LUCENE-9310 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > > It shouldn't be the case but it is. I don't know why. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078537#comment-17078537 ] Dawid Weiss commented on LUCENE-9286: - Now... this is a head scratcher. I get this on your test code (same fst): {code} FST construction (oversizingFactor=1.0) time = 1753 ms FST RAM = 54945816 B FST enum time = 323 ms {code} I'll get to the bottom of this difference, give me some time please. > FST arc.copyOf clones BitTables and this can lead to excessive memory use > - > > Key: LUCENE-9286 > URL: https://issues.apache.org/jira/browse/LUCENE-9286 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.5 >Reporter: Dawid Weiss >Assignee: Bruno Roustant >Priority: Major > Attachments: screen-[1].png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I see a dramatic increase in the amount of memory required for construction > of (arguably large) automata. It currently OOMs with 8GB of memory consumed > for bit tables. I am pretty sure this didn't require so much memory before > (the automaton is ~50MB after construction). > Something bad happened in between. Thoughts, [~broustant], [~sokolov]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078564#comment-17078564 ] Dawid Weiss commented on LUCENE-9286: - My repro was a test... and I ran with assertions enabled. Sorry for the confusion this might have caused! These assertions are *very* costly - can we tone them down just a little bit? > FST arc.copyOf clones BitTables and this can lead to excessive memory use > - > > Key: LUCENE-9286 > URL: https://issues.apache.org/jira/browse/LUCENE-9286 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.5 >Reporter: Dawid Weiss >Assignee: Bruno Roustant >Priority: Major > Attachments: screen-[1].png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I see a dramatic increase in the amount of memory required for construction > of (arguably large) automata. It currently OOMs with 8GB of memory consumed > for bit tables. I am pretty sure this didn't require so much memory before > (the automaton is ~50MB after construction). > Something bad happened in between. Thoughts, [~broustant], [~sokolov]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import
[ https://issues.apache.org/jira/browse/LUCENE-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9310. - Fix Version/s: master (9.0) Resolution: Fixed > IntelliJ attempts to resolve provider property in jar manifest configuration > and fails during project import > > > Key: LUCENE-9310 > URL: https://issues.apache.org/jira/browse/LUCENE-9310 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > > It shouldn't be the case but it is. I don't know why. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9311) IntelliJ import attempts to compile solr-ref-guide tools/ and fails
Dawid Weiss created LUCENE-9311: --- Summary: IntelliJ import attempts to compile solr-ref-guide tools/ and fails Key: LUCENE-9311 URL: https://issues.apache.org/jira/browse/LUCENE-9311 Project: Lucene - Core Issue Type: Sub-task Reporter: Dawid Weiss This used to work but now doesn't. Don't know why (we exclude customized ant tasks but IntelliJ doesn't seem to pick this up). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness
[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078567#comment-17078567 ] Michael Gibney commented on SOLR-13132: --- This is great, thanks! It's taking me a little longer than expected to sharpen focus around exactly how to do this; but it's going steadily at this point, and I think it will address all your concerns mentioned in earlier comments. More soon ... > Improve JSON "terms" facet performance when sorted by relatedness > -- > > Key: SOLR-13132 > URL: https://issues.apache.org/jira/browse/SOLR-13132 > Project: Solr > Issue Type: Improvement > Components: Facet Module >Affects Versions: 7.4, master (9.0) >Reporter: Michael Gibney >Priority: Major > Attachments: SOLR-13132-with-cache-01.patch, > SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate > {{relatedness}} for every term. > The current implementation uses a standard uninverted approach (either > {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain > base docSet, and then uses that initial pass as a pre-filter for a > second-pass, inverted approach of fetching docSets for each relevant term > (i.e., {{count > minCount}}?) and calculating intersection size of those sets > with the domain base docSet. > Over high-cardinality fields, the overhead of per-term docSet creation and > set intersection operations increases request latency to the point where > relatedness sort may not be usable in practice (for my use case, even after > applying the patch for SOLR-13108, for a field with ~220k unique terms per > core, QTime for high-cardinality domain docSets were, e.g.: cardinality > 1816684=9000ms, cardinality 5032902=18000ms). > The attached patch brings the above example QTimes down to a manageable > ~300ms and ~250ms respectively. The approach calculates uninverted facet > counts over domain base, foreground, and background docSets in parallel in a > single pass. This allows us to take advantage of the efficiencies built into > the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids > the per-term docSet creation and set intersection overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection
[ https://issues.apache.org/jira/browse/SOLR-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078602#comment-17078602 ] David Smiley commented on SOLR-14396: - Woah; I think it's bad this throws an exception on an empty index -- no need to convince me of that. I can see why I added this to begin with, however -- imagine you typo a dynamic field or something like that. Still, that scenario is not specific to the tagger; doing plain old search on a field with no data does not and should not result in warnings or errors. We don't need a back-compat parameter to toggle this. It'd be useful during Solr development / ad-hoc queries if Solr had a response header notice that informed you that certain fields used by the request have no data. That'd be useful Solr-wide, not just for this handler specifically, but it would also be a pain to do, as I think about it. Hmm. > TaggerRequestHandler Should Not Error on Empty Collection > - > > Key: SOLR-14396 > URL: https://issues.apache.org/jira/browse/SOLR-14396 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Trey Grainger >Priority: Minor > > The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad > Request) if used on a collection with no terms in the index. This probably > made sense for the use cases for which it was originally written (in the > OpenSextant project, before it was contributed to Solr) that focused on > stand-alone document tagging, where the calling application expected there to > always be an index. > More and more use cases are emerging for using the TaggerRequestHandler in > real-time for user queries, however. For example, real-time phrase matching > and entity resolution in queries. In these cases, the data in the tagger > collection may be dynamically updated, and at times, the collection may even > be empty. > While it's certainly possible for the 400 error to be handled client-side for > empty collections, the incoming requests aren't really "bad" requests in my > opinion, the index just doesn't have any data yet. Sending the same request > subsequently once some documents are indexed would result in a success. > I'm proposing we remove the exception for empty indexes and simply return no > matched tags instead. > If it's important for anyone to preserve the current behavior, we could add a > parameter "errorOnEmptyCollection". Does anyone think preserving the error > here is needed? What say you [~dsmiley]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection
[ https://issues.apache.org/jira/browse/SOLR-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078620#comment-17078620 ] Trey Grainger commented on SOLR-14396: -- Cool. I'll take a stab at changing the behavior. If you have a suggestion for something specific to return in the header in this case (a warning of some sort), I'm happy to do that while I'm in there. > TaggerRequestHandler Should Not Error on Empty Collection > - > > Key: SOLR-14396 > URL: https://issues.apache.org/jira/browse/SOLR-14396 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Trey Grainger >Priority: Minor > > The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad > Request) if used on a collection with no terms in the index. This probably > made sense for the use cases for which it was originally written (in the > OpenSextant project, before it was contributed to Solr) that focused on > stand-alone document tagging, where the calling application expected there to > always be an index. > More and more use cases are emerging for using the TaggerRequestHandler in > real-time for user queries, however. For example, real-time phrase matching > and entity resolution in queries. In these cases, the data in the tagger > collection may be dynamically updated, and at times, the collection may even > be empty. > While it's certainly possible for the 400 error to be handled client-side for > empty collections, the incoming requests aren't really "bad" requests in my > opinion, the index just doesn't have any data yet. Sending the same request > subsequently once some documents are indexed would result in a success. > I'm proposing we remove the exception for empty indexes and simply return no > matched tags instead. > If it's important for anyone to preserve the current behavior, we could add a > parameter "errorOnEmptyCollection". Does anyone think preserving the error > here is needed? What say you [~dsmiley]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9271) Make BufferedIndexInput work on a ByteBuffer
[ https://issues.apache.org/jira/browse/LUCENE-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078627#comment-17078627 ] Simon Willnauer commented on LUCENE-9271: - We ran into a test failure [here|https://github.com/apache/lucene-solr/pull/1397#issuecomment-610930218]: {noformat} org.apache.lucene.index.TestIndexManyDocuments > test suite's output saved to /home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexManyDocuments.txt, copi\ ed below: 1> CheckIndex failed 1> 0.00% total deletions; 10976 documents; 0 deleteions 1> Segments file=segments_1 numSegments=4 version=9.0.0 id=cekay2d5izae12ssuqikoqgoc 1> 1 of 4: name=_d maxDoc=8700 1> version=9.0.0 1> id=cekay2d5izae12ssuqikoqgob 1> codec=Asserting(Lucene84) 1> compound=false 1> numFiles=11 1> size (MB)=0.003 1> diagnostics = {os.version=5.5.6-arch1-1, java.vendor=Oracle Corporation, source=merge, os.arch=amd64, mergeFactor=10, java.runtime.version=11.0.6+8-LTS, os=Linux, timestamp=1586347074798, lucene.ve\ rsion=9.0.0, java.vm.version=11.0.6+8-LTS, java.version=11.0.6, mergeMaxNumSegments=-1} 1> no deletions 1> test: open reader.OK [took 0.001 sec] 1> test: check integrity.OK [took 0.000 sec] 1> test: check live docs.OK [took 0.000 sec] 1> test: field infos.OK [1 fields] [took 0.000 sec] 1> test: field norms.OK [1 fields] [took 0.001 sec] 1> test: terms, freq, prox...ERROR: java.lang.AssertionError: buffer=java.nio.HeapByteBuffer[pos=0 lim=0 cap=0] bufferSize=1024 buffer.length=0 1> java.lang.AssertionError: buffer=java.nio.HeapByteBuffer[pos=0 lim=0 cap=0] bufferSize=1024 buffer.length=0 1>at org.apache.lucene.store.BufferedIndexInput.setBufferSize(BufferedIndexInput.java:78) 1>at org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:241) 1>at org.apache.lucene.codecs.MultiLevelSkipListReader.init(MultiLevelSkipListReader.java:208) 1>at org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init(Lucene84SkipReader.java:103) 1>at org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$EverythingEnum.advance(Lucene84PostingsReader.java:837) 1>at org.apache.lucene.index.FilterLeafReader$FilterPostingsEnum.advance(FilterLeafReader.java:271) 1>at org.apache.lucene.index.AssertingLeafReader$AssertingPostingsEnum.advance(AssertingLeafReader.java:377) 1>at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1426) 1>at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1867) 1>at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:720) 1>at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301) 1>at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:286) 1>at org.apache.lucene.store.BaseDirectoryWrapper.close(BaseDirectoryWrapper.java:45) 1>at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) 1>at org.apache.lucene.util.IOUtils.close(IOUtils.java:77) 1>at org.apache.lucene.index.TestIndexManyDocuments.test(TestIndexManyDocuments.java:69) 1>at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 1>at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 1>at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 1>at java.base/java.lang.reflect.Method.invoke(Method.java:566) 1>at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) 1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) 1>at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) 1>at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) 1>at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) 1>at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) 1>at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) 1>at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) 1>at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) 1>at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) 1>at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370) 1>at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819) 1>at com.carrotse
[GitHub] [lucene-solr] s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-611147359 @mikemccand did you run any benchmarks on this change yet? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-11384) add support for distributed graph query
[ https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078637#comment-17078637 ] sambasivarao giddaluri commented on SOLR-11384: --- [~kwatters] Is it possible to share the Kafka approach used to overcome the graph query parser limitations? Also, I can see that graph traversal with streaming loses relevancy, as we have to pass the sort field; if we have to add multiple search conditions, pagination is not supported. It would be really good to have the distributed search functionality on the graph query parser. Can you share the patch so I can test it to see its performance? > add support for distributed graph query > --- > > Key: SOLR-11384 > URL: https://issues.apache.org/jira/browse/SOLR-11384 > Project: Solr > Issue Type: Improvement >Reporter: Kevin Watters >Priority: Minor > > Creating this ticket to track the work that I've done on the distributed > graph traversal support in solr. > Current GraphQuery will only work on a single core, which introduces some > limits on where it can be used and also complexities if you want to scale it. > I believe there's a strong desire to support a fully distributed method of > doing the Graph Query. I'm working on a patch, it's not complete yet, but if > anyone would like to have a look at the approach and implementation, I > welcome much feedback. > The flow for the distributed graph query is almost exactly the same as the > normal graph query. The only difference is how it discovers the "frontier > query" at each level of the traversal. > When a distributed graph query request comes in, each shard begins by running > the root query, to know where to start on its shard. Each participating > shard then discovers its edges for the next hop. Those edges are then > broadcast to all other participating shards. The shard then receives all the > parts of the frontier query, assembles it, and executes it. > This process continues on each shard until there are no new edges left, or > the maxDepth of the traversal has finished. > The approach is to introduce a FrontierBroker that resides as a singleton on > each one of the solr nodes in the cluster. When a graph query is created, it > can do a getInstance() on it so it can listen on the frontier parts coming in. > Initially, I was using an external Kafka broker to handle this, and it did > work pretty well. The new approach is migrating the FrontierBroker to be a > request handler in Solr, and potentially to use the SolrJ client to publish > the edges to each node in the cluster. > There are a few outstanding design questions, first being, how do we know > what the list of shards are that are participating in the current query > request? Is that easy info to get at? > Second, currently, we are serializing a query object between the shards, > perhaps we should consider a slightly different abstraction, and serialize > lists of "edge" objects between the nodes. The point of this would be to > batch the exploration/traversal of current frontier to help avoid large > bursts of memory being required. > Third, what sort of caching strategy should be introduced for the frontier > queries, if any? And if we do some caching there, how/when should the > entries be expired and auto-warmed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-11384) add support for distributed graph query
[ https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078637#comment-17078637 ] sambasivarao giddaluri edited comment on SOLR-11384 at 4/8/20, 7:32 PM: [~kwatters] Is it possible to share the Kafka approach used to overcome the graph query parser limitations? Also, I can see that graph traversal with streaming loses relevancy, as we have to pass the sort field; and if we have to add multiple search conditions, pagination is not supported. It would be really good to have the distributed search functionality on the graph query parser. Can you share the patch details so I can test locally to see its performance? was (Author: sambasiva12): [~kwatters] is it possible to share kafka approach to over come graph query parser and also i can see with graph traversal with streaming it looses relevancy as we have to pass the sort field if we have to add multiple search conditions and pagination is not supported .. it would be really good to have the distributed search functionality on the graph query parser and can you share the patch i can test to see the performance of it. > add support for distributed graph query > --- > > Key: SOLR-11384 > URL: https://issues.apache.org/jira/browse/SOLR-11384 > Project: Solr > Issue Type: Improvement >Reporter: Kevin Watters >Priority: Minor > > Creating this ticket to track the work that I've done on the distributed > graph traversal support in solr. > Current GraphQuery will only work on a single core, which introduces some > limits on where it can be used and also complexities if you want to scale it. > I believe there's a strong desire to support a fully distributed method of > doing the Graph Query. I'm working on a patch, it's not complete yet, but if > anyone would like to have a look at the approach and implementation, I > welcome much feedback. > The flow for the distributed graph query is almost exactly the same as the > normal graph query. The only difference is how it discovers the "frontier > query" at each level of the traversal. > When a distributed graph query request comes in, each shard begins by running > the root query, to know where to start on its shard. Each participating > shard then discovers its edges for the next hop. Those edges are then > broadcast to all other participating shards. The shard then receives all the > parts of the frontier query, assembles it, and executes it. > This process continues on each shard until there are no new edges left, or > the maxDepth of the traversal has finished. > The approach is to introduce a FrontierBroker that resides as a singleton on > each one of the solr nodes in the cluster. When a graph query is created, it > can do a getInstance() on it so it can listen on the frontier parts coming in. > Initially, I was using an external Kafka broker to handle this, and it did > work pretty well. The new approach is migrating the FrontierBroker to be a > request handler in Solr, and potentially to use the SolrJ client to publish > the edges to each node in the cluster. > There are a few outstanding design questions, first being, how do we know > what the list of shards are that are participating in the current query > request? Is that easy info to get at? > Second, currently, we are serializing a query object between the shards, > perhaps we should consider a slightly different abstraction, and serialize > lists of "edge" objects between the nodes. The point of this would be to > batch the exploration/traversal of current frontier to help avoid large > bursts of memory being required. > Third, what sort of caching strategy should be introduced for the frontier > queries, if any? And if we do some caching there, how/when should the > entries be expired and auto-warmed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
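For readers new to the ticket, the hop-by-hop loop the description walks through can be reduced to an abstract sketch (all types here are invented for illustration; the real patch exchanges queries and edge lists, not in-memory node sets):
{code}
import java.util.HashSet;
import java.util.Set;

// Each hop: expand edges locally, exchange them with all participating shards,
// then use the merged, deduplicated set as the next frontier. The loop stops
// when no new edges remain or maxDepth is reached.
final class DistributedTraversalSketch {
  interface EdgeSource<T> { Set<T> expand(Set<T> frontier); }
  interface FrontierBroker<T> { Set<T> exchange(Set<T> localEdges); }

  static <T> Set<T> traverse(Set<T> roots, EdgeSource<T> shard,
                             FrontierBroker<T> broker, int maxDepth) {
    Set<T> visited = new HashSet<>(roots);
    Set<T> frontier = new HashSet<>(roots);
    for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
      Set<T> local = shard.expand(frontier);   // edges discovered on this shard
      Set<T> merged = broker.exchange(local);  // broadcast + receive all shards' edges
      merged.removeAll(visited);               // keep only newly reached nodes
      visited.addAll(merged);
      frontier = merged;                       // becomes the next "frontier query"
    }
    return visited;
  }
}
{code}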
[GitHub] [lucene-solr] s1monw opened a new pull request #1418: LUCENE-9309: Wait for #addIndexes merges when aborting merges
s1monw opened a new pull request #1418: LUCENE-9309: Wait for #addIndexes merges when aborting merges URL: https://github.com/apache/lucene-solr/pull/1418 The SegmentMerger usage in IW#addIndexes(CodecReader...) might make changes to the Directory while the IW tries to clean up files on rollback. This causes issues like FileNotFoundExceptions when the IndexFileDeleter tries to remove temp files. This change adds a waiting mechanism to the abortMerges method that, in addition to the running merges, also waits for merges in addIndexes(CodecReader...). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
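Roughly, the mechanism being added looks like this (a heavily simplified sketch with assumed names, not code from the PR):
{code}
// Sketch: rollback's abortMerges must also wait for merges running on behalf
// of addIndexes(CodecReader...), not just normally registered merges.
final class MergeGate {
  private int runningMerges;            // merges registered the usual way
  private int runningAddIndexesMerges;  // merges inside addIndexes(CodecReader...)

  synchronized void mergeStarted() { runningMerges++; }
  synchronized void mergeFinished() { runningMerges--; notifyAll(); }

  synchronized void addIndexesMergeStarted() { runningAddIndexesMerges++; }
  synchronized void addIndexesMergeFinished() { runningAddIndexesMerges--; notifyAll(); }

  synchronized void abortMerges() throws InterruptedException {
    // Before this change only runningMerges was awaited, so an addIndexes merge
    // could still touch temp files while rollback deleted index files.
    while (runningMerges > 0 || runningAddIndexesMerges > 0) {
      wait();
    }
  }
}
{code}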
[jira] [Commented] (LUCENE-9309) IW#addIndices(CodecReader) might delete files concurrently to IW#rollback
[ https://issues.apache.org/jira/browse/LUCENE-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078675#comment-17078675 ] Simon Willnauer commented on LUCENE-9309: - [~mikemccand] can you take a look at the PR for this issue > IW#addIndices(CodecReader) might delete files concurrently to IW#rollback > - > > Key: LUCENE-9309 > URL: https://issues.apache.org/jira/browse/LUCENE-9309 > Project: Lucene - Core > Issue Type: Bug >Reporter: Simon Willnauer >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > During work on LUCENE-9304 [~mikemccand] ran into a failure: > {noformat} > org.apache.lucene.index.TestAddIndexes > test suite's output saved to > /home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAddIndexes.txt, > copied below: >> java.nio.file.NoSuchFileException: > _gt_Lucene85FieldsIndex-doc_ids_6u.tmp >> at > __randomizedtesting.SeedInfo.seed([4760FA81FBD4B2CE:A147156E5F7BD9B0]:0) >> at > org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148) >> at > org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607) >> at > org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) >> at > org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696) >> at > org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690) >> at > org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:449) >> at > org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2334) >> at > org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2275) >> at > org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2268) >> at > org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:974) > 2> NOTE: reproduce with: ant test -Dtestcase=TestAddIndexes > -Dtests.method=testAddIndexesWithRollback -Dtests.seed=4760FA81FBD4B2CE > -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fr-GP -Dtests.t\ > imezone=Asia/Tbilisi -Dtests.asserts=true -Dtests.file.encoding=UTF-8 > 2> NOTE: test params are: codec=Asserting(Lucene84): > {c=PostingsFormat(name=LuceneFixedGap), > id=PostingsFormat(name=LuceneFixedGap), > f1=PostingsFormat(name=LuceneFixedGap), f2=BlockTreeOrds(blocksize=128)\ > , version=BlockTreeOrds(blocksize=128), content=FST50}, > docValues:{dv=DocValuesFormat(name=Lucene80), > soft_delete=DocValuesFormat(name=Lucene80), > doc=DocValuesFormat(name=Lucene80), id=DocValuesFormat(name=\ > Asserting), content=DocValuesFormat(name=Asserting), > doc2d=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=982, > maxMBSortInHeap=5.837219998050092, > sim=Asserting(org.apache.lucene.search.similarities.As\ > sertingSimilarity@6ce38471), locale=fr-GP, timezone=Asia/Tbilisi > {noformat} > While this unfortunately doesn't reproduce it's likely a bug that exists for > quite some time but never showed up until LUCENE-9147 which uses a temporary > output. That's fine but with IW#addIndices(CodecReader...) not registering > the merge it does in the IW we never wait for the merge to finish while > rollback and if that merge finishes concurrently it will also remove these > .tmp files. > There are many ways to fix this and I can work on a patch, but hey do we > really need to be able to add indices while we index and do that on an open > and live IW or can it be a tool on top of it? 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] s1monw commented on a change in pull request #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates
s1monw commented on a change in pull request #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates URL: https://github.com/apache/lucene-solr/pull/1389#discussion_r405796843 ## File path: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java ## @@ -176,8 +176,11 @@ void addBinaryUpdate(BinaryDocValuesUpdate update, int docIDUpto) { } void clearDeleteTerms() { -deleteTerms.clear(); numTermDeletes.set(0); +deleteTerms.forEach((term, docIDUpto) -> { Review comment: Instead of counting this here on clear, can we use a second counter for the deleteTerms next to `bytesUsed`? This would be great. It doesn't need to be thread safe IMO This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
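Sketched out, that suggestion amounts to the following (field names assumed, heavily simplified from the real BufferedUpdates):
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Track the deleteTerms contribution in its own counter as terms are added,
// so clearDeleteTerms() can adjust bytesUsed without walking the map.
final class DeleteTermsBytesSketch {
  private final Map<String, Integer> deleteTerms = new HashMap<>();
  private final AtomicLong bytesUsed = new AtomicLong();
  private long deleteTermsBytesUsed; // single-threaded use, per the comment above

  void addDeleteTerm(String term, int docIDUpto, long termBytes) {
    deleteTerms.put(term, docIDUpto);
    bytesUsed.addAndGet(termBytes);
    deleteTermsBytesUsed += termBytes;
  }

  void clearDeleteTerms() {
    deleteTerms.clear();
    bytesUsed.addAndGet(-deleteTermsBytesUsed); // O(1) instead of a forEach recount
    deleteTermsBytesUsed = 0;
  }
}
{code}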
[GitHub] [lucene-solr] jpountz commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
jpountz commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-611180446 @mikemccand I think it's related to the change I merged yesterday indeed. I fixed it shortly after merging, so if you merge master back, this should address the failure. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9271) Make BufferedIndexInput work on a ByteBuffer
[ https://issues.apache.org/jira/browse/LUCENE-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078688#comment-17078688 ] Adrien Grand commented on LUCENE-9271: -- Sorry about that, it should have been fixed in 3363e1aa4897a5eca9f390a9f22cab5686305ef7 yesterday. > Make BufferedIndexInput work on a ByteBuffer > > > Key: LUCENE-9271 > URL: https://issues.apache.org/jira/browse/LUCENE-9271 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > Currently {{BufferedIndexInput}} works on a {{byte[]}} but its main > implementation, in NIOFSDirectory, has to implement a hack to maintain a > ByteBuffer view of it that it can use in calls to the FileChannel API. Maybe > we should instead make {{BufferedIndexInput}} work directly on a > {{ByteBuffer}}? This would also help reuse the existing > {{ByteBuffer#get(|Short|Int|long)}} methods instead of duplicating them from > {{DataInput}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
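The duplication the issue refers to, in miniature (a self-contained JDK demo; the byte order here is illustrative, since the on-disk format dictates the real one):
{code}
import java.nio.ByteBuffer;

// With a ByteBuffer-backed buffer, multi-byte reads can delegate to the JDK
// instead of re-implementing the byte-shifting logic from DataInput.
public class ByteBufferReadDemo {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(12);
    buf.putInt(42).putLong(7L);
    buf.flip();
    int i = buf.getInt();    // instead of composing four (b & 0xFF) shifts
    long l = buf.getLong();  // instead of composing eight
    System.out.println(i + " " + l); // prints: 42 7
  }
}
{code}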
[jira] [Commented] (LUCENE-9271) Make BufferedIndexInput work on a ByteBuffer
[ https://issues.apache.org/jira/browse/LUCENE-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078690#comment-17078690 ] Simon Willnauer commented on LUCENE-9271: - thanks [~jpountz] > Make BufferedIndexInput work on a ByteBuffer > > > Key: LUCENE-9271 > URL: https://issues.apache.org/jira/browse/LUCENE-9271 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > Currently {{BufferedIndexInput}} works on a {{byte[]}} but its main > implementation, in NIOFSDirectory, has to implement a hack to maintain a > ByteBuffer view of it that it can use in calls to the FileChannel API. Maybe > we should instead make {{BufferedIndexInput}} work directly on a > {{ByteBuffer}}? This would also help reuse the existing > {{ByteBuffer#get(|Short|Int|long)}} methods instead of duplicating them from > {{DataInput}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly
s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-611181862 @mikemccand I merged master into this branch This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-8773) Make blob store usage intuitive and robust
[ https://issues.apache.org/jira/browse/SOLR-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078701#comment-17078701 ] David Smiley commented on SOLR-8773: [~noble.paul] I suppose all this might be Won't-Fix given the existing so-called Blob Store is deprecated? However, some of these might be re-imagined in the context of its replacement -- the "filestore" -- and may still make sense. > Make blob store usage intuitive and robust > -- > > Key: SOLR-8773 > URL: https://issues.apache.org/jira/browse/SOLR-8773 > Project: Solr > Issue Type: Improvement >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > > blob store is provided as a feature and the only current use is to load jars > from there. Ideally, all resources should be loadable from blob store. But, > it is not yet ready for prime time because it is just a simple CRUD handler > for binary files. We should provide nice wrappers (Java APIs as well as HTTP > APIs) which mask the complexity of the underlying storage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078703#comment-17078703 ] Bruno Roustant commented on LUCENE-9286: Ah ! :) I see the faulty assertion. I'll remove it now because it's not so useful. > FST arc.copyOf clones BitTables and this can lead to excessive memory use > - > > Key: LUCENE-9286 > URL: https://issues.apache.org/jira/browse/LUCENE-9286 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.5 >Reporter: Dawid Weiss >Assignee: Bruno Roustant >Priority: Major > Attachments: screen-[1].png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I see a dramatic increase in the amount of memory required for construction > of (arguably large) automata. It currently OOMs with 8GB of memory consumed > for bit tables. I am pretty sure this didn't require so much memory before > (the automaton is ~50MB after construction). > Something bad happened in between. Thoughts, [~broustant], [~sokolov]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-12005) Solr should have the option of logging all jars loaded
[ https://issues.apache.org/jira/browse/SOLR-12005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078704#comment-17078704 ] David Smiley commented on SOLR-12005: - Instead of looking for logging as the tool for this need, we might instead look for Solr administrative handlers to expose this information. > Solr should have the option of logging all jars loaded > -- > > Key: SOLR-12005 > URL: https://issues.apache.org/jira/browse/SOLR-12005 > Project: Solr > Issue Type: Improvement > Components: logging >Reporter: Shawn Heisey >Priority: Major > > Solr used to explicitly log the filename of every jar it loaded. It seems > that the effort to reduce the verbosity of the logs has changed this, now it > just logs the *count* of jars loaded and the paths where they were loaded > from. Here's a log line where Solr is reading from ${solr.solr.home}/lib: > {code} > 2018-02-01 17:43:20.043 INFO (main) [ ] o.a.s.c.SolrResourceLoader [null] > Added 8 libs to classloader, from paths: [/index/solr6/data/lib] > {code} > When trying to help somebody with classloader issues, it's more difficult to > help when the list of jars loaded isn't in the log. > I would like the more verbose logging to be enabled by default, but I > understand that many people would not want that, so I propose this: > * Enable verbose logging for ${solr.solr.home}/lib by default. > * Disable verbose logging for each core by default. Allow solrconfig.xml to > enable it. > * Optionally allow solr.xml to configure verbose logging at the global level. > ** This setting would affect both global and per-core jar loading. Each > solrconfig.xml could override. > Rationale: The contents of ${solr.solr.home}/lib are loaded precisely once, > and this location doesn't even exist unless a user creates it. An > out-of-the-box config would not have verbose logs from jar loading. > The solr home lib location is my preferred way of loading custom jars, > because they get loaded only once, no matter how many cores you have. Jars > added to this location would add lines to the log, but it would not be logged > for every core. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
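Whether surfaced in logs or via an admin handler, the underlying enumeration is the same; a hedged sketch of the per-jar logging being proposed (not SolrResourceLoader's actual code):
{code}
import java.net.URL;
import java.net.URLClassLoader;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// One log line per jar, rather than only the count and the parent paths.
final class JarLoggingSketch {
  private static final Logger log = LoggerFactory.getLogger(JarLoggingSketch.class);

  static void logAddedJars(URLClassLoader loader) {
    for (URL url : loader.getURLs()) {
      log.info("Added '{}' to classloader", url);
    }
  }
}
{code}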
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078707#comment-17078707 ] Bruno Roustant commented on LUCENE-9286: To complete the perf benchmark, I ran luceneutil on both wikimedium500k and wikimediumall. I see a perf slowdown of 4%-5% in PKLookup with FST off-heap (and only on PKLookup). Given that this direct-addressing node improved PKLookup perf by at least twice this slowdown when it was introduced, and given that this fix greatly improves FSTEnum traversal speed and memory use for large automata, I consider this slowdown ok. I'm going to merge the PR tomorrow. > FST arc.copyOf clones BitTables and this can lead to excessive memory use > - > > Key: LUCENE-9286 > URL: https://issues.apache.org/jira/browse/LUCENE-9286 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.5 >Reporter: Dawid Weiss >Assignee: Bruno Roustant >Priority: Major > Attachments: screen-[1].png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I see a dramatic increase in the amount of memory required for construction > of (arguably large) automata. It currently OOMs with 8GB of memory consumed > for bit tables. I am pretty sure this didn't require so much memory before > (the automaton is ~50MB after construction). > Something bad happened in between. Thoughts, [~broustant], [~sokolov]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-4048) Move getLines out of ResourceLoader and require Charset
[ https://issues.apache.org/jira/browse/LUCENE-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley closed LUCENE-4048. > Move getLines out of ResourceLoader and require Charset > --- > > Key: LUCENE-4048 > URL: https://issues.apache.org/jira/browse/LUCENE-4048 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Chris Male >Priority: Major > Fix For: 4.0-BETA > > Attachments: LUCENE-4048.patch, LUCENE-4048.patch > > > {{ResourceLoader.getLines()}} is only used by analysis factories. > {{SolrResourceLoader}}'s implementation does the job well and it's unlikely > that another {{ResourceLoader}} implementation would handle it differently. > We should extract the {{getLines()}} method out to > {{AbstractAnalysisFactory}} so it can be used by the factories. Additionally > we shouldn't assume the files are encoded in UTF-8, instead we should allow a > Charset to be specified. > This would take us one step closer to reducing the {{ResourceLoader}} > interface just to what it says, a loader of resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4048) Move getLines out of ResourceLoader and require Charset
[ https://issues.apache.org/jira/browse/LUCENE-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley resolved LUCENE-4048. -- Fix Version/s: 4.0-BETA Resolution: Duplicate > Move getLines out of ResourceLoader and require Charset > --- > > Key: LUCENE-4048 > URL: https://issues.apache.org/jira/browse/LUCENE-4048 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Chris Male >Priority: Major > Fix For: 4.0-BETA > > Attachments: LUCENE-4048.patch, LUCENE-4048.patch > > > {{ResourceLoader.getLines()}} is only used by analysis factories. > {{SolrResourceLoader}}'s implementation does the job well and it's unlikely > that another {{ResourceLoader}} implementation would handle it differently. > We should extract the {{getLines()}} method out to > {{AbstractAnalysisFactory}} so it can be used by the factories. Additionally > we shouldn't assume the files are encoded in UTF-8, instead we should allow a > Charset to be specified. > This would take us one step closer to reducing the {{ResourceLoader}} > interface just to what it says, a loader of resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14397) Vector Search in Solr
Trey Grainger created SOLR-14397: Summary: Vector Search in Solr Key: SOLR-14397 URL: https://issues.apache.org/jira/browse/SOLR-14397 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Trey Grainger Search engines have traditionally relied upon token-based matching (typically keywords) on an inverted index, plus relevance ranking based upon keyword occurrence statistics. This can be viewed as a “sparse vector” match (where each term is a one-hot encoded dimension in the vector), since only a few keywords out of all possible keywords are considered in each query. With the introduction of deep-learning-based transformers over the last few years, however, the state of the art in relevance has moved to ranking models based upon dense vectors that encode a latent, semantic understanding of both language constructs and the underlying domain upon which the model was trained. These dense vectors are also referred to as “embeddings”. An example of this kind of embedding would be taking the phrase “chief executive officer of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]. Other similar phrases should encode to vectors with very similar numbers, so we may expect a query like “CEO of a technology org” to generate a vector like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity calculation between these vectors, we would expect a number closer to 1.0, whereas a very unrelated text blurb would generate a much smaller cosine similarity. This is a proposal for how we should implement these vector search capabilities in Solr. h1. Search Process Overview: In order to implement dense vector search, the following process is typically followed: h2. Offline: An encoder is built. An encoder can take in text (a query, a sentence, a paragraph, a document, etc.) and return a dense vector representing that document in a rich semantic space. The semantic space is learned from training on textual data (usually, though other sources work, too), typically from the domain of the search engine. h2. Document Ingestion: When documents are processed, they are passed to the encoder, and the dense vector(s) returned are stored as fields on the document. There could be one or more vectors per document, as the granularity of the vectors could be per document, per field, per paragraph, per sentence, or even per phrase or per term. h2. Query Time: *Encoding:* The query is translated to a dense vector by passing it to the encoder. *Quantization:* The query is quantized. Quantization is the process of taking a vector with many values and turning it into “terms” in a vector space that approximates the full vector space of the dense vectors. *ANN Matching:* A query on the quantized vector tokens is executed as an ANN (approximate nearest neighbor) search. This allows finding most of the best matching documents (typically up to 95%) with a traditional and efficient lookup against the inverted index. *(optional) ANN Ranking:* ranking may be performed based upon the matched quantized tokens to get a rough, initial ranking of documents based upon the similarity of the query and document vectors. This allows the next step (re-ranking) to be performed on a smaller subset of documents. 
*Re-Ranking:* Once the initial matching (and optionally ANN ranking) is performed, a similarity calculation (cosine, dot product, or any number of other calculations) is typically performed between the full (non-quantized) dense vectors for the query and those in the document. This re-ranking is typically performed on only the top-N results for performance reasons. *Return Results:* As with any search, the final step is typically to return the results in relevance-ranked order. In this case, that would be sorted by the re-ranking similarity score (i.e. “cosine descending”). -- *Variant:* For small document sets, it may be preferable to rank all documents and skip steps 2, 3, and 4. This is because ANN matching typically reduces recall (the current state of the art is around 95% recall), so it can be beneficial to rank all documents if performance is not a concern. In this case, step 5 is performed on the full doc set and would just be considered “ranking” instead of “re-ranking”. h1. Proposed Implementation in Solr: h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity * h3. Dense Vector Field: We will add a new dense vector field type in Solr. This field type would be a compressed encoding of a dense vector into a BinaryDocValues Field. There are other ways to do it, but this is almost certain to be the most efficient. Ideally this field is multi-valued. If it is single-valued, then we are either limited to only document-level vectors, or otherwise we have to create many vecto
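To make the example numbers above concrete, here is a minimal self-contained sketch (illustrative only, not proposed Solr code) computing the cosine similarity of the two example embeddings quoted in this proposal:

{code:java}
public class CosineDemo {
  // cosine(a, b) = dot(a, b) / (|a| * |b|)
  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    float[] doc   = {0.03f, 1.7f, 9.12f, 0f, 0.3f};  // "chief executive officer of the tech company"
    float[] query = {0.1f, 1.9f, 8.9f, 0.1f, 0.4f};  // "CEO of a technology org"
    System.out.println(cosine(doc, query)); // ~0.9995, close to 1.0 as described above
  }
}
{code}

Run over full (non-quantized) vectors, this is exactly the kind of calculation the re-ranking step would apply to the top-N candidates.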
[GitHub] [lucene-solr] madrob commented on issue #1412: Add MinimalSolrTest for scale testing
madrob commented on issue #1412: Add MinimalSolrTest for scale testing URL: https://github.com/apache/lucene-solr/pull/1412#issuecomment-611231598 Would this be better in test-framework as a stub? My goal here is to always have something that I can run against master without needing to recreate this class every time I update my branch or constantly rebase a patch or whatever. I don't think this makes sense as a JMH bench. We could add a trivial assert that the test run times have a total ordering (compare them all in `@AfterClass`)?
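One possible reading of that last suggestion, sketched loosely in JUnit 4 style (this is not code from the PR; `RUN_TIMES_MS`, `record`, and the assumption that larger-scale runs take at least as long are all invented for illustration):

```java
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;
import org.junit.AfterClass;

public class MinimalSolrTimingSketch {
  // One entry per scale run, appended in increasing-workload order.
  static final List<Long> RUN_TIMES_MS = new ArrayList<>();

  static void record(long elapsedMs) { // call at the end of each timed run
    RUN_TIMES_MS.add(elapsedMs);
  }

  @AfterClass
  public static void assertRunTimesTotallyOrdered() {
    // The "trivial assert": timings of increasingly large runs should be
    // nondecreasing, i.e. they admit the expected total ordering.
    for (int i = 1; i < RUN_TIMES_MS.size(); i++) {
      assertTrue("run " + i + " should not be faster than run " + (i - 1),
          RUN_TIMES_MS.get(i - 1) <= RUN_TIMES_MS.get(i));
    }
  }
}
```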
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078792#comment-17078792 ] Dawid Weiss commented on LUCENE-9286: - +1! > FST arc.copyOf clones BitTables and this can lead to excessive memory use > - > > Key: LUCENE-9286 > URL: https://issues.apache.org/jira/browse/LUCENE-9286 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.5 >Reporter: Dawid Weiss >Assignee: Bruno Roustant >Priority: Major > Attachments: screen-[1].png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I see a dramatic increase in the amount of memory required for construction > of (arguably large) automata. It currently OOMs with 8GB of memory consumed > for bit tables. I am pretty sure this didn't require so much memory before > (the automaton is ~50MB after construction). > Something bad happened in between. Thoughts, [~broustant], [~sokolov]?
[jira] [Commented] (SOLR-14370) Refactor bin/solr to allow external override of Jetty modules
[ https://issues.apache.org/jira/browse/SOLR-14370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078806#comment-17078806 ] Andy Throgmorton commented on SOLR-14370: - Sure, I can explain more to solicit alternative solutions. But I understand if this type of use case is something the Solr community doesn't want to encourage/support. We have some code that builds a custom SslContext, and we need the Jetty server to use it during bootstrap. For this purpose, we use a custom jetty.xml file and a custom module that loads/runs it at startup. > Refactor bin/solr to allow external override of Jetty modules > - > > Key: SOLR-14370 > URL: https://issues.apache.org/jira/browse/SOLR-14370 > Project: Solr > Issue Type: Improvement > Security Level: Public (Default Security Level. Issues are Public) > Components: scripts and tools >Reporter: Andy Throgmorton >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > The bin/solr script currently does not allow for externally overriding the > modules passed to Jetty on startup. > This PR adds the ability to override the Jetty modules on startup by setting > {{JETTY_MODULES}} as an environment variable; when set, bin/solr passes the > string through verbatim (without clobbering it) into {{SOLR_JETTY_CONFIG}}. For > example, you can now run: > {{JETTY_MODULES=--module=foo bin/solr start}} > We've added some custom Jetty modules that can be optionally enabled; this > change allows us to keep our logic (regarding which modules to use) in a > separate script, rather than maintaining a forked bin/solr.
[GitHub] [lucene-solr] mocobeta merged pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects
mocobeta merged pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects URL: https://github.com/apache/lucene-solr/pull/1388
[jira] [Commented] (LUCENE-9278) Make javadoc folder structure follow Gradle project path
[ https://issues.apache.org/jira/browse/LUCENE-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078815#comment-17078815 ] ASF subversion and git services commented on LUCENE-9278: - Commit 4f92cd414c4da6ac6163ff4101b0e07fb94fd067 in lucene-solr's branch refs/heads/master from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4f92cd4 ] LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects (#1388) > Make javadoc folder structure follow Gradle project path > > > Key: LUCENE-9278 > URL: https://issues.apache.org/jira/browse/LUCENE-9278 > Project: Lucene - Core > Issue Type: Task > Components: general/build >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 6h > Remaining Estimate: 0h > > The current javadoc folder structure is derived from the Ant project name, e.g.: > [https://lucene.apache.org/core/8_4_1/analyzers-icu/index.html] > [https://lucene.apache.org/solr/8_4_1/solr-solrj/index.html] > For the Gradle build, it should follow the Gradle project structure (path) > instead of the Ant one, to keep things simple to manage [1]. Hence, it will look > like this: > [https://lucene.apache.org/core/9_0_0/analysis/icu/index.html] > [https://lucene.apache.org/solr/9_0_0/solr/solrj/index.html] > [1] The change was suggested in the conversation between Dawid Weiss and me on > a GitHub PR: [https://github.com/apache/lucene-solr/pull/1304]
[jira] [Commented] (SOLR-12890) Vector Search in Solr (Umbrella Issue)
[ https://issues.apache.org/jira/browse/SOLR-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078818#comment-17078818 ] Trey Grainger commented on SOLR-12890: -- After reviewing and testing the code in the patch generously contributed on this issue (thank you [~moshebla]!) and subsequently thinking through the design a lot, I believe there are several limitations to the approach in this current code. Specifically, the use of terms as dimensions in the vector with attached payloads is pretty inefficient and won't work well at scale, and the use of a query parser is less flexible and reusable than a function query/value source approach would be (in terms of more flexible combination with other functions and use in sorting, returned fields, etc.). Additionally, I think an optimal design would allow for multi-valued vectors (multiple vectors in a field) in order to support things like word embeddings, sentence embeddings, paragraph embeddings, etc., as opposed to only one vector per field in each document, which is challenging to implement with the current approach. Instead of hijacking this Jira and replacing the previous work and design, I've created a new Jira (SOLR-14397) and submitted a new proposed design there, which I plan to work on as the next iteration of this Vector Search in Solr initiative. If you're following along with this effort, I'd encourage you to check out SOLR-14397 and provide any feedback on the updated design proposed there. Thanks! > Vector Search in Solr (Umbrella Issue) > -- > > Key: SOLR-12890 > URL: https://issues.apache.org/jira/browse/SOLR-12890 > Project: Solr > Issue Type: New Feature >Reporter: mosh >Priority: Major > > We have recently come across a need to index documents containing vectors > using Solr, and have even worked on a small POC. We used a URP to calculate > the LSH (we chose to use the superbit algorithm, but the code is designed in a > way that the algorithm picked can be easily changed), and stored the vector in > either sparse or dense forms, in a binary field. > Perhaps an addition of an LSH URP in conjunction with a query parser that > uses the same properties to calculate LSH (or maybe ktree, or some other > algorithm altogether) should be considered as a Solr feature?
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects
mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405878977 ## File path: gradle/render-javadoc.gradle ## @@ -15,93 +15,105 @@ * limitations under the License. */ -// generate javadocs by using Ant javadoc task +// generate javadocs by calling javadoc tool +// see https://docs.oracle.com/en/java/javase/11/tools/javadoc.html + +// utility function to convert project path to document output dir +// e.g.: ':lucene:analysis:common' => 'analysis/common' +def pathToDocdir = { path -> path.split(':').drop(2).join('/') } allprojects { plugins.withType(JavaPlugin) { -ext { - javadocRoot = project.path.startsWith(':lucene') ? project(':lucene').file("build/docs") : project(':solr').file("build/docs") - javadocDestDir = "${javadocRoot}/${project.name}" -} - task renderJavadoc { - description "Generates Javadoc API documentation for the main source code. This invokes Ant Javadoc Task." + description "Generates Javadoc API documentation for the main source code. This directly invokes javadoc tool." group "documentation" ext { -linksource = "no" +linksource = false linkJUnit = false -linkHref = [] +linkLuceneProjects = [] +linkSorlProjects = [] } dependsOn sourceSets.main.compileClasspath inputs.files { sourceSets.main.java.asFileTree } - outputs.dir project.javadocRoot + outputs.dir project.javadoc.destinationDir def libName = project.path.startsWith(":lucene") ? "Lucene" : "Solr" def title = "${libName} ${project.version} ${project.name} API".toString() + // absolute urls for "-linkoffline" option + def javaSEDocUrl = "https://docs.oracle.com/en/java/javase/11/docs/api/"; + def junitDocUrl = "https://junit.org/junit4/javadoc/4.12/"; + def luceneDocUrl = "https://lucene.apache.org/core/${project.version.replace(".", "_")}".toString() + def solrDocUrl = "https://lucene.apache.org/solr/${project.version.replace(".", "_")}".toString() + + def javadocCmd = org.gradle.internal.jvm.Jvm.current().getJavadocExecutable() Review comment: I just merged it to the master. > We may have to do something similar to what ES does since we want to be able to run javac, javadocs and tests against new JVMs (which gradle itself may not support yet). Should we open an issue for that, or can it be delayed?
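For readers skimming the diff, the Groovy `pathToDocdir` closure above is the core of the new output layout; a hypothetical stand-alone Java rendering of the same mapping (class and method names are invented here) might look like this:

```java
import java.util.Arrays;

public class PathToDocdir {
  // Mirrors the Groovy closure in the diff: drop the leading empty segment and
  // the top-level "lucene"/"solr" segment, then join the rest with '/'.
  static String pathToDocdir(String projectPath) {
    String[] parts = projectPath.split(":"); // ":lucene:analysis:common" -> ["", "lucene", "analysis", "common"]
    return String.join("/", Arrays.asList(parts).subList(2, parts.length));
  }

  public static void main(String[] args) {
    System.out.println(pathToDocdir(":lucene:analysis:common")); // analysis/common
  }
}
```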
[GitHub] [lucene-solr] plperron commented on issue #1330: LUCENE-9267 Replace getQueryBuildTime time unit from ms to ns
plperron commented on issue #1330: LUCENE-9267 Replace getQueryBuildTime time unit from ms to ns URL: https://github.com/apache/lucene-solr/pull/1330#issuecomment-611254162 Should I rebase both commits into a single one to keep things cohesive?
[jira] [Updated] (LUCENE-9267) The documentation of getQueryBuildTime function reports a wrong time unit.
[ https://issues.apache.org/jira/browse/LUCENE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre-Luc Perron updated LUCENE-9267: -- Attachment: LUCENE-9267.patch > The documentation of getQueryBuildTime function reports a wrong time unit. > -- > > Key: LUCENE-9267 > URL: https://issues.apache.org/jira/browse/LUCENE-9267 > Project: Lucene - Core > Issue Type: Task > Components: modules/other >Affects Versions: 8.2, 8.3, 8.4 >Reporter: Pierre-Luc Perron >Priority: Trivial > Labels: documentation, newbie, pull-request-available > Attachments: LUCENE-9267.patch, LUCENE-9267.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > As per documentation, the > [MatchingQueries|https://lucene.apache.org/core/8_4_1/monitor/org/apache/lucene/monitor/MatchingQueries.html] > class returns both getQueryBuildTime and getSearchTime in milliseconds. The > code shows > [searchTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/CandidateMatcher.java#L112] > returning milliseconds. However, the code shows > [buildTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/QueryIndex.java#L280] > returning nanoseconds. > The patch changes the documentation of getQueryBuildTime to report > nanoseconds instead of milliseconds.
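To make the unit mismatch concrete, here is a minimal illustration (not the Monitor code itself; only the use of {{System.nanoTime()}} for the build timestamp is taken from the links above):

{code:java}
import java.util.concurrent.TimeUnit;

public class TimeUnitDemo {
  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    Thread.sleep(5); // stand-in for building a query
    long buildTime = System.nanoTime() - start; // a nanosecond delta, as in QueryIndex
    System.out.println(buildTime);              // e.g. ~5_000_000: clearly not milliseconds
    System.out.println(TimeUnit.NANOSECONDS.toMillis(buildTime)); // ~5: what the old javadoc implied
  }
}
{code}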
[jira] [Updated] (LUCENE-9267) The documentation of getQueryBuildTime function reports a wrong time unit.
[ https://issues.apache.org/jira/browse/LUCENE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre-Luc Perron updated LUCENE-9267: -- Attachment: (was: LUCENE-9267.patch) > The documentation of getQueryBuildTime function reports a wrong time unit. > -- > > Key: LUCENE-9267 > URL: https://issues.apache.org/jira/browse/LUCENE-9267 > Project: Lucene - Core > Issue Type: Task > Components: modules/other >Affects Versions: 8.2, 8.3, 8.4 >Reporter: Pierre-Luc Perron >Priority: Trivial > Labels: documentation, newbie, pull-request-available > Attachments: LUCENE-9267.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > As per documentation, the > [MatchingQueries|https://lucene.apache.org/core/8_4_1/monitor/org/apache/lucene/monitor/MatchingQueries.html] > class returns both getQueryBuildTime and getSearchTime in milliseconds. The > code shows > [searchTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/CandidateMatcher.java#L112] > returning milliseconds. However, the code shows > [buildTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/QueryIndex.java#L280] > returning nanoseconds. > The patch changes the documentation of getQueryBuildTime to report > nanoseconds instead of milliseconds.
[GitHub] [lucene-solr] mayya-sharipova commented on issue #1351: LUCENE-9280: Collectors to skip noncompetitive documents
mayya-sharipova commented on issue #1351: LUCENE-9280: Collectors to skip noncompetitive documents URL: https://github.com/apache/lucene-solr/pull/1351#issuecomment-611260306 @romseygeek I have tried to address your outstanding feedback in 4448499f0f. Can you please continue the review when you have time? > Move the logic that checks whether or not to update the iterator into setBottom on the leaf comparator. In the new `FilteringFieldComparator` class, the iterator is updated in: - `setBottom`; - `getLeafComparator`, when we change a segment, so that we can also update the iterators of subsequent segments; - `setCanUpdateIterator`, when the queue first becomes full and hitsThreshold is reached; this method is called from a collector.
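For anyone following along, here is a rough, self-contained sketch of the update pattern described above. This is not the PR's code: `CompetitiveIterator`, `FilteringLeafComparator`, and `updateBound` are invented names, and Lucene's real comparator API differs.

```java
// Illustrative only: a leaf comparator wrapper that narrows its iterator of
// competitive docs once the collector signals that the hit queue is full.
interface CompetitiveIterator {
  void updateBound(long bottomValue); // skip docs that can no longer compete
}

class FilteringLeafComparator {
  private final CompetitiveIterator competitiveDocs;
  private boolean canUpdate; // set once hitsThreshold is reached and the queue is full
  private long bottomValue;  // value of the weakest entry currently in the queue

  FilteringLeafComparator(CompetitiveIterator competitiveDocs) {
    this.competitiveDocs = competitiveDocs;
  }

  // Called by the collector the first time the queue fills up.
  void setCanUpdateIterator() {
    canUpdate = true;
    competitiveDocs.updateBound(bottomValue); // catch up immediately
  }

  // Called whenever the weakest queue entry changes.
  void setBottom(long newBottomValue) {
    bottomValue = newBottomValue;
    if (canUpdate) {
      competitiveDocs.updateBound(bottomValue);
    }
  }
}
```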
[GitHub] [lucene-solr] CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611274493 Thanks a lot for your hard work @bruno-roustant
[GitHub] [lucene-solr] dsmiley commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
dsmiley commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611276129 Dat, can you move this out of the numeric package, please?
[GitHub] [lucene-solr] CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)
CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP) URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611288608 @dsmiley Done!
[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data
noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405920137 ## File path: solr/core/src/java/org/apache/solr/handler/admin/ZkRead.java ## @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.handler.admin; + +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.solr.api.Command; +import org.apache.solr.api.EndPoint; +import org.apache.solr.client.solrj.SolrRequest; +import org.apache.solr.client.solrj.impl.BinaryResponseParser; +import org.apache.solr.common.MapWriter; +import org.apache.solr.common.params.CommonParams; +import org.apache.solr.common.params.MapSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.ContentStreamBase; +import org.apache.solr.common.util.Utils; +import org.apache.solr.core.CoreContainer; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.response.SolrQueryResponse; +import org.apache.zookeeper.data.Stat; + +import static org.apache.solr.common.params.CommonParams.OMIT_HEADER; +import static org.apache.solr.common.params.CommonParams.WT; +import static org.apache.solr.response.RawResponseWriter.CONTENT; +import static org.apache.solr.security.PermissionNameProvider.Name.COLL_READ_PERM; + +/**Exposes the content of the Zookeeper + * This is an expert feature that exposes the data inside the back end zookeeper.This API may change or + * be removed in future versions. + * This is not a public API. The data that is returned is not guaranteed to remain same + * across releases, as the data stored in Zookeeper may change from time to time. + */ +@EndPoint(path = "/cluster/zk/*", +method = SolrRequest.METHOD.GET, +permission = COLL_READ_PERM) +public class ZkRead { + private final CoreContainer coreContainer; + + public ZkRead(CoreContainer coreContainer) { +this.coreContainer = coreContainer; + } + + @Command + public void get(SolrQueryRequest req, SolrQueryResponse rsp) { +String path = req.getPathTemplateValues().get("*"); +if (path == null || path.isEmpty()) path = "/"; +byte[] d = null; +try { + List l = coreContainer.getZkController().getZkClient().getChildren(path, null, false); + if (l != null && !l.isEmpty()) { +String prefix = path.endsWith("/") ? 
path : path + "/"; +rsp.add(path, (MapWriter) ew -> { + for (String s : l) { +try { + Stat stat = coreContainer.getZkController().getZkClient().exists(prefix + s, null, false); + ew.put(s, (MapWriter) ew1 -> { +ew1.put("version", stat.getVersion()); +ew1.put("aversion", stat.getAversion()); +ew1.put("children", stat.getNumChildren()); +ew1.put("ctime", stat.getCtime()); +ew1.put("cversion", stat.getCversion()); +ew1.put("czxid", stat.getCzxid()); +ew1.put("ephemeralOwner", stat.getEphemeralOwner()); +ew1.put("mtime", stat.getMtime()); +ew1.put("mzxid", stat.getMzxid()); +ew1.put("pzxid", stat.getPzxid()); +ew1.put("dataLength", stat.getDataLength()); + }); +} catch (Exception e) { + ew.put(s, Collections.singletonMap("error", e.getMessage())); +} + } +}); + + } else { +d = coreContainer.getZkController().getZkClient().getData(path, null, null, false); +if (d == null || d.length == 0) { + rsp.add(path, null); + return; +} + +Map map = new HashMap<>(1); +map.put(WT, "raw"); +map.put(OMIT_HEADER, "true"); +req.setParams(SolrParams.wrapDefaults(new MapSolrParams(map), req.getParams())); Review comment: No, if you are requesting data, you should expect raw data. It will not honour the `wt` param.
[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data
noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405920325 ## File path: solr/core/src/test/org/apache/solr/handler/admin/ZookeeperStatusHandlerTest.java ## @@ -74,6 +78,39 @@ public void tearDown() throws Exception { super.tearDown(); } + @Test + public void testZkread() throws Exception { +URL baseUrl = cluster.getJettySolrRunner(0).getBaseUrl(); +String basezk = baseUrl.toString().replace("/solr", "/api") + "/cluster/zk"; + +try( HttpSolrClient client = new HttpSolrClient.Builder(baseUrl.toString()).build()) { + Object o = Utils.executeGET(client.getHttpClient(), + basezk + "/security.json", + Utils.JSONCONSUMER ); + assertNotNull(o); + o = Utils.executeGET(client.getHttpClient(), + basezk + "/configs", + Utils.JSONCONSUMER ); + assertEquals("0", String.valueOf(getObjectByPath(o,true, split(":/configs:_default:dataLength",':')))); + assertEquals("0", String.valueOf(getObjectByPath(o,true, split(":/configs:conf:dataLength",':')))); + byte[] bytes = new byte[1024*5]; + for (int i = 0; i < bytes.length; i++) { +bytes[i] = (byte) random().nextInt(128); Review comment: I wanted a big enough `byte[]`, not a small one.