Re: [PR] Use Vector API to decode BKD docIds [lucene]
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2724038977

I have some small concerns:
- The fact that the 512 step is tied to the number of points per leaf. It's not a big deal at all (postings are similar: their encoding logic is specialized for blocks of 128), but I'd rather err on a smaller block size than 512, which feels large-ish.
- Complexity: the encoding has 3 different sub-encodings: 512, 128 and remainder. Could we have only two?

But my main concern is more that I would like to better understand why 512 performs so much better. There must be something that happens with this 512 step that doesn't happen otherwise, such as using different instructions, loop unrolling, better CPU pipelining or something else. I have some discomfort merging something that is faster without having at least an intuition of why it's faster, so that I can also understand which JVMs and CPUs would enable this speedup. Could pipelining be the reason, as 24 (bits per value) * 32 (step) < 2 * 512 (bit width of SIMD instructions)? But then something like 128 should perform well, while your benchmark suggests it's still much worse than 512?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] A specialized Trie for Block Tree Index [lucene]
jpountz commented on PR #14333: URL: https://github.com/apache/lucene/pull/14333#issuecomment-2724046501

I started looking at the code but you would know better: does this new encoding make it easier to know the length of leaf blocks while traversing the terms index, so that we could prefetch the right byte range when doing terms dictionary lookups? https://github.com/apache/lucene/blob/661dcae3c25cc548a6df251b79b7bfac81c2dba8/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java#L147-L148
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724314718

Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2724394013

Just a gentle reminder.
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724062564

I don't know Unicode as well as Rob so I can't say what these alternate case folding equivalence classes are... but they definitely don't have a "canonical" representation with regard to Character.toLowerCase. Consider the killer Turkish dotless i, for example:

```java
public void testCornerCase() throws Exception {
  List<BytesRef> terms =
      Stream.of("aIb", "aıc")
          .map(
              s -> {
                int[] lowercased = s.codePoints().map(Character::toLowerCase).toArray();
                return new String(lowercased, 0, lowercased.length);
              })
          .map(LuceneTestCase::newBytesRef)
          .sorted()
          .collect(Collectors.toCollection(ArrayList::new));

  Automaton a = build(terms, false, true);
  System.out.println(a.toDot());
  assertTrue(a.isDeterministic());
}
```

which yields: (automaton rendering omitted)

It would take some kind of character normalization filter on both the index and automaton building/expansion side for this to work.
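To make the dotless-i trap above concrete (a standalone illustration, not code from the PR): Java's locale-independent `Character` case mappings send both the dotted 'i' and the Turkish dotless 'ı' (U+0131) to the same uppercase 'I', while 'I' lowercases only to the dotted 'i', so per-character lowercasing has no canonical inverse:

```java
// Standalone illustration of the dotless-i asymmetry: two distinct
// lowercase letters share one uppercase form, so lowercasing 'I' cannot
// recover which of the two was originally meant.
public class DotlessI {
  public static void main(String[] args) {
    char dotlessI = '\u0131'; // LATIN SMALL LETTER DOTLESS I ('ı')
    System.out.println(Character.toUpperCase('i') == 'I');      // true
    System.out.println(Character.toUpperCase(dotlessI) == 'I'); // true
    // ...but the mapping is not invertible: 'I' lowercases to the
    // dotted 'i' only, never back to 'ı'.
    System.out.println(Character.toLowerCase('I') == 'i');      // true
    System.out.println(Character.toLowerCase('I') == dotlessI); // false
  }
}
```

This is why "aIb" and "aıc" in the test above do not collapse to a single canonical form under `Character.toLowerCase`.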
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r1994867386

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:

## @@ -0,0 +1,486 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.codecs.lucene90.blocktree;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.function.BiConsumer;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;

/** TODO make it a more memory efficient structure */
class Trie {

  static final int SIGN_NO_CHILDREN = 0x00;
  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
  static final int SIGN_MULTI_CHILDREN = 0x03;

  record Output(long fp, boolean hasTerms, BytesRef floorData) {}

  private enum Status {
    UNSAVED,
    SAVED,
    DESTROYED
  }

  private static class Node {
    private final int label;
    private final LinkedList<Node> children;
    private Output output;
    private long fp = -1;

    Node(int label, Output output, LinkedList<Node> children) {
      this.label = label;
      this.output = output;
      this.children = children;
    }
  }

  private Status status = Status.UNSAVED;
  final Node root = new Node(0, null, new LinkedList<>());

  Trie(BytesRef k, Output v) {
    if (k.length == 0) {
      root.output = v;
      return;
    }
    Node parent = root;
    for (int i = 0; i < k.length; i++) {
      int b = k.bytes[i + k.offset] & 0xFF;
      Output output = i == k.length - 1 ? v : null;
      Node node = new Node(b, output, new LinkedList<>());
      parent.children.add(node);
      parent = node;
    }
  }

  void putAll(Trie trie) {
    if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
      throw new IllegalStateException("tries should be unsaved");
    }
    trie.status = Status.DESTROYED;
    putAll(this.root, trie.root);
  }

  private static void putAll(Node n, Node add) {
    assert n.label == add.label;
    if (add.output != null) {
      n.output = add.output;
    }
    ListIterator<Node> iter = n.children.listIterator();
    // TODO we can do more efficient if there is no intersection, block tree always do that
    outer:
    for (Node addChild : add.children) {
      while (iter.hasNext()) {
        Node nChild = iter.next();
        if (nChild.label == addChild.label) {
          putAll(nChild, addChild);
          continue outer;
        }
        if (nChild.label > addChild.label) {
          iter.previous(); // move back
          iter.add(addChild);
          continue outer;
        }
      }
      iter.add(addChild);
    }
  }

  Output getEmptyOutput() {
    return root.output;
  }

  void forEach(BiConsumer<BytesRef, Output> consumer) {
    if (root.output != null) {
      consumer.accept(new BytesRef(), root.output);
    }
    intersect(root.children, new BytesRefBuilder(), consumer);
  }

  private void intersect(
      List<Node> nodes, BytesRefBuilder key, BiConsumer<BytesRef, Output> consumer) {
    for (Node node : nodes) {
      key.append((byte) node.label);
      if (node.output != null) consumer.accept(key.toBytesRef(), node.output);
      intersect(node.children, key, consumer);
      key.setLength(key.length() - 1);
    }
  }

  void save(DataOutput meta, IndexOutput index) throws IOException {
    if (status != Status.UNSAVED) {
      throw new IllegalStateException("only unsaved trie can be saved");
    }
    status = Status.SAVED;
    meta.writeVLong(index.getFilePointer());
    saveNodes(index);
    meta.writeVLong(root.fp);
    index.writeLong(0L); // additional 8 bytes for over-reading
    meta.writeVLong(index.getFilePointer());
  }

  void saveNodes(IndexOutput index) throws IOException {
    final long startFP = index.getFilePointer();
    Deque<Node> stack = new ArrayDeque<>();
    stack.p
```
Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]
thecoop commented on code in PR #14304: URL: https://github.com/apache/lucene/pull/14304#discussion_r1987194449

## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:

## @@ -907,4 +907,87 @@ public static long int4BitDotProduct128(byte[] q, byte[] d) { } return subRet0 + (subRet1 << 1) + (subRet2 << 2) + (subRet3 << 3); }

```java
  @Override
  public float quantize(
      float[] vector, byte[] dest, float scale, float alpha, float minQuantile, float maxQuantile) {
    float correction = 0;
    int i = 0;
    // only vectorize if we have a viable BYTE_SPECIES we can use for output
    if (VECTOR_BITSIZE >= 256) {
      for (; i < FLOAT_SPECIES.loopBound(vector.length); i += FLOAT_SPECIES.length()) {
        FloatVector v = FloatVector.fromArray(FLOAT_SPECIES, vector, i);

        // Make sure the value is within the quantile range, cutting off the tails
        // see first parenthesis in equation: byte = (float - minQuantile) * 127/(maxQuantile -
        // minQuantile)
        FloatVector dxc = v.min(maxQuantile).max(minQuantile).sub(minQuantile);
        // Scale the value to the range [0, 127], this is our quantized value
        // scale = 127/(maxQuantile - minQuantile)
        // Math.round rounds to positive infinity, so do the same by +0.5 then truncating to int
        Vector<Integer> roundedDxs = dxc.mul(scale).add(0.5f).convert(VectorOperators.F2I, 0);
        // output this to the array
        ((ByteVector) roundedDxs.castShape(BYTE_SPECIES, 0)).intoArray(dest, i);
        // We multiply by `alpha` here to get the quantized value back into the original range
        // to aid in calculating the corrective offset
        Vector<Float> dxq = ((FloatVector) roundedDxs.castShape(FLOAT_SPECIES, 0)).mul(alpha);
        // Calculate the corrective offset that needs to be applied to the score
        // in addition to the `byte * minQuantile * alpha` term in the equation
        // we add the `(dx - dxq) * dxq` term to account for the fact that the quantized value
        // will be rounded to the nearest whole number and lose some accuracy
        // Additionally, we account for the global correction of `minQuantile^2` in the equation
        correction +=
            v.sub(minQuantile / 2f)
                .mul(minQuantile)
                .add(v.sub(minQuantile).sub(dxq).mul(dxq))
                .reduceLanes(VectorOperators.ADD);
```

Review Comment: And even more with FMA operations
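For context on the FMA remark above (a hedged side note, not part of the PR): a fused multiply-add computes `a * b + c` with a single rounding step; the Panama vector API exposes a lane-wise `fma` on `FloatVector` analogous to the scalar `java.lang.Math.fma` shown here:

```java
// Standalone sketch of fused multiply-add (FMA), the follow-up
// optimization suggested in the review comment above.
// Math.fma(a, b, c) evaluates a * b + c with one rounding; on hardware
// with FMA units the JIT typically emits a single instruction for it.
public class FmaSketch {
  public static void main(String[] args) {
    float a = 0.1f, b = 0.2f, c = 0.3f;
    float fused = Math.fma(a, b, c); // one rounding
    float separate = a * b + c;      // two roundings, may differ in the last bit
    System.out.println(fused);
    System.out.println(separate);
  }
}
```

In the quantize loop above, the `(dx - dxq) * dxq` accumulation is exactly the kind of multiply-then-add chain that could fold into such fused operations.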
Re: [PR] Speed up scoring conjunctions a bit. [lucene]
jpountz commented on PR #14345: URL: https://github.com/apache/lucene/pull/14345#issuecomment-2724895262

Nightly benchmarks confirmed the speedup: https://benchmarks.mikemccandless.com/FilteredAndHighHigh.html. I'll push an annotation.
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
jpountz commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2724923272

Apologies, I had missed your reply.

> should this be a shared global pool across all IndexWriters, or should each writer have its own pool?

It should be shared; we don't want the total number of threads to scale with the number of index writers. The reasoning for the numProcessors/2 number is that merging generally should not be more expensive than indexing, so by reserving only half the CPU capacity for merging, it should still be possible to max out hardware while indexing, while also keeping the peak number of threads running merges under numProcessors/2.

> If it's a shared pool, how should we handle cases where a few writers are highly active while others are idle? Should we allow active writers to take more resources dynamically, or keep a strict fixed allocation?

Idle writers would naturally submit fewer tasks than highly active writers. IMO the fixed allocation is key here.
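The sizing rule described above (one pool shared by all IndexWriters, capped at numProcessors/2 so merging cannot crowd out indexing) could be sketched like this; `SharedMergePool` is a hypothetical name for illustration, not an existing Lucene class:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hedged sketch of the fixed, shared allocation described above: the total
// number of merge threads is a process-wide constant, independent of how
// many IndexWriters exist, and capped at half the available CPUs.
public class SharedMergePool {
  // Reserve at most half the cores for merging (at least one thread),
  // leaving the other half free to max out hardware while indexing.
  static final int MAX_MERGE_THREADS =
      Math.max(1, Runtime.getRuntime().availableProcessors() / 2);

  // A single pool shared across all writers: idle writers naturally submit
  // fewer merge tasks, active writers more, all within the same fixed cap.
  private static final ExecutorService POOL =
      Executors.newFixedThreadPool(MAX_MERGE_THREADS);

  public static ExecutorService instance() {
    return POOL;
  }

  public static void main(String[] args) {
    System.out.println("merge threads capped at: " + MAX_MERGE_THREADS);
    POOL.shutdown();
  }
}
```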
Re: [PR] Improve DenseConjunctionBulkScorer's sparse fallback. [lucene]
jpountz merged PR #14354: URL: https://github.com/apache/lucene/pull/14354
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725496380

Ok, fair enough.
[PR] Remove constructor with deprecated attribute 'onlyLongestMatch' [lucene]
renatoh opened a new pull request, #14356: URL: https://github.com/apache/lucene/pull/14356

### Description
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724585337

> Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).

I don't think this is currently a good option either: users won't just do that. They will determinize, minimize, and tableize, and then be confused when things are slow or use too much memory.
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725736846

It isn't a good idea. If the user wants to "erase case differences" then they should apply `foldcase(ch)`. That's what case-folding means. That CaseFolding class does everything except that. Again, it's why I recommend not messing with it for now and starting simpler.
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724580292

This is why I recommended not using the Unicode function and starting simple. Then you have a potential way to get it working efficiently.
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725709282

This is kind of what I had in mind:

```java
private static int canonicalize(int codePoint) {
  int[] alternatives = CaseFolding.lookupAlternates(codePoint);
  if (alternatives != null) {
    for (int cp : alternatives) {
      codePoint = Math.min(codePoint, cp);
    }
  } else {
    int altCase =
        Character.isLowerCase(codePoint)
            ? Character.toUpperCase(codePoint)
            : Character.toLowerCase(codePoint);
    codePoint = Math.min(codePoint, altCase);
  }
  return codePoint;
}

public void testCornerCase() throws Exception {
  List<BytesRef> terms =
      Stream.of("aIb", "aıc")
          .map(
              s -> {
                int[] lowercased =
                    s.codePoints().map(TestStringsToAutomaton::canonicalize).toArray();
                return new String(lowercased, 0, lowercased.length);
              })
          .map(LuceneTestCase::newBytesRef)
          .sorted()
          .collect(Collectors.toCollection(ArrayList::new));

  Automaton a = build(terms, false, true);
  System.out.println(a.toDot());
  assertTrue(a.isDeterministic());
}
```

That produces an automaton which is minimal and deterministic: (rendering omitted)

I don't know if that `canonicalize` method is a good idea, though.
Re: [PR] PointInSetQuery use reverse collection to improve performance [lucene]
hanbj commented on PR #14352: URL: https://github.com/apache/lucene/pull/14352#issuecomment-2724230306

Thank you for providing ideas. In scenarios with multiple dimensions, the internal nodes of the BKD tree can only be sorted by a single dimension, and different internal nodes may be sorted by different dimensions, so this is indeed difficult to implement.
Re: [PR] Use Vector API to decode BKD docIds [lucene]
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2726015514

Thanks for running benchmarks. So it looks like the JVM doesn't think these shorter loops (with step 128) are worth unrolling? This makes me wonder how something like that performs on your AVX-512 CPU. I think you had something similar in one of your previous iterations. On my machine it's on par with the current version.

```java
private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
  if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
    // Same format, but enabling the JVM to specialize the decoding logic for the default number
    // of points per node proved to help on benchmarks
    doReadInts24(in, 512, docIDs);
  } else {
    doReadInts24(in, count, docIDs);
  }
}

private void doReadInts24(IndexInput in, int count, int[] docIDs) throws IOException {
  // Read the first (count - count % 4) values
  int quarter = count >> 2;
  int numBytes = quarter * 3;
  in.readInts(scratch, 0, numBytes);
  for (int i = 0; i < numBytes; ++i) {
    docIDs[i] = scratch[i] >>> 8;
    scratch[i] &= 0xFF;
  }
  for (int i = 0; i < quarter; ++i) {
    docIDs[numBytes + i] =
        scratch[i] | (scratch[quarter + i] << 8) | (scratch[2 * quarter + i] << 16);
  }
  // Now read the remaining 0, 1, 2 or 3 values
  for (int i = quarter << 2; i < count; ++i) {
    docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
  }
}
```
Re: [PR] Use Vector API to decode BKD docIds [lucene]
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2725390772

> There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else.

Thanks for pointing this out. I studied the asm profile again and I can see that at least loop unrolling differs. According to the asm printed by JMH, for bpv24 decoding:

* VectorAPI unrolled the shift loop x8 (add 0x40 once) and the remainder loop x4 (add 0x20 once).
* InnerLoop with a 512 step unrolled the shift loop x4 (add 0x20 once) and the remainder loop x2 (add 0x10 once).
* InnerLoop with a 128 step does not get loop unrolling for either the shift loop (add 0x8 once) or the remainder loop (add 0x8 once).

This corresponds to the JMH result: Vector API > InnerLoop step-512 > InnerLoop step-128. Things might change in luceneutil because we find InnerLoop step-512 faster than the Vector API there. I confirmed the luceneutil result of step-512 (baseline) vs step-128 (candidate):

```
                Task    QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff        p-value
      FilteredIntNRQ           80.02   (4.0%)                     71.31   (3.0%)     -10.9%  ( -17% -  -4%)  0.000
              IntNRQ           80.94   (2.5%)                     72.60   (3.6%)     -10.3%  ( -16% -  -4%)  0.000
 CountFilteredIntNRQ           42.93   (2.9%)                     40.22   (2.3%)      -6.3%  ( -11% -  -1%)  0.001
              IntSet           93.36   (2.1%)                     93.85   (0.7%)       0.5%  (  -2% -   3%)  0.633
```
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726097192

Hmm... I'm thinking of just requiring that input is lowercase (per `Character.toLowerCase(c)`), then checking for collisions on uppercase versions when adding transitions, and throwing an exception (since it won't be a DFA).

Unfortunately, that would mess with Turkish, if someone tries searching for sınıf (class) and sinirli (nervous). Without locale info, we'd get two transitions from s to I.