john-wagster commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1949691308
########## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFoldingUtil.java: ########## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.automaton; + +import java.io.IOException; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardOpenOption; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashMap; +import java.util.HashSet; +import java.util.HexFormat; +import java.util.LinkedList; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Queue; +import java.util.Set; + +/** + * Generates the case folding set of alternate character mappings based on this spec: + * https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt Review Comment: So relying on `SpecialCasing.txt` by itself is not going to work. It lacks several of the more common mappings like: ``` 0x1C84 # CYRILLIC SMALL LETTER TALL TE 0x1C85 # CYRILLIC SMALL LETTER THREE-LEGGED TE 0x0422 # CYRILLIC CAPITAL LETTER TE 0x0442 # CYRILLIC SMALL LETTER TALL TE ``` I can combine `SpecialCasing.txt` with `UnicodeData.txt` or `CaseFolding.txt` but it really doesn't buy us anything. `SpecialCasing.txt` is just missing information about all of the "simple" folds. I could take `CaseFolding.txt` and then filter it by what `Character.toUpperCase` and `Character.toLowerCase` can map to but I was thinking about this and it strikes me that I would be doing that simply to reduce the map size from 24kb, which seems pretty small to me to begin with. And then at runtimes we'd be calling `Character.toUpperCase` and `Character.toLowerCase` to undo that and get those mappings back. I can see an argument that there's an upside to not maintaining the mappings that "Java already supports" but then it seems painful to know what version of the Unicode spec we actually support. If we build the mapping from `CaseFolding.txt` only it's simpler logic at regex build time and easier to document and say we support version 16.0 of the Unicode case folding spec. I c an and already plan to update the `casefolding.txt` file with comments about which characters are folding. So I'm not sure it's a huge difference between hard coding it in a Java file vs loading via a text file. If there was a single source like we didn't need to generate the mapping and could read from say `CaseFolding.txt` or `SpecialCasing.txt` it would make more sense to hard code it to me, but I'm a little concerned it would be hard to see edits to the spec reflected appropriately without maintaining some of kind utility class like what I currently have as `CaseFoldingUtil`. But if you want to push back hard on some of this I don't mind building a utility as a one off that outputs a bunch of case statements or Map Entry lines instead of a checked in utility that outputs a bunch of text lines. The two seem pretty equivalent to me. It seems nice to maintain the mapping utility in Lucene itself imo. I'd prefer to get something in for Unicode support rather than nothing thoug h. And particularly for my use case here (I'm trying to clean up and unify regex in Elasticsearch) it would be nice to expose a `CaseFolding` class that I can reference in ES instead of say using icu4j in ES and Lucene having a potentially slightly different opinion. Lastly if the map size of 24kb is concerning I can also conditionally load that only when the flag is set. I'm kind of thinking this is a good idea anyway but didn't want to add unnecessary complexity. To be confident about some of this I went and have code to do much of it locally. I can show you the differences between using `SpecialCasing.txt` vs all of `CaseFolding.txt` vs what I did original which is evaluate all of the matches that Pattern pulls back. Let me know if that would be helpful to review one of those and or see some snippets in comments here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org