Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Mon, 10 Feb 2025 10:43:22 -0800


john-wagster commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1949691308



##########
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFoldingUtil.java:
##########
@@ -0,0 +1,338 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.HexFormat;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Queue;
+import java.util.Set;
+
+/**
+ * Generates the case folding set of alternate character mappings based on 
this spec:
+ * https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt

Review Comment:
   So relying on `SpecialCasing.txt` by itself is not going to work.  It lacks 
several of the more common mappings like: 
   
   ```
   0x1C84 # CYRILLIC SMALL LETTER TALL TE
   0x1C85 # CYRILLIC SMALL LETTER THREE-LEGGED TE
   0x0422 # CYRILLIC CAPITAL LETTER TE
   0x0442 # CYRILLIC SMALL LETTER TALL TE
   ```
   
   I can combine `SpecialCasing.txt` with `UnicodeData.txt` or 
`CaseFolding.txt` but it really doesn't buy us anything.  `SpecialCasing.txt` 
is just missing information about all of the "simple" folds.  I could take 
`CaseFolding.txt` and then filter it by what `Character.toUpperCase` and 
`Character.toLowerCase` can map to but I was thinking about this and it strikes 
me that I would be doing that simply to reduce the map size from 24kb, which 
seems pretty small to me to begin with.  And then at runtimes we'd be calling 
`Character.toUpperCase` and `Character.toLowerCase` to undo that and get those 
mappings back.  I can see an argument that there's an upside to not maintaining 
the mappings that "Java already supports" but then it seems painful to know 
what version of the Unicode spec we actually support.  If we build the mapping 
from `CaseFolding.txt` only it's simpler logic at regex build time and easier 
to document and say we support version 16.0 of the Unicode case folding spec.  
I c
 an and already plan to update the `casefolding.txt` file with comments about 
which characters are folding.  So I'm not sure it's a huge difference between 
hard coding it in a Java file vs loading via a text file.  If there was a 
single source like we didn't need to generate the mapping and could read from 
say `CaseFolding.txt` or `SpecialCasing.txt` it would make more sense to hard 
code it to me, but I'm a little concerned it would be hard to see edits to the 
spec reflected appropriately without maintaining some of kind utility class 
like what I currently have as `CaseFoldingUtil`.  But if you want to push back 
hard on some of this I don't mind building a utility as a one off that outputs 
a bunch of case statements or Map Entry lines instead of a checked in utility 
that outputs a bunch of text lines.  The two seem pretty equivalent to me.  It 
seems nice to maintain the mapping utility in Lucene itself imo.  I'd prefer to 
get something in for Unicode support rather than nothing thoug
 h.  And particularly for my use case here (I'm trying to clean up and unify 
regex in Elasticsearch) it would be nice to expose a `CaseFolding` class that I 
can reference in ES instead of say using icu4j in ES and Lucene having a 
potentially slightly different opinion.  
   
   Lastly if the map size of 24kb is concerning I can also conditionally load 
that only when the flag is set.  I'm kind of thinking this is a good idea 
anyway but didn't want to add unnecessary complexity.
   
   To be confident about some of this I went and have code to do much of it 
locally.  I can show you the differences between using `SpecialCasing.txt` vs 
all of `CaseFolding.txt` vs what I did original which is evaluate all of the 
matches that Pattern pulls back.  Let me know if that would be helpful to 
review one of those and or see some snippets in comments here.    
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

Reply via email to