Re: [I] investigate jflex 1.9.x buffer size/expansion feature [lucene]

via GitHub Sun, 07 Sep 2025 17:52:52 -0700


rlaehdals commented on issue #14645:
URL: https://github.com/apache/lucene/issues/14645#issuecomment-3264235977


   I have been looking into whether the skeleton.txt files can be removed.
   Since JFlex already provides a built-in skeleton, it seems that 
skeleton.default.txt can safely be deleted.
   
   The main issue is with skeleton.disable.buffer.expansion.txt. JFlex does not 
provide a built-in option to disable buffer expansion, so without that 
skeleton the tests related to buffer size fail.
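
To make "disabling buffer expansion" concrete, here is a minimal Java sketch of a refill step over a fixed-size buffer. It is an illustration only, not JFlex's generated code: the class `FixedBufferRefillSketch` and its members are hypothetical, and surrogate-pair handling is left out. The point is that a full buffer is reported to the caller instead of being reallocated.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration; these names do not come from JFlex or Lucene.
public class FixedBufferRefillSketch {
  private final Reader reader;
  private final char[] buffer; // allocated once, never reallocated
  private int endRead;         // number of valid chars currently in the buffer

  public FixedBufferRefillSketch(Reader reader, int bufferSize) {
    this.reader = reader;
    this.buffer = new char[bufferSize];
  }

  public int endRead() {
    return endRead;
  }

  /** Returns false if new chars were read, true if the buffer is full or input is exhausted. */
  public boolean refill() throws IOException {
    int requested = buffer.length - endRead;
    if (requested == 0) {
      return true; // buffer is full: report it rather than growing
    }
    int numRead = reader.read(buffer, endRead, requested);
    if (numRead == 0) {
      throw new IOException("Reader returned 0 characters");
    }
    if (numRead > 0) {
      endRead += numRead;
      return false;
    }
    return true; // numRead == -1: end of input
  }

  public static void main(String[] args) throws IOException {
    FixedBufferRefillSketch sketch =
        new FixedBufferRefillSketch(new StringReader("abcdefgh"), 4);
    System.out.println(sketch.refill() + " " + sketch.endRead()); // false 4
    System.out.println(sketch.refill() + " " + sketch.endRead()); // true 4
  }
}
```

The second call leaves four characters of input unread: with expansion disabled, the caller has to deal with a full buffer itself.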
   
   As a temporary workaround, I post-process the generated code in a doLast 
block of the JFlexTask, using regex replacements to remove the buffer 
expansion logic. However, I am not sure this is the right solution, and I 
would appreciate any feedback or suggestions for a better approach.
   
   ```
   configure(project(":lucene:core")) {
     task generateStandardTokenizerInternal(type: JFlexTask) {
       description = "Regenerate StandardTokenizerImpl.java"
       group = "generation"
   
       jflexFile = file('src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex')
   
       // NOTE: The following modifications in `doLast` are applied
       // after JFlex generates StandardTokenizerImpl.java.
       // These changes adjust buffer handling and error conditions.
       doLast {
         def generatedFile = file('src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java')
   
         // Turn the buffer size constant into a plain instance field.
         ant.replace(
             file: generatedFile,
             encoding: "UTF-8",
             token: "private static final int ZZ_BUFFERSIZE =",
             value: "private int ZZ_BUFFERSIZE ="
             )
   
         // Read and write with the same encoding that ant.replace uses.
         def content = generatedFile.getText('UTF-8')
   
         // Remove everything from the "big enough" check up to the fill step.
         content = content.replaceAll(
             /\/\* is the buffer big enough\? \*\/[\s\S]*?(?=\/\* fill the buffer with new input \*\/)/,
             ''
         )
   
         // Request only what fits, and return early when there is no room left.
         content = content.replaceAll(
             /int requested = zzBuffer\.length - zzEndRead;/,
             """int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
       if (requested == 0) {
         return true;
       }"""
         )
   
         // A zero-length read is always reported as an error.
         content = content.replaceAll(
             /if \(numRead == 0\) \{\s*if \(requested == 0\) \{[\s\S]*?\}\s*else \{[\s\S]*?\}\s*\}/,
             """if (numRead == 0) {
         throw new java.io.IOException(
             "Reader returned 0 characters. See JFlex examples/zero-reader for a workaround.");
       }"""
         )
   
         // Hold back a trailing high surrogate; return early if it was the only char read.
         content = content.replaceAll(
             /if \(numRead == requested\) \{[\s\S]*?zzFinalHighSurrogate = 1;[\s\S]*?\}/,
             """if (numRead == requested) { // We requested too few chars to encode a full Unicode character
             --zzEndRead;
             zzFinalHighSurrogate = 1;
             if (numRead == 1) {
               return true;
             }
           }"""
         )
   
         generatedFile.setText(content, 'UTF-8')
       }
     }
   
     def generateStandardTokenizer = wrapWithPersistentChecksums(generateStandardTokenizerInternal, [
       andThenTasks: [
         "applyGoogleJavaFormat"
       ],
       mustRunBefore: ["compileJava"]
     ])
   
     regenerate.dependsOn generateStandardTokenizer
   }
   ```
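
One risk with regex post-processing is that a replacement silently does nothing if a future JFlex release changes the generated text. The Java sketch below shows the same lazy-match-with-lookahead removal used in the build script, wrapped in a helper that fails loudly when the pattern does not match. The class `GeneratedSourcePatcher`, the method `replaceOrFail`, and the sample source are hypothetical; the sample is a simplified stand-in, not real JFlex output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of JFlex, Gradle, or the Lucene build.
public class GeneratedSourcePatcher {

  /** Replaces every match of regex in source, or throws if nothing matched. */
  public static String replaceOrFail(String source, String regex, String replacement) {
    Matcher matcher = Pattern.compile(regex).matcher(source);
    if (!matcher.find()) {
      throw new IllegalStateException("Pattern did not match generated source: " + regex);
    }
    // replaceAll resets the matcher first; quoteReplacement keeps '$' and '\' literal.
    return matcher.replaceAll(Matcher.quoteReplacement(replacement));
  }

  public static void main(String[] args) {
    // Simplified stand-in for the generated refill method.
    String source = String.join("\n",
        "/* is the buffer big enough? */",
        "if (zzCurrentPos >= zzBuffer.length) {",
        "  grow();",
        "}",
        "/* fill the buffer with new input */",
        "int requested = zzBuffer.length - zzEndRead;");

    // Lazy match from the first marker up to, but not including, the second marker.
    String patched = replaceOrFail(
        source,
        "/\\* is the buffer big enough\\? \\*/[\\s\\S]*?"
            + "(?=/\\* fill the buffer with new input \\*/)",
        "");
    System.out.println(patched);
  }
}
```

The same check could be written in Groovy inside `doLast`, so that a skeleton change in JFlex breaks regeneration visibly instead of producing an unpatched tokenizer.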


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

