Re: [PR] feat: Support Spark levenshtein expression in native execution [datafusion-comet]

via GitHub Tue, 28 Apr 2026 03:10:37 -0700


Myx778 commented on PR #4105:
URL: 
https://github.com/apache/datafusion-comet/pull/4105#issuecomment-4334273434


   ## Benchmark Results
   
   Ran `CometStringExpressionBenchmark` on a GitHub Codespace (4-core AMD EPYC 7763, 16 GB RAM, Ubuntu 22.04, JDK 17.0.18).
   
   ### levenshtein (2-arg)
   
   ```
   OpenJDK 64-Bit Server VM 17.0.18+8-Ubuntu-124.04.1 on Linux 6.8.0-1044-azure
   AMD EPYC 7763 64-Core Processor
   levenshtein:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Spark                                                21             24           3          0.0       20689.0       1.0X
   Comet (Scan)                                         22             24           2          0.0       21926.5       0.9X
   Comet (Scan + Exec)                                  22             25           3          0.0       21696.1       1.0X
   ```
   
   ### levenshtein_threshold (3-arg)
   
   ```
   OpenJDK 64-Bit Server VM 17.0.18+8-Ubuntu-124.04.1 on Linux 6.8.0-1044-azure
   AMD EPYC 7763 64-Core Processor
   levenshtein_threshold:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Spark                                                16             18           2          0.1       15537.4       1.0X
   Comet (Scan)                                         17             20           3          0.1       16607.5       0.9X
   Comet (Scan + Exec)                                  17             19           3          0.1       16605.2       0.9X
   ```
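
As a sanity check on how to read the two tables above, the sketch below recomputes the other columns from `Per Row(ns)`. It assumes the row count is the 1024 rows mentioned in the analysis, and that `Rate(M/s)` and `Relative` are derived from the best time in the usual Spark benchmark style; the `derive_columns` helper and the dictionary layout are illustrative names, not part of the benchmark code.

```python
# Per Row(ns) values copied from the two tables above.
PER_ROW_NS = {
    "levenshtein": {
        "Spark": 20689.0,
        "Comet (Scan)": 21926.5,
        "Comet (Scan + Exec)": 21696.1,
    },
    "levenshtein_threshold": {
        "Spark": 15537.4,
        "Comet (Scan)": 16607.5,
        "Comet (Scan + Exec)": 16605.2,
    },
}

ROWS = 1024  # assumption: the row count stated in the analysis


def derive_columns(cases, rows=ROWS):
    """Recompute (Best Time ms, Rate M/s, Relative) from Per Row(ns)."""
    baseline_ns = cases["Spark"]
    derived = {}
    for name, ns_per_row in cases.items():
        best_ms = ns_per_row * rows / 1e6     # total best time for all rows
        rate_m_per_s = 1e3 / ns_per_row       # rows per microsecond = M rows/s
        relative = baseline_ns / ns_per_row   # above 1.0 means faster than Spark
        derived[name] = (round(best_ms), round(rate_m_per_s, 1), round(relative, 1))
    return derived


for bench, cases in PER_ROW_NS.items():
    for name, cols in derive_columns(cases).items():
        print(bench, name, cols)
```

Under these assumptions the recomputed values agree with the printed columns, which suggests the `0.9X` and `1.0X` labels differ only by rounding of ratios near 0.94 to 0.95.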
   
   ### Analysis
   
   Performance is roughly on par with Spark for both variants (0.9X to 1.0X); the 1 ms gap in best time is smaller than the reported standard deviation of 2 to 3 ms. This is expected because:
   1. The benchmark dataset is small (1024 rows with short repeated strings), so the overhead of JVM↔native transitions likely dominates the actual computation time.
   2. `levenshtein` is a character-by-character comparison; any gains from native execution would more likely show on larger datasets, where SIMD auto-vectorization and cache-efficient memory layout can have more impact.
   3. The `WARNING: Benchmark plan is NOT fully Comet native!` message shows that the plan falls back through `ColumnarToRow`, which adds conversion overhead.
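
For readers unfamiliar with the per-row work being benchmarked, here is a minimal sketch of the character-by-character comparison in Python. It is the textbook two-row dynamic-programming algorithm, not the code in this PR. The `threshold` handling (return `-1` when the distance exceeds the threshold) is an assumption based on Spark's documented 3-argument behaviour, and the function name and signature are illustrative only.

```python
def levenshtein(a, b, threshold=None):
    """Edit distance between a and b using two rolling rows.

    If threshold is given and the distance exceeds it, return -1
    (assumed to mirror Spark's 3-argument variant).
    """
    # Keep the shorter string on the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr

    distance = prev[-1]
    if threshold is not None and distance > threshold:
        return -1
    return distance


print(levenshtein("kitten", "sitting"))     # 2-arg form
print(levenshtein("kitten", "sitting", 2))  # 3-arg form, distance above threshold
```

The work per row is proportional to `len(a) * len(b)`, so with short strings each call is cheap and fixed per-batch costs can easily outweigh it. This sketch applies the threshold only at the end; a production implementation could stop early once every cell in a row exceeds the threshold.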
   
   The key benefit is **correctness**: the native path produces results identical to Spark's, as validated by the unit tests and SLT tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

