kaori-seasons opened a new issue, #60951: URL: https://github.com/apache/doris/issues/60951
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Description

## 1. Background & Objectives

### 1.1 Migration Context

Snowflake is one of the mainstream cloud data warehouses, and many enterprises need to migrate from it to Apache Doris. The primary blockers encountered during actual migrations are:

- **Incompatible function names**: e.g., `IFF`, `LEN`, `CHARINDEX` have no corresponding names in Doris
- **Subtle semantic differences**: e.g., byte length vs. character length, different parameter ordering, NULL-handling divergence
- **Syntax-level differences**: e.g., `LATERAL FLATTEN`, `QUALIFY`, `expr::type` — Snowflake-specific syntax

### 1.2 Design Goals

- **G-1**: Without introducing a Snowflake dialect plugin, achieve **>90%** high-frequency Snowflake function compatibility through FE aliases + expression rewriting + a small number of new functions
- **G-2**: Avoid unnecessary BE changes; prioritize FE-layer semantic mapping
- **G-3**: Organize the compatibility work into decomposable, small, fast-iterating PR units that are easy for the community to review
- **G-4**: Verify all function mappings through **signature-level validation + simulated calculations** to prevent subtle semantic errors
- **G-5**: Fully respect current Doris semantics — **must not alter existing function behavior**

---

## 2. Technical Infrastructure Analysis

### 2.1 Function Registration System

Doris FE uses `BuiltinScalarFunctions.java` to register functions via the `scalar(Class, "name", "alias1", ...)` pattern:

```java
// Example: Cardinality class registered with three names
scalar(Cardinality.class, "array_size", "cardinality", "size")
```

**Function name case-insensitivity**: The `FunctionHelper.NamedFunc` constructor (L38-42) applies `String::toLowerCase` to all registered names, so `LEN`/`len`/`Len` are equivalent during lookup.
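The lookup behavior can be sketched with a minimal registry. This is an illustrative simulation, not the real implementation: `FunctionRegistry` and the `"Cardinality"` stand-in string are hypothetical; the actual logic lives in `FunctionHelper.NamedFunc`.

```python
# Minimal sketch of the case-insensitive function-name lookup described
# above. FunctionRegistry and the "Cardinality" string are illustrative
# stand-ins for the real FE classes.

class FunctionRegistry:
    def __init__(self):
        self._by_name = {}

    def scalar(self, impl, *names):
        # Mirrors scalar(Class, "name", "alias1", ...): every registered
        # name, aliases included, is normalized with lower() on the way in
        # (the analogue of String::toLowerCase).
        for name in names:
            self._by_name[name.lower()] = impl

    def resolve(self, name):
        # Lookup lowercases too, so the case a user writes never matters.
        return self._by_name.get(name.lower())

registry = FunctionRegistry()
registry.scalar("Cardinality", "array_size", "cardinality", "size")

# ARRAY_SIZE / array_size / Size all resolve to the same function:
assert registry.resolve("ARRAY_SIZE") == registry.resolve("Size") == "Cardinality"
```

This is also why Tier 0 below needs zero work for `ARRAY_SIZE`-style names: the registered lowercase name already matches any casing a Snowflake user writes.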
**FE constructors hardcode BE function names**: e.g., `Nvl.java`'s `super("ifnull", ...)` means FE aliases only affect name lookup during the Parser resolution phase. The function name sent to BE is always the hardcoded name in the constructor (e.g., `"ifnull"`), regardless of what the user writes in SQL.

### 2.2 Current Registered Function Status (Snowflake-Related)

Verified against the actual `BuiltinScalarFunctions.java` code:

| Current Registration | Notes |
|---|---|
| `scalar(If.class, "if")` | No `"iff"` |
| `scalar(Array.class, "array")` | No `"array_construct"` |
| `scalar(ArrayConcat.class, "array_concat")` | No `"array_cat"` |
| `scalar(ArrayJoin.class, "array_join")` | No `"array_to_string"` |
| `scalar(Cardinality.class, "array_size", "cardinality", "size")` | `ARRAY_SIZE` **already works** |
| `scalar(ArrayContains.class, "array_contains")` | **Already works** |
| `scalar(ArrayDistinct.class, "array_distinct")` | **Already works** |
| `scalar(JsonbParse.class, "jsonb_parse", "json_parse")` | No `"parse_json"` |
| `scalar(JsonbParseErrorToNull.class, "jsonb_parse_error_to_null", "json_parse_error_to_null")` | No `"try_parse_json"` |
| `scalar(JsonbValid.class, "json_valid")` | **Code quality issue: same class registered on two separate lines** |
| `scalar(JsonbValid.class, "jsonb_valid")` | Should be merged into one line |
| `scalar(Hex.class, "hex")` | No `"hex_encode"` |
| `scalar(Unhex.class, "unhex")` | No `"hex_decode_string"` |
| `scalar(ToBase64.class, "to_base64")` | No `"base64_encode"` |
| `scalar(FromBase64.class, "from_base64")` | No `"base64_decode_string"` |
| `scalar(EndsWith.class, "ends_with")` | No `"endswith"` |
| `scalar(StartsWith.class, "starts_with")` | No `"startswith"` |
| `scalar(Now.class, "now", "current_timestamp", "localtime", "localtimestamp")` | No `"getdate"` |
| `scalar(CharacterLength.class, "char_length", "character_length")` | No `"len"` |
| `scalar(Length.class, "length", "octet_length")` | **Byte** length, not character length |
| `scalar(Locate.class, "position", "locate")` | No `"charindex"` |
| `scalar(Instr.class, "instr")` | Parameter order **reversed** from Snowflake `CHARINDEX` |
| `scalar(ConvertTz.class, "convert_tz")` | Parameter order differs from Snowflake |
| `scalar(SplitByString.class, "split_by_string", "split")` | **`SPLIT` registered but follows the ClickHouse parameter convention** |
| `scalar(Nvl.class, "ifnull", "nvl")` | `NVL` already works |

### 2.3 Function Signature Verification Results

Verified by reading the FE Java source constructors one by one:

| FE Class | Constructor | Parameter Semantics | BE Function Name |
|---|---|---|---|
| `Locate.class` | `super("locate", arg0, arg1)` | `locate(substr, str[, pos])` | `locate` |
| `Instr.class` | `super("instr", arg0, arg1)` | `instr(str, substr)` | `instr` |
| `Length.class` | `super("length", arg)` | Returns **byte count** | `length` |
| `CharacterLength.class` | `super("character_length", arg)` | Returns **character count** | `character_length` |
| `ConvertTz.class` | `super("convert_tz", castDateTime(arg0), arg1, arg2)` | `(ts, src_tz, dst_tz)` | `convert_tz` |
| `SplitByString.class` | `super("split_by_string", arg0, arg1)` | `(delimiter, string)` | `split_by_string` |
| `EndsWith.class` | `super("ends_with", arg0, arg1)` | `(str, suffix)` | `ends_with` |
| `StartsWith.class` | `super("starts_with", arg0, arg1)` | `(str, prefix)` | `starts_with` |

### 2.4 RewriteWhenAnalyze Rewrite Mechanism

Doris FE provides the `RewriteWhenAnalyze` interface for statically rewriting one function expression into another during the analysis phase. Currently 22 implementation classes exist, primarily among the JSON functions and the Explode family:

- `JsonObject`: auto-wraps value arguments with `ToJson`
- `JsonArray` / `JsonArrayIgnoreNull` / `JsonReplace` / `JsonSet`, etc.
- `ExplodeSplit` / `ExplodeJsonArray*` family

**Key constraint**: `RewriteWhenAnalyze` only performs **static expression tree transformations** and cannot make dynamic decisions based on runtime data (e.g., whether a value is NULL).

### 2.5 Dialect Framework

The `Dialect` enum currently contains `DORIS/TRINO/PRESTO/SPARK/SPARK2/FLINK/HIVE/POSTGRES/SQLSERVER/CLICKHOUSE/ORACLE/STARROCKS` — **`SNOWFLAKE` is not present**. Dialect plugins (e.g., `TrinoDialectConverterPlugin`) only act at the `parseSqlWithDialect()` stage, performing syntax-level SQL text rewriting before Nereids Parser parsing. Function-name/parameter-semantics compatibility **does not depend on** dialect plugins.

### 2.6 Table-Generating Functions

`BuiltinTableGeneratingFunctions` currently contains the `explode` family, the `explode_json_array_*` family, the `explode_json_object` family, etc. It **does not contain** a table-generating function semantically equivalent to Snowflake's `FLATTEN`.

---

## 3. Technical Constraints & Design Principles

### 3.1 Hard Constraints

| ID | Constraint | Impact |
|---|---|---|
| C-1 | Function name case-insensitivity | All registered names are lowercased; `LEN`/`len` are equivalent |
| C-2 | FE constructors hardcode BE function names | FE aliases only affect the Parser, not BE |
| C-3 | RewriteWhenAnalyze supports static transformations only | Cannot dynamically remove JSON keys based on runtime values |
| C-4 | No VARIANT type | Can only approximate Snowflake VARIANT with JSON |
| C-5 | Must not change existing function behavior | Cannot modify existing `SPLIT`/`length` semantics |
| C-6 | The same class must not be registered on multiple lines | `JsonbValid`'s two-line registration needs merging |

### 3.2 Design Trade-offs

| Trade-off | Decision | Rationale |
|---|---|---|
| Dialect plugin vs. function-level aliases | **Prioritize function-level aliases** | The dialect plugin is for syntax-level needs (`LATERAL FLATTEN`, `QUALIFY`) and should not block function compatibility |
| Perfect semantics vs. approximate semantics | **Phased**: V1 approximate, V2 precise | `OBJECT_CONSTRUCT`'s NULL-skip semantics cannot be implemented via FE static rewriting |
| Batch PRs vs. single-function PRs | **Batch by category** | Pure aliases can be handled in one PR covering 16 functions — efficient and easy to review |
| `SPLIT` parameter order conflict | **Preserve existing behavior + document** | The existing `SPLIT` follows the ClickHouse convention (delimiter first); changing it would break existing users |

---

## 4. Key Function Signature Verification (Simulated Calculations)

### 4.1 CHARINDEX Must Map to Locate, Not Instr

**Signature comparison**:

- Snowflake: `CHARINDEX(substr, str[, start])` — substring first, target string second
- Doris `Instr.class`: `instr(str, substr)` — target string first, substring second (**reversed**)
- Doris `Locate.class`: `locate(substr, str[, pos])` — substring first, target string second (**matches**)

**Simulated calculation**:

```sql
-- Original Snowflake statement:
SELECT CHARINDEX('cd', 'abcdef') AS pos;
-- Correct semantics: find 'cd' in 'abcdef', result should be 3

-- If incorrectly mapped to instr (reversed parameter order):
SELECT instr('cd', 'abcdef') AS pos;
-- Actual meaning: find 'abcdef' in 'cd' → returns 0 ❌

-- Correct mapping to locate (matching parameter order):
SELECT locate('cd', 'abcdef') AS pos;  -- returns 3 ✅
```

**Conclusion**: `CHARINDEX` must be registered as an alias for `Locate.class`.
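The same argument-order check can be run as a pure-Python simulation of the two conventions (1-based position, 0 when not found). This is an illustrative sketch only; the real functions are evaluated in Doris BE.

```python
# Simulation of the locate/instr argument conventions compared above.
# Both return a 1-based position, or 0 when the substring is absent.

def locate(substr, s, pos=1):
    # locate(substr, str[, pos]): substring first, like CHARINDEX
    return s.find(substr, pos - 1) + 1  # str.find returns -1 -> 0

def instr(s, substr):
    # instr(str, substr): target string first (reversed)
    return s.find(substr) + 1

# CHARINDEX('cd', 'abcdef') keeps its argument order under locate:
assert locate("cd", "abcdef") == 3
# Reusing instr with Snowflake's argument order searches for 'abcdef'
# inside 'cd' and silently returns 0:
assert instr("cd", "abcdef") == 0
```

The silent `0` is the dangerous part: a wrong alias would not error out, it would just return wrong positions.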
### 4.2 LEN Must Map to CharacterLength, Not Length

**Semantic difference**:

- Snowflake `LEN(str)`: returns the **character count**
- Doris `Length.class` (`length`/`octet_length`): returns the **byte count**
- Doris `CharacterLength.class` (`char_length`/`character_length`): returns the **character count**

**Simulated calculation (UTF-8 multi-byte characters)**:

```sql
-- Snowflake:
SELECT LEN('你好');  -- returns 2 (2 characters)

-- If mapped to length (bytes):
SELECT length('你好');  -- Doris returns 6 (3 bytes per Chinese character) ❌

-- Correct mapping to char_length (characters):
SELECT char_length('你好');  -- returns 2 ✅
```

**Conclusion**: `LEN` must be registered as an alias for `CharacterLength.class`.

### 4.3 CONVERT_TIMEZONE Cannot Be a Pure Alias — Requires Parameter Reordering

**Signature comparison**:

- Snowflake: `CONVERT_TIMEZONE(source_tz, target_tz, timestamp)` — timezones first
- Doris `ConvertTz.class`: `convert_tz(timestamp, source_tz, target_tz)` — timestamp first

**Simulated calculation**:

```sql
-- Snowflake:
SELECT CONVERT_TIMEZONE('America/New_York', 'Asia/Shanghai', '2024-01-01 12:00:00');
-- Convert New York time to Shanghai time, result ≈ '2024-01-02 01:00:00'

-- If used directly as an alias (no parameter reordering):
SELECT convert_tz('America/New_York', 'Asia/Shanghai', '2024-01-01 12:00:00');
-- ts='America/New_York' (invalid timestamp), src='Asia/Shanghai', dst='...' → NULL or error ❌
```

**Conclusion**: `CONVERT_TIMEZONE` must be implemented via FE expression rewriting that reorders the parameters: `(src, dst, ts)` → `(ts, src, dst)`.

### 4.4 SPLIT Has a Parameter Order Conflict (Known Risk)

**Current status**: `SplitByString.class` has `"split"` registered as an alias, with parameter order `split_by_string(delimiter, string)` — the **ClickHouse convention** (delimiter first).

**Conflict**: Snowflake's `SPLIT(string, delimiter)` has the **exact opposite** parameter order (string first).
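The conflict, and the dialect-level swap proposed later, can be made concrete with a small simulation (a hedged sketch: `split_by_string` here just models the delimiter-first convention from §2.3, and `snowflake_split` is a hypothetical name for the swapped form).

```python
def split_by_string(delimiter, s):
    # Existing Doris/ClickHouse convention: the delimiter is the FIRST
    # argument, the string to split is the second.
    return s.split(delimiter)

def snowflake_split(s, delimiter):
    # Snowflake convention: string first. A dialect-level rewrite would
    # simply swap the arguments before delegating.
    return split_by_string(delimiter, s)

# A Snowflake user's SPLIT('hello,world', ',') fed straight into the
# existing alias splits ',' using 'hello,world' as the delimiter:
assert split_by_string("hello,world", ",") == [","]
# With the dialect-level argument swap, the intended result comes back:
assert snowflake_split("hello,world", ",") == ["hello", "world"]
```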
**Simulated calculation**:

```sql
-- What a Snowflake user writes:
SELECT SPLIT('hello,world', ',');
-- Expected result: ['hello', 'world']

-- Current Doris behavior:
SELECT SPLIT('hello,world', ',');
-- Actual meaning: split_by_string('hello,world', ',')
-- Treats 'hello,world' as the delimiter and ',' as the string to split → unexpected result ❌
```

**Constraint analysis**:

- Per G-5 (must not change existing function behavior), the existing `SPLIT` alias parameter order cannot be modified
- Snowflake-semantics `SPLIT` can only be differentiated at the **dialect level** (`sql_dialect = 'snowflake'`)

**Conclusion**: The `SPLIT` parameter conflict is a known limitation. Short term, document the difference; long term, handle the parameter reordering in the dialect plugin.

### 4.5 OBJECT_CONSTRUCT FE Static Rewriting Is Infeasible

The key semantic of Snowflake's `OBJECT_CONSTRUCT` is: **when a value is NULL, skip that key**. Doris `JsonObject`'s `rewriteWhenAnalyze` only wraps values with `ToJson` and cannot dynamically remove keys. Implementing NULL-skipping via FE static rewriting would require enumerating every NULL combination of the values, yielding `2^N` branches:

| Key-Value Pair Count N | NULL Combinations to Enumerate | Feasibility |
|---|---|---|
| 2 | 4 | Barely acceptable |
| 5 | 32 | Very complex |
| 10 | 1,024 | Unacceptable |
| 20 | 1,048,576 | Impossible |

**Conclusion**: Adopt a two-phase strategy — **V1 approximation (retain `"key": null`) + V2 new BE function**.

### 4.6 FLATTEN Cannot Reuse array_flatten

| Dimension | Doris `array_flatten` | Snowflake `FLATTEN` |
|---|---|---|
| Type | Scalar function | Lateral table function |
| Input | `ARRAY<ARRAY<T>>` | VARIANT / ARRAY / OBJECT |
| Output | `ARRAY<T>` (single row, single column) | Multi-row, multi-column (SEQ, KEY, PATH, INDEX, VALUE, THIS) |
| Parameters | `input` only | `input` + `path/outer/recursive/mode` |

**Conclusion**: An independent TVF implementation is required.

---

## 5. Function Landscape Classification Plan

### 5.1 Classification Dimensions

| Category | Definition | Effort |
|---|---|---|
| **Tier 0: Already Fully Compatible** | Function name/semantics/parameter order all match | Zero |
| **Tier 1: Pure Aliases** | Only FE alias registration needed | Very low |
| **Tier 2: FE Expression Rewriting** | Rewrite via `RewriteWhenAnalyze`, zero BE changes | Low |
| **Tier 3: New Scalar Functions** | Requires new FE + BE implementations | Medium-high |
| **Tier 4: New Table Function** | `FLATTEN` TVF, FE + BE | High |
| **Tier 5: Syntax-Level Compatibility** | Depends on the dialect plugin | Deferred |

### 5.2 Tier 0: Already Fully Compatible (Zero Effort)

The following Doris functions already have semantics highly consistent with Snowflake — no development work is needed:

| Category | Functions |
|---|---|
| CAST family | `TRY_CAST` |
| NULL handling | `NULLIF`, `NVL` (`ifnull` alias), `COALESCE` |
| Boolean aggregation | `BOOLAND_AGG` / `BOOLOR_AGG` / `BOOLXOR_AGG` |
| Regression functions | `REGR_SLOPE` / `REGR_INTERCEPT` |
| String operations | `REGEXP_REPLACE`, `REGEXP_SUBSTR` (approximate), `LEFT` / `RIGHT` / `LPAD` / `RPAD` / `TRIM` / `REPLACE` |
| Array operations | `ARRAY_SIZE` / `ARRAY_DISTINCT` / `ARRAY_CONTAINS` (work directly thanks to case-insensitivity) |

### 5.3 Tier 1: Pure Aliases (16 items)

Map Snowflake function names to existing Doris functions via FE-layer aliases, requiring no BE changes. All mappings have been verified through signature comparison and simulated calculation.
| Snowflake Function | Doris Mapping Class | FE Registration | Verification Notes |
|---|---|---|---|
| `IFF(cond,t,f)` | `If.class` | `scalar(If.class, "if", "iff")` | Semantics fully identical |
| `ARRAY_CONSTRUCT(...)` | `Array.class` | Append `"array_construct"` | Array construction |
| `ARRAY_CAT(a1,a2)` | `ArrayConcat.class` | Append `"array_cat"` | Semantics identical |
| `ARRAY_TO_STRING(arr,sep)` | `ArrayJoin.class` | Append `"array_to_string"` | Semantics identical |
| `PARSE_JSON(str)` | `JsonbParse.class` | Append `"parse_json"` | Returns JSON; VARIANT difference documented |
| `TRY_PARSE_JSON(str)` | `JsonbParseErrorToNull.class` | Append `"try_parse_json"` | Returns NULL on error |
| `CHECK_JSON(str)` | `JsonbValid.class` | Merge into `"json_valid", "jsonb_valid", "check_json"` | Also fixes the duplicate registration |
| `BASE64_ENCODE(str)` | `ToBase64.class` | Append `"base64_encode"` | |
| `BASE64_DECODE_STRING(str)` | `FromBase64.class` | Append `"base64_decode_string"` | |
| `HEX_ENCODE(str)` | `Hex.class` | Append `"hex_encode"` | |
| `HEX_DECODE_STRING(str)` | `Unhex.class` | Append `"hex_decode_string"` | |
| `ENDSWITH(str,sfx)` | `EndsWith.class` | Append `"endswith"` | `(str, suffix)` order matches |
| `STARTSWITH(str,pfx)` | `StartsWith.class` | Append `"startswith"` | `(str, prefix)` order matches |
| `GETDATE()` | `Now.class` | Append `"getdate"` | Returns current timestamp |
| `LEN(str)` | **`CharacterLength.class`** | Append `"len"` | **Character length**, not byte length |
| `CHARINDEX(sub,str[,pos])` | **`Locate.class`** | Append `"charindex"` | **Parameter order matches**; not Instr |

### 5.4 Tier 2: FE Expression Rewriting (7 items)

Rewrite into combinations of existing functions via the FE-layer `RewriteWhenAnalyze`, achieving zero BE changes.
| Snowflake Function | Rewrite Target | Rewrite Logic | Notes |
|---|---|---|---|
| `ZEROIFNULL(x)` | `COALESCE(x, 0)` | `→ new Coalesce(child(0), new IntegerLiteral(0))` | Return type follows x |
| `NULLIFZERO(x)` | `NULLIF(x, 0)` | `→ new NullIf(child(0), new IntegerLiteral(0))` | Behavior identical |
| `NVL2(e1,e2,e3)` | `IF(e1 IS NOT NULL, e2, e3)` | `→ new If(new Not(new IsNull(child(0))), child(1), child(2))` | Independent from `Nvl` (2-arg ifnull) |
| `EQUAL_NULL(a,b)` | `a <=> b` | `→ new NullSafeEqual(child(0), child(1))` | Leverages the Doris `<=>` operator |
| `CONVERT_TIMEZONE(src,dst,ts)` | `convert_tz(ts,src,dst)` | `→ new ConvertTz(child(2), child(0), child(1))` | Parameter reordering |
| `TO_VARCHAR(expr[,fmt])` | `date_format` / `CAST` | With `fmt` → `date_format`; without `fmt` → `CAST AS VARCHAR` | Format specifier differences require documentation |
| `OBJECT_CONSTRUCT(k,v,...)` (V1) | `json_object(k,v,...)` | `→ new JsonObject(children...)` | **V1 approximation**: does NOT skip NULL keys |

> **SPLIT is excluded from this tier**: the existing `SPLIT` alias follows the ClickHouse convention (delimiter first), which conflicts with Snowflake's (string first). Modifying it would violate G-5; it must be handled in the dialect plugin (see Section 7 — Known Risks).

> **OBJECT_CONSTRUCT special note**: V1 only achieves approximate semantics (it retains `"key": null`) and cannot replicate Snowflake's "skip keys whose value is NULL" behavior. The difference must be explicitly documented in the migration guides. A precise V2 implementation is planned in Tier 3.
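The V1/V2 gap for `OBJECT_CONSTRUCT` can be expressed as a small semantic sketch (illustrative only; both function names below are hypothetical Python stand-ins for the SQL behavior):

```python
# V1 approximation: behaves like json_object, retaining "key": null.
def object_construct_v1(*kv):
    return {kv[i]: kv[i + 1] for i in range(0, len(kv), 2)}

# Target V2 semantics: skip any pair whose value is NULL. The skip is a
# per-row runtime decision, which is exactly why a static FE rewrite
# cannot do it without enumerating 2^N NULL combinations.
def object_construct_snowflake(*kv):
    return {kv[i]: kv[i + 1] for i in range(0, len(kv), 2)
            if kv[i + 1] is not None}

assert object_construct_v1("a", 1, "b", None) == {"a": 1, "b": None}
assert object_construct_snowflake("a", 1, "b", None) == {"a": 1}
```

Migration guides should show exactly this pair of outputs so users know what V1 will return for NULL values.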
### 5.5 Tier 3: New Scalar Functions (6 items)

| Function | Type | Key Implementation Points |
|---|---|---|
| `STRTOK(str,delim,partnum)` | New FE + BE impl | `delim` is a **character set** (each character is an independent delimiter), unlike `split_part`'s string delimiter |
| `STRTOK_TO_ARRAY(str,delim)` | New FE + BE impl | Returns `ARRAY<VARCHAR>`; shares BE splitting logic with `STRTOK` |
| `DATE_FROM_PARTS(y,m,d)` | New FE + BE impl | Similar to `makedate`, but takes (year, month, day) vs. `makedate(year, day_of_year)` |
| `TIMESTAMP_FROM_PARTS(...)` | New FE + BE impl | Multiple overloads: supports nanosecond and timezone parameters |
| `TYPEOF(expr)` | FE-level impl | Returns the expression's type name as a string; can leverage `getDataType().toSql()` + constant folding |
| `OBJECT_CONSTRUCT` (V2) | New BE impl | Precisely implements Snowflake's dynamic NULL-value-key-skipping semantics |

### 5.6 Tier 4: New Table Function — FLATTEN

- FE: Add `Flatten.java` / `FlattenOuter.java`, extending `TableGeneratingFunction`
- BE: Add `vflatten_table_function.{h,cpp}`
- Output 6 columns: `SEQ` (sequence), `KEY` (key name), `PATH` (path), `INDEX` (array index), `VALUE` (value), `THIS` (current level)
- V1 supports single-parameter `FLATTEN(input)` only, handling ARRAY/JSON input
- Later iterations add `path/outer/recursive/mode` parameter support

### 5.7 Tier 5: Syntax-Level Compatibility (Depends on Dialect Plugin)

| Syntax | Description | Implementation |
|---|---|---|
| `LATERAL FLATTEN(...)` | Lateral unnesting | `SnowflakeDialectConverterPlugin` SQL text-level transformation |
| `QUALIFY` clause | Window function filtering | Rewrite as subquery + `WHERE` |
| `expr::type` | Type cast syntax | Rewrite as `CAST(expr AS type)` |
| `SPLIT(str, delim)` | Parameter order adaptation | Swap parameters under dialect mode, then delegate to `split_by_string` |

These require adding `SNOWFLAKE("snowflake")` to the `Dialect` enum and creating a `fe_plugins/snowflake-converter/` module, referencing `TrinoDialectConverterPlugin`. Downgraded to P3 priority — it does not block the function-level compatibility work.

### Solution

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
