alamb opened a new issue, #23086: URL: https://github.com/apache/datafusion/issues/23086
### Is your feature request related to a problem or challenge? DataFusion's ClickBench benchmark queries currently use `length(...)` for string-length aggregations: - [`benchmarks/queries/clickbench/queries/q27.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries/q27.sql) - [`benchmarks/queries/clickbench/queries/q28.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries/q28.sql) - [`benchmarks/queries/clickbench/extended/q9.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/extended/q9.sql) - [`benchmarks/queries/clickbench/extended/q10.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/extended/q10.sql) In DataFusion, [`length`](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#length) is an alias of [`character_length`](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#character_length), which returns the number of characters in a string. DataFusion's [`octet_length`](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#octet_length) returns the length of a string in bytes. This differs from the benchmark semantics used by ClickHouse and DuckDB in ClickBench: - The upstream ClickBench ClickHouse queries use `length(URL)` / `length(Referer)` in Q27 and Q28: [`ClickHouse/ClickBench/clickhouse/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/queries.sql#L28-L29). In ClickHouse, `length` is byte-oriented; the ClickHouse string function docs distinguish this from [`lengthUTF8`](https://clickhouse.com/docs/sql-reference/functions/string-functions#lengthutf8), which returns Unicode code points rather than bytes. - The upstream ClickBench DuckDB queries use `STRLEN(URL)` / `STRLEN(Referer)` in Q27 and Q28: [`ClickHouse/ClickBench/duckdb/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/duckdb/queries.sql#L28-L29), [`duckdb-parquet/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L28-L29), and [`duckdb-parquet-partitioned/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet-partitioned/queries.sql#L28-L29). DuckDB documents [`strlen(string)`](https://duckdb.org/docs/stable/sql/functions/text#strlenstring) as returning the number of bytes in a string, while [`length(string)`](https://duckdb.org/docs/stable/sql/functions/text#lengthstring) returns the number of characters. Because byte length can be computed from string offsets, while character length must inspect UTF-8 contents, using `octet_length` in these benchmark queries may avoid unnecessary work and improve ClickBench performance without changing the intended benchmark semantics. ### Describe the solution you'd like Update DataFusion's ClickBench SQL benchmark queries to use `octet_length(...)` where the upstream ClickBench / DuckDB benchmark is measuring byte length. Likely changes: - `AVG(length("URL"))` -> `AVG(octet_length("URL"))` in Q27 - `AVG(length("Referer"))` -> `AVG(octet_length("Referer"))` in Q28 - Review the extended ClickBench Q9/Q10 `LENGTH(...)` usages and switch to `octet_length(...)` if those queries should also preserve ClickHouse byte-length semantics. After changing the benchmark SQL, compare ClickBench timings before and after, especially Q27 and Q28. ### Describe alternatives you've considered Leave the queries as-is. That preserves current DataFusion SQL semantics, but likely benchmarks character-counting work that ClickHouse and DuckDB are not doing for Q27 and Q28 in ClickBench. Add a ClickHouse compatibility alias where `length` means byte length. I do not think this is appropriate globally because DataFusion currently documents `length` as an alias of `character_length`. ### Additional context Related ClickBench tracking: - #18489 - #21706 - #22633 - #22804 - #22807 Suggested update to #18489 checklist: ```markdown - [ ] Use `octet_length` instead of `length` in ClickBench SQL where upstream benchmark semantics are byte length: <new issue URL> ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
