This is an automated email from the ASF dual-hosted git repository. yiguolei pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 9d15ab7c04a [fix] arrow flight (#2211) 9d15ab7c04a is described below commit 9d15ab7c04a0fb7d2360141e51e52caaf817e2e9 Author: Xinyi Zou <zouxiny...@gmail.com> AuthorDate: Wed Mar 19 12:33:50 2025 +0800 [fix] arrow flight (#2211) ## Versions - [x] dev - [x] 3.0 - [x] 2.1 - [ ] 2.0 ## Languages - [ ] Chinese - [ ] English ## Docs Checklist - [ ] Checked by AI - [ ] Test Cases Built --- docs/db-connect/arrow-flight-sql-connect.md | 152 +++++++++++++++++++-- .../current/db-connect/arrow-flight-sql-connect.md | 152 +++++++++++++++++++-- .../db-connect/arrow-flight-sql-connect.md | 152 +++++++++++++++++++-- .../db-connect/arrow-flight-sql-connect.md | 152 +++++++++++++++++++-- .../db-connect/arrow-flight-sql-connect.md | 152 +++++++++++++++++++-- .../db-connect/arrow-flight-sql-connect.md | 152 +++++++++++++++++++-- 6 files changed, 840 insertions(+), 72 deletions(-) diff --git a/docs/db-connect/arrow-flight-sql-connect.md b/docs/db-connect/arrow-flight-sql-connect.md index a122ff16b8d..b2a10f63c73 100644 --- a/docs/db-connect/arrow-flight-sql-connect.md +++ b/docs/db-connect/arrow-flight-sql-connect.md @@ -451,7 +451,9 @@ try ( ### Choice of Jdbc and Java connection methods -Compared with the traditional `jdbc:mysql` connection method, the performance test of the Arrow Flight SQL connection method of Jdbc and Java can be found at [GitHub Issue](https://github.com/apache/doris/issues/25514). Here are some usage suggestions based on the test conclusions. +[JDBC/Java Arrow Flight SQL Sample](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/README.md) is a JDBC/Java demo using Arrow FLight SQL. You can use it to test various connection methods for sending queries to Arrow Flight Server, helping you understand how to use Arrow FLight SQL and test performance. Implemented in [Add Arrow Flight Sql demo for Java](https://github.com/apache/doris/pull/45306). + +Compared with the traditional `jdbc:mysql` connection method, the performance test of the Arrow Flight SQL connection method of Jdbc and Java can be found in Section 6.2 of [GitHub Issue](https://github.com/apache/doris/issues/25514). Here are some usage suggestions based on the test conclusions. 1. For the above three Java Arrow Flight SQL connection methods, if the subsequent data analysis will be based on the row-based data format, then use jdbc:arrow-flight-sql, which will return data in the JDBC ResultSet format; if the subsequent data analysis can be based on the Arrow format or other column-based data formats, then use Flight AdbcDriver or Flight JdbcDriver to directly return data in the Arrow format, which will avoid row-column conversion and use the characteristics of Ar [...] @@ -471,26 +473,152 @@ Arrow Flight currently has no official plan to support Spark and Flink ([GitHub The community previously referred to the open source [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connector) and used FlightClient in Spark to connect to Doris for testing. It was found that the data format conversion between Arrow and Doris Block is faster, which is 10 times the conversion speed between CSV format and Doris Block, and it has better support for complex types such as Map and Array. This is because the Arrow data format has a high compression rate and lo [...] +### Support BI tools + +Since Doris v2.1.8, BI tools such as DBeaver are supported to connect to Doris using the `arrow-flight-sql` protocol. 
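The JDBC URL these tools use is the same `jdbc:arrow-flight-sql` form shown in the FAQ below. The following is a minimal sketch, not the official sample: it assumes the Arrow `flight-sql-jdbc-driver` jar is on the classpath, that `{FE_HOST}` and `{arrow_flight_sql_port}` are replaced with your FE address and the port configured in `fe.conf`, and that `root` with an empty password is a valid Doris account.

```java
// Hedged sketch: open a jdbc:arrow-flight-sql connection, run one query,
// and print the first column of each row. Replace the {placeholders} first.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ArrowFlightSqlJdbcSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:arrow-flight-sql://{FE_HOST}:{arrow_flight_sql_port}"
                + "?useServerPrepStmts=false&cachePrepStmts=true&useEncryption=false";
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println(rs.getObject(1));
            }
        }
    }
}
```

A BI tool such as DBeaver is configured with essentially the same driver and URL in its connection settings.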
For DBeaver's method of connecting to Doris using the `arrow-flight-sql` Driver, refer to: [how-to-use-jdbc-driver-with-dbeaver-client](https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/#h-how-to-use-jdbc-driver-with-dbeaver-client), [client-applications/clients/dbeaver/](https://docs.dremio.com/current/sonar/client-applications/clients/dbeaver/?_gl=1*1epgwh0*_ [...] + +## Extended Application + +### Multiple BEs return results in parallel + +Doris will aggregate the results of a query on all BE nodes to one BE node by default. In Mysql/JDBC queries, FE will request query results from this aggregated data node. In Arrow Flight SQL queries, FE will wrap the IP/Port of this node in the Endpoint and return it to ADBC Client. ADBC Client will request the BE node corresponding to this Endpoint to pull data. + +If the query is just a simple Select to pull data from Doris, without Join, Sort, Window Function and other operators with data Shuffle behavior, the query can be split according to Tablet granularity. Now Doris Spark/Flink Connector uses this method to implement parallel data reading, which is divided into two steps: +1. Execute `explain sql`, and the ScanOperator in the query plan returned by FE contains all Tablet ID Lists of Scan. +2. Split the original SQL into multiple SQLs based on the Tablet ID List above. Each SQL only reads part of the Tablet. The usage is similar to `SELECT * FROM t1 TABLET(10001,10002) limit 1000;`. The multiple SQLs after splitting can be executed in parallel. Refer to [Support select table sample](https://github.com/apache/doris/pull/10170). + +If the outermost layer of the query is aggregation, the SQL is similar to `select k1, sum(k2) from xxx group by k1`. After Doris v3.4, execute `set enable_parallel_result_sink=true;` to allow each BE node of a query to return query results independently. After receiving the Endpoint list returned by FE, ADBC Client pulls results from multiple BE nodes in parallel. However, please note that when the aggregation result is very small, returning multiple BEs will increase the pressure on R [...] + +### Multiple BEs share the same IP accessible from outside the cluster + +If there is a Doris cluster, its FE node can be accessed from outside the cluster, and all its BE nodes can only be accessed from inside the cluster. This is fine when using Mysql Client and JDBC to connect to Doris to execute queries, and the query results will be returned by the Doris FE node. However, using Arrow Flight SQL to connect to Doris cannot execute queries, because ADBC Client needs to connect to the Doris BE node to pull query results, but the Doris BE node is not allowed [...] + +In a production environment, it is often inconvenient to expose the Doris BE node outside the cluster. However, you can add a reverse proxy (such as Nginx) to all Doris BE nodes. When the client outside the cluster connects to Nginx, it will be randomly routed to a Doris BE node. By default, the Arrow Flight SQL query results will be randomly saved on a Doris BE node. If it is different from the Doris BE node randomly routed by Nginx, data forwarding is required within the Doris BE node. + +Starting from Doris v2.1.8, you can configure `public_host` and `arrow_flight_sql_proxy_port` in `be.conf` of all Doris BE nodes to the IP and port shared by multiple Doris BE nodes and accessible outside the cluster. The query results can be correctly forwarded and returned to the ADBC Client. 
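On the proxy side, the sketch below is one possible Nginx `stream` (TCP) configuration; it is an illustrative assumption only, and the upstream entries must be replaced with your own BE hosts and the `arrow_flight_sql_port` value from `be.conf`.

```conf
# Hypothetical Nginx stream proxy in front of the BE Arrow Flight SQL ports.
stream {
    upstream doris_be_arrow_flight {
        server {be1 ip}:{be.conf:arrow_flight_sql_port};
        server {be2 ip}:{be.conf:arrow_flight_sql_port};
        server {be3 ip}:{be.conf:arrow_flight_sql_port};
    }
    server {
        listen {nginx port};
        proxy_pass doris_be_arrow_flight;
    }
}
```

With such a proxy in place, point `public_host` and `arrow_flight_sql_proxy_port` in every BE node's `be.conf` at the shared Nginx address: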
+```conf +public_host={nginx ip} +arrow_flight_sql_proxy_port={nginx port} +``` + ## FAQ -1. ARM environment reports an error `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`. If the Linux kernel version is <= 4.19.90, you need to upgrade to 4.19.279 or above, or recompile Doris BE in the environment of the lower version of the Linux kernel. For specific compilation methods, refer to the document <docs/dev/install/source-install/compilation-arm> +1. Q: ARM environment reports an error `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`. + +A: If the Linux kernel version is <= 4.19.90, you need to upgrade to 4.19.279 or above, or recompile Doris BE in the environment of the lower version of the Linux kernel. For specific compilation methods, refer to the document <docs/dev/install/source-install/compilation-arm> + +Cause: This is because there is a compatibility issue between the old version of the Linux kernel and Arrow. `cpp: arrow::RecordBatch::MakeEmpty()` will get stuck when constructing Arrow Record Batch, causing Doris BE's Arrow Flight Server to fail to respond to Doris FE's Arrow Flight Server's RPC request within 5000ms, causing FE to return rpc timeout failed to Client. When Spark and Flink read Doris, they also convert the query results into Arrow Record Batch and return them, so the sa [...] + +The Linux kernel version of kylinv10 SP2 and SP3 is only 4.19.90-24.4.v2101.ky10.aarch64 at most. The kernel version cannot be upgraded further. Doris BE can only be recompiled on kylinv10. If the problem still exists after compiling Doris BE with the new version of ldb_toolchain, you can try to compile it with the lower version of ldb_toolchain v0.17. If your ARM environment cannot connect to the external network, Huawei Cloud provides ARM + kylinv10, and Alibaba Cloud provides x86 + kylinv10 + +2. Q: Prepared statement passes parameters and reports errors. + +A: Currently, `jdbc:arrow-flight-sql` and Java ADBC/JDBCDriver do not support prepared statement parameter passing. For example, `select * from xxx where id=?` will report an error `parameter ordinal 1 out of range`. This is a bug in Arrow Flight SQL ([GitHub Issue](https://github.com/apache/arrow/issues/40118)). + +3. Q: How to modify the batch size read by `jdbc:arrow-flight-sql` each time to improve performance in some scenarios. + +A: By modifying `setTargetBatchSize` in the `makeJdbcConfig` method in the `org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader` file, the default is 1024, and then saving the modified file to the local directory with the same path name, so as to overwrite the original file and take effect. + +4. Q: ADBC v0.10, JDBC and Java ADBC/JDBCDriver do not support parallel reading. + +A: The `stmt.executePartitioned()` method is not implemented. You can only use the native FlightClient to implement parallel reading of multiple endpoints, using the method `sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`. In addition, the default AdbcStatement of ADBC V0.10 is actually JdbcStatement. After executeQuery, the row-format JDBC ResultSet is converted back to the Arrow column format. I [...] + +5. Q: Specify the database name in the URL. + +A: As of Arrow v15.0, Arrow JDBC Connector does not support specifying database name in URL. 
For example, specifying connection to `test` database in `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` is invalid, and you can only execute SQL `use database` manually. Arrow v18.0 supports specifying database name in URL, but there are still bugs in actual testing. + +6. Q: Python ADBC prints `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant`. + +A: Ignore this Warning when using Python. This is a problem with Python ADBC Client and will not affect queries. + +7. Q: Python reports an error `grpc: received message larger than max (20748753 vs. 16777216)`. + +A: Refer to [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) and add `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value` in Database Option. + +8. Q: Error `invalid bearer token` is reported. + +A: Execute `SET PROPERTY FOR 'root' 'max_user_connections' = '10000';` to change the current maximum number of connections for the current user to 10000; add qe_max_connection=30000 and arrow_flight_token_cache_size=8000 in `fe.conf` and restart FE. + +The connection between the ADBC Client and the Arrow Flight Server is essentially a long link, which requires Auth Token, Connection, and Session to be cached on the Server. After the connection is created, it will not be disconnected immediately at the end of a single query. The Client needs to send a close() request to clean it up, but in fact, the Client often does not send a close request, so the Auth Token, Connection, and Session will be saved on the Arrow Flight Server for a lon [...] + +As of Doris v2.1.8, Arrow Flight connections and Mysql/JDBC connections use the same connection limit, including the total number of connections of all FE users `qe_max_connection` and the number of connections of a single user `max_user_connections` in `UserProperty`. But the default `qe_max_connection` and `max_user_connections` are 1024 and 100 respectively. Arrow Flight SQL is often used to replace JDBC scenarios, but the JDBC connection will be released immediately after the query e [...] + +9. Q: Java Arrow Flight SQL reads the Datatime type and returns a timestamp instead of the formatted time. + +A: Java Arrow Flight SQL needs to convert the timestamp by itself when reading the Datatime type. Refer to [Add java parsing datetime type in arrow flight sql sample #48578](https://github.com/apache/doris/pull/48578). Using Python Arrow Flight SQL to read the Datatime type returns `2025-03-03 17:23:28Z`, while Java Arrow Flight SQL returns `1740993808`. + +10. Q: Java Arrow Flight JDBC Client reports an error `Configuration does not provide a mapping for array column 2` when reading Array nested types. + +A: Refer to [`sample/arrow-flight-sql`](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/src/main/java/doris/arrowflight/demo/FlightAdbcDriver.java) to use JAVA ADBC Client. + +Python ADBC Client, JAVA ADBC Client, and Java JDBC DriverManager all have no problem reading Array nested types. Only Java Arrow Flight JDBC Client has problems. In fact, the compatibility of Arrow Flight JDBC Client is not guaranteed. It is not officially developed by Arrow, but by a third-party database company Dremio. Other compatibility issues have been found before, so it is recommended to use JAVA ADBC Client first. + +## Release Note + +> Arrow Flight SQL protocol is supported since Doris 2.1. 
As of Doris 2.1.9, the issues fixed are listed based on Doris 2.1 series versions. Doris 3.0 series versions are self-checked. + +### v2.1.9 + +1. Fix the problem of Doris data serialization to Arrow. +[Fix UT DataTypeSerDeArrowTest of Array/Map/Struct/Bitmap/HLL/Decimal256 types](https://github.com/apache/doris/pull/48944) +- Failed to read `Decimal256` type; +- Subtle error in reading `DatetimeV2` type; +- Incorrect result in reading `DateV2` type; +- Error when reading `IPV4/IPV6` type result is NULL; + +2. Fix the problem that Doris Arrow Flight SQL query fails and returns empty result, without returning real error information. +[Fix query result is empty and not return query error message](https://github.com/apache/doris/pull/45023) + +### v2.1.8 + +1. Support BI tools such as DBeaver to connect to Doris using the `arrow-flight-sql` protocol, and support the correct display of metadata trees. +[Support arrow-flight-sql protocol getStreamCatalogs, getStreamSchemas, getStreamTables #46217](https://github.com/apache/doris/pull/46217). + +2. When multiple BEs share the same IP that is accessible to the outside of the cluster, the query results can be correctly forwarded and returned to the ADBC Client. +[Arrow flight server supports data forwarding when BE uses public vip](https://github.com/apache/doris/pull/43281) + +3. Support multiple endpoints to read in parallel. +[Arrow Flight support multiple endpoints](https://github.com/apache/doris/pull/44286) + +4. Fix query error `FE not found arrow flight schema`. +[Fix FE not found arrow flight schema](https://github.com/apache/doris/pull/43960) + +5. Fix error `BooleanBuilder::AppendValues` when reading columns that allow NULL. +[Fix Doris NULL column conversion to arrow batch](https://github.com/apache/doris/pull/43929) + +6. Fix `show processlist` displays duplicate Connection IDs. +[Fix arrow-flight-sql ConnectContext to use a unified ID #46284](https://github.com/apache/doris/pull/46284) + +### v2.1.7 + +1. Fix frequent log printing `Connection wait_timeout`. +[Fix kill timeout FlightSqlConnection and FlightSqlConnectProcessor close](https://github.com/apache/doris/pull/41770) + +2. Fix Arrow Flight Bearer Token expiration from Cache. +[Fix Arrow Flight bearer token cache evict after expired](https://github.com/apache/doris/pull/41754) + +### v2.1.6 -Problem cause: This is because there is a compatibility issue between the old version of the Linux kernel and Arrow. `cpp: arrow::RecordBatch::MakeEmpty()` will get stuck when constructing Arrow Record Batch, causing Doris BE's Arrow Flight Server to fail to respond to Doris FE's Arrow Flight Server RPC request within 5000ms, causing FE to return rpc timeout failed to Client. When Spark and Flink read Doris, they also convert the query results into Arrow Record Batch and return them, so [...] +1. Fix query error `0.0.0.0:xxx, connection refused`. +[Fix return result from FE Arrow Flight server error 0.0.0.0:xxx, connection refused](https://github.com/apache/doris/pull/40002) -The Linux kernel version of kylinv10 SP2 and SP3 is only up to 4.19.90-24.4.v2101.ky10.aarch64. You cannot continue to upgrade the kernel version. You can only recompile Doris BE on kylinv10. If the problem still exists after compiling Doris BE with the new version of ldb_toolchain, you can try to compile it with the lower version of ldb_toolchain v0.17. If your ARM environment cannot connect to the Internet, Huawei Cloud provides ARM + kylinv10, and Alibaba Cloud provides x86 + kylinv10 +2. 
Fix query error `Reach limit of connections`. +[Fix exceed user property max connection cause Reach limit of connections #39127](https://github.com/apache/doris/pull/39127) -2. Currently, `jdbc:arrow-flight-sql` and Java ADBC/JDBCDriver do not support prepared statement passing parameters. For example, `select * from xxx where id=?` will report an error `parameter ordinal 1 out of range`. This is a bug in Arrow Flight SQL ([GitHub Issue](https://github.com/apache/arrow/issues/40118)) +In previous versions, execute `SET PROPERTY FOR 'root' 'max_user_connections' = '1024';` to modify the current maximum number of connections for the current user to 1024, which can be temporarily circumvented. -3. Modification `jdbc:arrow-flight-sql` The batch size of each read can improve performance in some scenarios. By modifying the `setTargetBatchSize` in the `makeJdbcConfig` method in the `org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader` file, the default is 1024, and then saving the modified file to the local directory with the same name, it will overwrite the original file to take effect. +Because the previous version only limits the number of Arrow Flight connections to less than `qe_max_connection/2`, `qe_max_connection` is the total number of connections for all fe users, the default is 1024, and does not limit the number of Arrow Flight connections for a single user to less than `max_user_connections` in `UserProperty`, the default is 100, so when the number of Arrow Flight connections exceeds the upper limit of the current user's connection number, an error `Reach lim [...] -4. ADBC v0.10, JDBC and Java ADBC/JDBCDriver do not support parallel reading, and the `stmt.executePartitioned()` method is not implemented. You can only use the native FlightClient to implement parallel reading of multiple Endpoints, using the method `sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`. In addition, the default AdbcStatement of ADBC V0.10 is actually JdbcStatement. After executeQue [...] +For details of the problem, see: [Questions](https://ask.selectdb.com/questions/D18b1/2-1-4-ban-ben-python-shi-yong-arrow-flight-sql-lian-jie-bu-hui-duan-kai-lian-jie-shu-zhan-man-da-dao-100/E1ic1?commentId=10070000000005324) -5. As of Arrow v15.0, Arrow JDBC Connector does not support specifying the database name in the URL. For example, `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` specifies that the connection to the `test` database is invalid. You can only execute the SQL `use database` manually. +3. Add Conf `arrow_flight_result_sink_buffer_size_rows` to support modifying the ArrowBatch size of query results returned in a single time, the default is 4096 * 8. +[Add config arrow_flight_result_sink_buffer_size_rows](https://github.com/apache/doris/pull/38221) -6. There is a bug in Doris 2.1.4 version. There is a chance of error when reading large amounts of data. This bug is fixed in [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) PR. Upgrading Doris 2.1.5 version can solve this problem. For details of the problem, see: [Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) +### v2.1.5 -7. `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant` Ignore this warning when using Python. 
This is a problem with the Python ADBC Client and will not affect the query. +1. Fix the problem that Arrow Flight SQL query results are empty. +[Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) -8. Python reports an error `grpc: received message larger than max (20748753 vs. 16777216)`. Refer to [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) to add `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value` in Database Option. +Doris v2.1.4 may report an error when reading large amounts of data. For details, see: [Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) -9. Before Doris version 2.1.7, the error `Reach limit of connections` is reported. This is because there is no limit on the number of Arrow Flight connections for a single user, which is less than `max_user_connections` in `UserProperty`, which is 100 by default. You can modify the current maximum number of connections for Billie user to 100 by `SET PROPERTY FOR 'Billie' 'max_user_connections' = '1000';`, or add `arrow_flight_token_cache_size=50` in `fe.conf` to limit the overall number [...] +### Doris Arrow Flight v2.1.4 and earlier versions are not perfect. It is recommended to upgrade before use. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/db-connect/arrow-flight-sql-connect.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/db-connect/arrow-flight-sql-connect.md index 21f0b1eb3f2..19b5abaaba4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/db-connect/arrow-flight-sql-connect.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/db-connect/arrow-flight-sql-connect.md @@ -451,7 +451,9 @@ try ( ### JDBC 和 Java 连接方式的选择 -对比传统的 `jdbc:mysql` 连接方式,Jdbc 和 Java 的 Arrow Flight SQL 连接方式的性能测试见 [GitHub Issue](https://github.com/apache/doris/issues/25514),这里基于测试结论给出一些使用建议。 +[JDBC/Java Arrow Flight SQL Sample](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/README.md) 是 JDBC/Java 使用 Arrow FLight SQL 的 demo,你可以使用它来测试向 Arrow Flight Server 发送查询的多种连接方法,帮助你了解如何使用 Arrow FLight SQL 并测试性能。在 [Add Arrow Flight Sql demo for Java](https://github.com/apache/doris/pull/45306) 中实现。 + +对比传统的 `jdbc:mysql` 连接方式,Jdbc 和 Java 的 Arrow Flight SQL 连接方式的性能测试见 Section 6.2 of [GitHub Issue](https://github.com/apache/doris/issues/25514),这里基于测试结论给出一些使用建议。 1. 上述三种 Java Arrow Flight SQL 连接方式的选择上,如果后续的数据分析将基于行存的数据格式,那么使用 jdbc:arrow-flight-sql,这将返回 JDBC ResultSet 格式的数据;如果后续的数据分析可以基于 Arrow 格式或其他列存数据格式,那么使用 Flight AdbcDriver 或 Flight JdbcDriver 直接返回 Arrow 格式的数据,这将避免行列转换,并可利用 Arrow 的特性加速数据解析。 @@ -471,26 +473,152 @@ Arrow Flight 官方目前没有支持 Spark 和 Flink 的计划(见 [GitHub Is 社区之前参考开源的 [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connector),在 Spark 中使用 FlightClient 连接 Doris 测试,发现 Arrow 与 Doris Block 之间数据格式转换的速度更快,是 CSV 格式与 Doris Block 之间转换速度的 10 倍,而且对 Map,Array 等复杂类型的支持更好,这是因为 Arrow 数据格式的压缩率高,传输时网络开销小。不过目前 Doris Arrow Flight 还没有实现多节点并行读取,仍是将查询结果汇总到一台 BE 节点后返回,对简单的批量导出数据而言,性能可能没有 Doris Spark Connector 快,后者支持 Tablet 级别的并行读取。如果你希望在 Spark 使用 Arrow Flight SQL 连接 Doris,可以参考开源的 [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connec [...] 
+### 支持 BI 工具 + +自 Doris v2.1.8 开始,支持 DBeaver 等 BI 工具使用 `arrow-flight-sql` 协议连接 Doris。DBeaver 使用 `arrow-flight-sql` Driver 连接 Doris 的方法参考: [how-to-use-jdbc-driver-with-dbeaver-client](https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/#h-how-to-use-jdbc-driver-with-dbeaver-client),[client-applications/clients/dbeaver/](https://docs.dremio.com/current/sonar/client-applications/clients/dbeaver/?_gl=1*1epgwh0*_gcl_au*MjUyNjE1ODM0LjE3MzQwMDExNDg)。 + +## 扩展应用 + +### 多 BE 并行返回结果 + +Doris 默认会将一个查询在所有 BE 节点上的结果汇总聚合到一个 BE 节点上,在 Mysql/JDBC 查询中 FE 会向这个汇总数据的节点请求查询结果,在 Arrow Flight SQL 查询中 FE 会将这个节点的 IP/Port 包装在 Endpoint 中返回给 ADBC Client,ADBC Client 会请求这个 Endpoint 对应的 BE 节点拉取数据。 + +如果查询只是简单的 Select 从 Doris 拉取数据,没有 Join、Sort、窗口函数等有数据 Shuffle 行为的算子,可以将查询按照 Tablet 粒度拆分,现在 Doris Spark/Flink Connector 就是用的这个方法实现并行读取数据,分为两个步骤: + 1. 执行 `explain sql`,FE 返回的查询计划中 ScanOperator 包含 Scan 的所有 Tablet ID List。 + 2. 依据上面的 Tablet ID List 将原始 SQL 拆分为多个 SQL,每个 SQL 只读取部分 Tablet,用法类似 `SELECT * FROM t1 TABLET(10001,10002) limit 1000;`,拆分后的多个 SQL 可以并行执行。参考 [Support select table sample](https://github.com/apache/doris/pull/10170)。 + +如果查询最外层是聚合,SQL 类似 `select k1, sum(k2) from xxx group by k1`,Doris v3.4 版本后,执行 `set enable_parallel_result_sink=true;` 后允许一个查询的每个 BE 节点独立返回查询结果,ADBC Client 收到 FE 返回的 Endpoint 列表后并行从多个 BE 节点拉取结果。不过注意当聚合结果很小时,多 BE 返回会增加 RPC 的压力。具体实现参考 [support parallel result sink](https://github.com/apache/doris/pull/36053)。理论上除了最外层是排序的查询,其他查询都可以支持每个 BE 节点并行返回结果,不过暂时没有这方便的需求,没有更进一步实现。 + +### 多 BE 共享同一个可供集群外部访问的 IP + +如果存在一个 Doris 集群,它的 FE 节点可以被集群外部访问,它的所有 BE 节点只可以被集群内部访问。这在使用 Mysql Client 和 JDBC 连接 Doris 执行查询是没问题的,查询结果将由 Doris FE 节点返回给。但使用 Arrow Flight SQL 连接 Doris 无法执行查询,因为 ADBC Client 需要连接 Doris BE 节点拉取查询结果,但 Doris BE 节点不允许被集群外部访问。 + +在生产环境中,很多时候不方便将 Doris BE 节点暴露到集群外。但可以为所有 Doris BE 节点增加了一层反向代理(比如 Nginx),集群外部的 Client 连接 Nginx 时会随机路由到一台 Doris BE 节点上。默认情况下,Arrow Flight SQL 查询结果会随机保存在一台 Doris BE 节点上,如果和 Nginx 随机路由的 Doris BE 节点不同,需要在 Doris BE 节点内部做一次数据转发。 + +自 Doris v2.1.8 开始,你可以在所有 Doris BE 节点的 `be.conf` 中将 `public_host` 和 `arrow_flight_sql_proxy_port` 配置成多 Doris BE 节点共享的可供集群外部访问的 IP 和端口,查询结果可以正确转发后返回 ADBC Client。 +```conf + public_host={nginx ip} + arrow_flight_sql_proxy_port={nginx port} +``` + ## FAQ -1. ARM 环境报错 `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`,如果 Linux 内核版本 <= 4.19.90,需要升级到 4.19.279 及以上,或者在低版本 Linux 内核的环境中重新编译 Doris BE,具体编译方法参考文档<docs/dev/install/source-install/compilation-arm> +1. Q: ARM 环境报错 `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`。 + + A: 如果 Linux 内核版本 <= 4.19.90,需要升级到 4.19.279 及以上,或者在低版本 Linux 内核的环境中重新编译 Doris BE,具体编译方法参考文档<docs/dev/install/source-install/compilation-arm> + + 问题原因:这是因为老版本 Linux 内核和 Arrow 存在兼容性问题,`cpp: arrow::RecordBatch::MakeEmpty()` 构造 Arrow Record Batch 时会卡住,导致 Doris BE 的 Arrow Flight Server 在 5000ms 内没有回应 Doris FE 的 Arrow Flight Server 的 RPC 请求,导致 FE 给 Client 返回 rpc timeout failed。Spark 和 Flink 读取 Doris 时也是将查询结果转换成 Arrow Record Batch 后返回,所以也存在同样的问题。 + + kylinv10 SP2 和 SP3 的 Linux 内核版本最高只有 4.19.90-24.4.v2101.ky10.aarch64,无法继续升级内核版本,只能在 kylinv10 上重新编译 Doris BE,如果使用新版本 ldb_toolchain 编译 Doris BE 后问题依然存在,可以尝试使用低版本 ldb_toolchain v0.17 编译,如果你的 ARM 环境无法连外网,华为云提供 ARM + kylinv10,阿里云提供 x86 + kylinv10 + +2. 
Q: prepared statement 传递参数报错。 + + A: 目前 `jdbc:arrow-flight-sql` 和 Java ADBC/JDBCDriver 不支持 prepared statement 传递参数,类似`select * from xxx where id=?`,将报错 `parameter ordinal 1 out of range`,这是 Arrow Flight SQL 的一个 BUG([GitHub Issue](https://github.com/apache/arrow/issues/40118))。 + +3. Q: 如何修改 `jdbc:arrow-flight-sql` 每次读取的批次大小,在某些场景下提升性能。 + + A: 通过修改`org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader`文件中`makeJdbcConfig`方法中的 `setTargetBatchSize`,默认是 1024,然后将修改后的文件保存到本地同名路径目录下,从而覆盖原文件生效。 + +4. Q: ADBC v0.10,JDBC 和 Java ADBC/JDBCDriver 不支持并行读取。 + + A: 没有实现`stmt.executePartitioned()`这个方法,只能使用原生的 FlightClient 实现并行读取多个 Endpoints, 使用方法`sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`,此外,ADBC V0.10 默认的 AdbcStatement 实际是 JdbcStatement,executeQuery 后将行存格式的 JDBC ResultSet 又重新转成的 Arrow 列存格式,预期到 ADBC 1.0.0 时 Java ADBC 将功能完善 [GitHub Issue](https://github.com/apache/arrow-adbc/issues/1490)。 + +5. Q: 在 URL 中指定 database name。 + + A: 截止 Arrow v15.0,Arrow JDBC Connector 不支持在 URL 中指定 database name,比如 `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` 中指定连接`test` database 无效,只能手动执行 SQL `use database`。Arrow v18.0 支持了在 URL 中指定 database name,但实测仍有 BUG。 + +6. Q: Python ADBC print `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant`。 + + A: 使用 Python 时忽略这个 Warning,这是 Python ADBC Client 的问题,不会影响查询。 + +7. Q: Python 报错 `grpc: received message larger than max (20748753 vs. 16777216)`。 + + A: 参考 [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) 在 Database Option 中增加 `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value`. + +8. Q: 报错 `invalid bearer token`。 + + A: 执行 `SET PROPERTY FOR 'root' 'max_user_connections' = '10000';` 修改当前用户的当前最大连接数到 10000;在 `fe.conf` 增加 qe_max_connection=30000 和 arrow_flight_token_cache_size=8000 并重启 FE。 + + ADBC Client 和 Arrow Flight Server 端之间的连接本质上是个长链接,需要在 Server 缓存 Auth Token、Connection、Session,连接创建后不会在单个查询结束时立即断开,需要 Client 发送 close() 请求后清理,但实际上 Client 经常不会发送 close 请求,所以 Auth Token、Connection、Session 会长时间在 Arrow Flight Server 上保存,默认会在 3 天后超时断开,或者在连接数超过 `arrow_flight_token_cache_size` 的限制后依据 LRU 淘汰。 + + 截止 Doris v2.1.8,Arrow Flight 连接和 Mysql/JDBC 连接使用相同的连接数限制,包括 FE 所有用户的总连接数 `qe_max_connection` 和单个用户的连接数 `UserProperty` 中的 `max_user_connections`。但默认的 `qe_max_connection` 和 `max_user_connections` 分别是 1024 和 100。Arrow Flight SQL 常用来取代使用 JDBC 的场景,但 JDBC 连接会在查询结束后立即释放,所以使用 Arrow Flight SQL 时,Doris 默认的连接数限制太小,经常导致连接数超过 `arrow_flight_token_cache_size` 的限制后将仍在被使用的连接淘汰。 + +9. Q: Java Arrow Flight SQL 读取 Datatime 类型返回时间戳,而不是格式化时间。 + + A: Java Arrow Flight SQL 读取 Datatime 类型需要自行转换时间戳,参考 [Add java parsing datetime type in arrow flight sql sample #48578](https://github.com/apache/doris/pull/48578)。用 Python Arrow Flight SQL 读取 Datatime 类型返回结果为 `2025-03-03 17:23:28Z`,而 Java Arrow Flight SQL 返回 `1740993808`。 + +10. 
Q: Java Arrow Flight JDBC Client 读取 Array 嵌套类型报错 `Configuration does not provide a mapping for array column 2`。 + + A: 参考 [`sample/arrow-flight-sql`](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/src/main/java/doris/arrowflight/demo/FlightAdbcDriver.java) 使用 JAVA ADBC Client。 + + Python ADBC Client、JAVA ADBC Client、Java JDBC DriverManager 读取 Array 嵌套类型都没问题,只有 Java Arrow Flight JDBC Client 有问题,实际上 Arrow Flight JDBC Client 的兼容性不好保证,不是 Arrow 官方开发的,由一个第三方数据库公司 Dremio 开发,之前还发现过其他兼容性问题,所以建议优先用 JAVA ADBC Client。 + +## Release Note + +> 自 Doris 2.1 开始支持 Arrow Flight SQL 协议,截止 Doris 2.1.9,以 Doris 2.1 系列版本为准列举修复的问题,Doris 3.0 系列版本自行对照。 + +### v2.1.9 + +1. 修复 Doris 数据序列化到 Arrow 的问题。 + [Fix UT DataTypeSerDeArrowTest of Array/Map/Struct/Bitmap/HLL/Decimal256 types](https://github.com/apache/doris/pull/48944) + - 读取 `Decimal256` 类型失败; + - 读取 `DatetimeV2` 类型微妙部分错误; + - 读取 `DateV2` 类型结果不正确; + - 读取 `IPV4/IPV6` 类型结果为 NULL 时报错; + +2. 修复 Doris Arrow Flight SQL 查询失败返回空结果,没有返回真实的错误信息。 + [Fix query result is empty and not return query error message](https://github.com/apache/doris/pull/45023) + +### v2.1.8 + +1. 支持 DBeaver 等 BI 工具使用 `arrow-flight-sql` 协议连接 Doris,支持正确显示元数据树。 + [Support arrow-flight-sql protocol getStreamCatalogs, getStreamSchemas, getStreamTables #46217](https://github.com/apache/doris/pull/46217)。 + +2. 支持多 BE 共享同一个可供集群外部访问的 IP 时,查询结果可以正确转发后返回 ADBC Client。 + [Arrow flight server supports data forwarding when BE uses public vip](https://github.com/apache/doris/pull/43281) + +3. 支持多个 Endpoint 并行读取。 + [Arrow Flight support multiple endpoints](https://github.com/apache/doris/pull/44286) + +4. 修复查询报错 `FE not found arrow flight schema`。 + [Fix FE not found arrow flight schema](https://github.com/apache/doris/pull/43960) + +5. 修复读取允许 NULL 的列报错 `BooleanBuilder::AppendValues`。 + [Fix Doris NULL column conversion to arrow batch](https://github.com/apache/doris/pull/43929) + +6. 修复 `show processlist` 显示重复的 Connection ID。 + [Fix arrow-flight-sql ConnectContext to use a unified ID #46284](https://github.com/apache/doris/pull/46284) + +### v2.1.7 + +1. 修复频繁打印日志 `Connection wait_timeout`。 + [Fix kill timeout FlightSqlConnection and FlightSqlConnectProcessor close](https://github.com/apache/doris/pull/41770) -问题原因:这是因为老版本 Linux 内核和 Arrow 存在兼容性问题,`cpp: arrow::RecordBatch::MakeEmpty()` 构造 Arrow Record Batch 时会卡住,导致 Doris BE 的 Arrow Flight Server 在 5000ms 内没有回应 Doris FE 的 Arrow Flight Server 的 RPC 请求,导致 FE 给 Client 返回 rpc timeout failed。Spark 和 Flink 读取 Doris 时也是将查询结果转换成 Arrow Record Batch 后返回,所以也存在同样的问题。 +2. 修复 Arrow Flight Bearer Token 过期后从 Cache 中淘汰。 + [Fix Arrow Flight bearer token cache evict after expired](https://github.com/apache/doris/pull/41754) -kylinv10 SP2 和 SP3 的 Linux 内核版本最高只有 4.19.90-24.4.v2101.ky10.aarch64,无法继续升级内核版本,只能在 kylinv10 上重新编译 Doris BE,如果使用新版本 ldb_toolchain 编译 Doris BE 后问题依然存在,可以尝试使用低版本 ldb_toolchain v0.17 编译,如果你的 ARM 环境无法连外网,华为云提供 ARM + kylinv10,阿里云提供 x86 + kylinv10 +### v2.1.6 -2. 目前 `jdbc:arrow-flight-sql` 和 Java ADBC/JDBCDriver 不支持 prepared statement 传递参数,类似`select * from xxx where id=?`,将报错 `parameter ordinal 1 out of range`,这是 Arrow Flight SQL 的一个 BUG([GitHub Issue](https://github.com/apache/arrow/issues/40118)) +1. 修复查询报错 `0.0.0.0:xxx, connection refused`。 + [Fix return result from FE Arrow Flight server error 0.0.0.0:xxx, connection refused](https://github.com/apache/doris/pull/40002) -3. 
修改 `jdbc:arrow-flight-sql` 每次读取的批次大小,在某些场景下可以提升性能,通过修改`org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader`文件中`makeJdbcConfig`方法中的 `setTargetBatchSize`,默认是 1024,然后将修改后的文件保存到本地同名路径目录下,从而覆盖原文件生效。 +2. 修复查询报错 `Reach limit of connections`。 + [Fix exceed user property max connection cause Reach limit of connections #39127](https://github.com/apache/doris/pull/39127) + + 之前的版本执行 `SET PROPERTY FOR 'root' 'max_user_connections' = '1024';` 修改当前用户的当前最大连接数到 1024,可临时规避。 -4. ADBC v0.10,JDBC 和 Java ADBC/JDBCDriver 还不支持并行读取,没有实现`stmt.executePartitioned()`这个方法,只能使用原生的 FlightClient 实现并行读取多个 Endpoints, 使用方法`sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`,此外,ADBC V0.10 默认的 AdbcStatement 实际是 JdbcStatement,executeQuery 后将行存格式的 JDBC ResultSet 又重新转成的 Arrow 列存格式,预期到 ADBC 1.0.0 时 Java ADBC 将功能完善 [GitHub Issue](https://github.com/apache/arrow-adbc/issues/1490)。 + 因为之前的版本只限制 Arrow Flight 连接数小于 `qe_max_connection/2`,`qe_max_connection` 是 fe 所有用户的总连接数,默认 1024,没有限制单个用户的 Arrow Flight 连接数小于 `UserProperty` 中的 `max_user_connections`,默认 100,所以当 Arrow Flight 连接数超过当前用户连接数上限时将报错 `Reach limit of connections`,所以需调大当前用户的 `max_user_connections`。 + + 问题详情见:[Questions](https://ask.selectdb.com/questions/D18b1/2-1-4-ban-ben-python-shi-yong-arrow-flight-sql-lian-jie-bu-hui-duan-kai-lian-jie-shu-zhan-man-da-dao-100/E1ic1?commentId=10070000000005324) -5. 截止 Arrow v15.0,Arrow JDBC Connector 不支持在 URL 中指定 database name,比如 `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` 中指定连接`test` database 无效,只能手动执行 SQL `use database`。 +3. 增加 Conf `arrow_flight_result_sink_buffer_size_rows`,支持修改单次返回的查询结果 ArrowBatch 大小,默认 4096 * 8。 + [Add config arrow_flight_result_sink_buffer_size_rows](https://github.com/apache/doris/pull/38221) -6. Doris 2.1.4 version 存在一个 Bug,读取大数据量时有几率报错,在 [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) 修复,升级 Doris 2.1.5 version 可以解决。问题详情见:[Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) +### v2.1.5 -7. `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant` 使用 Python 时忽略这个 Warning,这是 Python ADBC Client 的问题,这不会影响查询。 +1. 修复Arrow Flight SQL 查询结果为空。 + [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) -8. Python 报错 `grpc: received message larger than max (20748753 vs. 16777216)`,参考 [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) 在 Database Option 中增加 `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value`. + Doris v2.1.4 读取大数据量时有几率报错,问题详情见:[Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) -9. Doris version 2.1.7 版本之前,报错 `Reach limit of connections`,这是因为没有限制单个用户的 Arrow Flight 连接数小于 `UserProperty` 中的 `max_user_connections`,默认 100,可以通过 `SET PROPERTY FOR 'Billie' 'max_user_connections' = '1000';` 修改 Billie 用户的当前最大连接数到 100,或者在 `fe.conf` 中增加 `arrow_flight_token_cache_size=50` 来限制整体的 Arrow Flight 连接数。Doris version 2.1.7 版本之前 Arrow Flight 连接默认 3 天 超时断开,只强制连接数小于 `qe_max_connection/2`,超过时依据 lru 淘汰,`qe_max_connection` 是 fe 所有用户的总连接数,默认 1024。具体可以看 `arrow_flight_token_cache_size` 这个 co [...] 
+### v2.1.4 及之前的版本 Doris Arrow Flight 不够完善,建议升级后使用。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/db-connect/arrow-flight-sql-connect.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/db-connect/arrow-flight-sql-connect.md index df5765507da..19b5abaaba4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/db-connect/arrow-flight-sql-connect.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/db-connect/arrow-flight-sql-connect.md @@ -451,7 +451,9 @@ try ( ### JDBC 和 Java 连接方式的选择 -对比传统的 `jdbc:mysql` 连接方式,Jdbc 和 Java 的 Arrow Flight SQL 连接方式的性能测试见 [GitHub Issue](https://github.com/apache/doris/issues/25514),这里基于测试结论给出一些使用建议。 +[JDBC/Java Arrow Flight SQL Sample](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/README.md) 是 JDBC/Java 使用 Arrow FLight SQL 的 demo,你可以使用它来测试向 Arrow Flight Server 发送查询的多种连接方法,帮助你了解如何使用 Arrow FLight SQL 并测试性能。在 [Add Arrow Flight Sql demo for Java](https://github.com/apache/doris/pull/45306) 中实现。 + +对比传统的 `jdbc:mysql` 连接方式,Jdbc 和 Java 的 Arrow Flight SQL 连接方式的性能测试见 Section 6.2 of [GitHub Issue](https://github.com/apache/doris/issues/25514),这里基于测试结论给出一些使用建议。 1. 上述三种 Java Arrow Flight SQL 连接方式的选择上,如果后续的数据分析将基于行存的数据格式,那么使用 jdbc:arrow-flight-sql,这将返回 JDBC ResultSet 格式的数据;如果后续的数据分析可以基于 Arrow 格式或其他列存数据格式,那么使用 Flight AdbcDriver 或 Flight JdbcDriver 直接返回 Arrow 格式的数据,这将避免行列转换,并可利用 Arrow 的特性加速数据解析。 @@ -471,26 +473,152 @@ Arrow Flight 官方目前没有支持 Spark 和 Flink 的计划(见 [GitHub Is 社区之前参考开源的 [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connector),在 Spark 中使用 FlightClient 连接 Doris 测试,发现 Arrow 与 Doris Block 之间数据格式转换的速度更快,是 CSV 格式与 Doris Block 之间转换速度的 10 倍,而且对 Map,Array 等复杂类型的支持更好,这是因为 Arrow 数据格式的压缩率高,传输时网络开销小。不过目前 Doris Arrow Flight 还没有实现多节点并行读取,仍是将查询结果汇总到一台 BE 节点后返回,对简单的批量导出数据而言,性能可能没有 Doris Spark Connector 快,后者支持 Tablet 级别的并行读取。如果你希望在 Spark 使用 Arrow Flight SQL 连接 Doris,可以参考开源的 [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connec [...] +### 支持 BI 工具 + +自 Doris v2.1.8 开始,支持 DBeaver 等 BI 工具使用 `arrow-flight-sql` 协议连接 Doris。DBeaver 使用 `arrow-flight-sql` Driver 连接 Doris 的方法参考: [how-to-use-jdbc-driver-with-dbeaver-client](https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/#h-how-to-use-jdbc-driver-with-dbeaver-client),[client-applications/clients/dbeaver/](https://docs.dremio.com/current/sonar/client-applications/clients/dbeaver/?_gl=1*1epgwh0*_gcl_au*MjUyNjE1ODM0LjE3MzQwMDExNDg)。 + +## 扩展应用 + +### 多 BE 并行返回结果 + +Doris 默认会将一个查询在所有 BE 节点上的结果汇总聚合到一个 BE 节点上,在 Mysql/JDBC 查询中 FE 会向这个汇总数据的节点请求查询结果,在 Arrow Flight SQL 查询中 FE 会将这个节点的 IP/Port 包装在 Endpoint 中返回给 ADBC Client,ADBC Client 会请求这个 Endpoint 对应的 BE 节点拉取数据。 + +如果查询只是简单的 Select 从 Doris 拉取数据,没有 Join、Sort、窗口函数等有数据 Shuffle 行为的算子,可以将查询按照 Tablet 粒度拆分,现在 Doris Spark/Flink Connector 就是用的这个方法实现并行读取数据,分为两个步骤: + 1. 执行 `explain sql`,FE 返回的查询计划中 ScanOperator 包含 Scan 的所有 Tablet ID List。 + 2. 
依据上面的 Tablet ID List 将原始 SQL 拆分为多个 SQL,每个 SQL 只读取部分 Tablet,用法类似 `SELECT * FROM t1 TABLET(10001,10002) limit 1000;`,拆分后的多个 SQL 可以并行执行。参考 [Support select table sample](https://github.com/apache/doris/pull/10170)。 + +如果查询最外层是聚合,SQL 类似 `select k1, sum(k2) from xxx group by k1`,Doris v3.4 版本后,执行 `set enable_parallel_result_sink=true;` 后允许一个查询的每个 BE 节点独立返回查询结果,ADBC Client 收到 FE 返回的 Endpoint 列表后并行从多个 BE 节点拉取结果。不过注意当聚合结果很小时,多 BE 返回会增加 RPC 的压力。具体实现参考 [support parallel result sink](https://github.com/apache/doris/pull/36053)。理论上除了最外层是排序的查询,其他查询都可以支持每个 BE 节点并行返回结果,不过暂时没有这方便的需求,没有更进一步实现。 + +### 多 BE 共享同一个可供集群外部访问的 IP + +如果存在一个 Doris 集群,它的 FE 节点可以被集群外部访问,它的所有 BE 节点只可以被集群内部访问。这在使用 Mysql Client 和 JDBC 连接 Doris 执行查询是没问题的,查询结果将由 Doris FE 节点返回给。但使用 Arrow Flight SQL 连接 Doris 无法执行查询,因为 ADBC Client 需要连接 Doris BE 节点拉取查询结果,但 Doris BE 节点不允许被集群外部访问。 + +在生产环境中,很多时候不方便将 Doris BE 节点暴露到集群外。但可以为所有 Doris BE 节点增加了一层反向代理(比如 Nginx),集群外部的 Client 连接 Nginx 时会随机路由到一台 Doris BE 节点上。默认情况下,Arrow Flight SQL 查询结果会随机保存在一台 Doris BE 节点上,如果和 Nginx 随机路由的 Doris BE 节点不同,需要在 Doris BE 节点内部做一次数据转发。 + +自 Doris v2.1.8 开始,你可以在所有 Doris BE 节点的 `be.conf` 中将 `public_host` 和 `arrow_flight_sql_proxy_port` 配置成多 Doris BE 节点共享的可供集群外部访问的 IP 和端口,查询结果可以正确转发后返回 ADBC Client。 +```conf + public_host={nginx ip} + arrow_flight_sql_proxy_port={nginx port} +``` + ## FAQ -1. ARM 环境报错 `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`,如果 Linux 内核版本 <= 4.19.90,需要升级到 4.19.279 及以上,或者在低版本 Linux 内核的环境中重新编译 Doris BE,具体编译方法参考文档<docs/dev/install/source-install/compilation-arm> +1. Q: ARM 环境报错 `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`。 + + A: 如果 Linux 内核版本 <= 4.19.90,需要升级到 4.19.279 及以上,或者在低版本 Linux 内核的环境中重新编译 Doris BE,具体编译方法参考文档<docs/dev/install/source-install/compilation-arm> + + 问题原因:这是因为老版本 Linux 内核和 Arrow 存在兼容性问题,`cpp: arrow::RecordBatch::MakeEmpty()` 构造 Arrow Record Batch 时会卡住,导致 Doris BE 的 Arrow Flight Server 在 5000ms 内没有回应 Doris FE 的 Arrow Flight Server 的 RPC 请求,导致 FE 给 Client 返回 rpc timeout failed。Spark 和 Flink 读取 Doris 时也是将查询结果转换成 Arrow Record Batch 后返回,所以也存在同样的问题。 + + kylinv10 SP2 和 SP3 的 Linux 内核版本最高只有 4.19.90-24.4.v2101.ky10.aarch64,无法继续升级内核版本,只能在 kylinv10 上重新编译 Doris BE,如果使用新版本 ldb_toolchain 编译 Doris BE 后问题依然存在,可以尝试使用低版本 ldb_toolchain v0.17 编译,如果你的 ARM 环境无法连外网,华为云提供 ARM + kylinv10,阿里云提供 x86 + kylinv10 + +2. Q: prepared statement 传递参数报错。 + + A: 目前 `jdbc:arrow-flight-sql` 和 Java ADBC/JDBCDriver 不支持 prepared statement 传递参数,类似`select * from xxx where id=?`,将报错 `parameter ordinal 1 out of range`,这是 Arrow Flight SQL 的一个 BUG([GitHub Issue](https://github.com/apache/arrow/issues/40118))。 + +3. Q: 如何修改 `jdbc:arrow-flight-sql` 每次读取的批次大小,在某些场景下提升性能。 + + A: 通过修改`org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader`文件中`makeJdbcConfig`方法中的 `setTargetBatchSize`,默认是 1024,然后将修改后的文件保存到本地同名路径目录下,从而覆盖原文件生效。 + +4. Q: ADBC v0.10,JDBC 和 Java ADBC/JDBCDriver 不支持并行读取。 + + A: 没有实现`stmt.executePartitioned()`这个方法,只能使用原生的 FlightClient 实现并行读取多个 Endpoints, 使用方法`sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`,此外,ADBC V0.10 默认的 AdbcStatement 实际是 JdbcStatement,executeQuery 后将行存格式的 JDBC ResultSet 又重新转成的 Arrow 列存格式,预期到 ADBC 1.0.0 时 Java ADBC 将功能完善 [GitHub Issue](https://github.com/apache/arrow-adbc/issues/1490)。 + +5. 
Q: 在 URL 中指定 database name。 + + A: 截止 Arrow v15.0,Arrow JDBC Connector 不支持在 URL 中指定 database name,比如 `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` 中指定连接`test` database 无效,只能手动执行 SQL `use database`。Arrow v18.0 支持了在 URL 中指定 database name,但实测仍有 BUG。 + +6. Q: Python ADBC print `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant`。 + + A: 使用 Python 时忽略这个 Warning,这是 Python ADBC Client 的问题,不会影响查询。 + +7. Q: Python 报错 `grpc: received message larger than max (20748753 vs. 16777216)`。 + + A: 参考 [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) 在 Database Option 中增加 `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value`. + +8. Q: 报错 `invalid bearer token`。 + + A: 执行 `SET PROPERTY FOR 'root' 'max_user_connections' = '10000';` 修改当前用户的当前最大连接数到 10000;在 `fe.conf` 增加 qe_max_connection=30000 和 arrow_flight_token_cache_size=8000 并重启 FE。 + + ADBC Client 和 Arrow Flight Server 端之间的连接本质上是个长链接,需要在 Server 缓存 Auth Token、Connection、Session,连接创建后不会在单个查询结束时立即断开,需要 Client 发送 close() 请求后清理,但实际上 Client 经常不会发送 close 请求,所以 Auth Token、Connection、Session 会长时间在 Arrow Flight Server 上保存,默认会在 3 天后超时断开,或者在连接数超过 `arrow_flight_token_cache_size` 的限制后依据 LRU 淘汰。 + + 截止 Doris v2.1.8,Arrow Flight 连接和 Mysql/JDBC 连接使用相同的连接数限制,包括 FE 所有用户的总连接数 `qe_max_connection` 和单个用户的连接数 `UserProperty` 中的 `max_user_connections`。但默认的 `qe_max_connection` 和 `max_user_connections` 分别是 1024 和 100。Arrow Flight SQL 常用来取代使用 JDBC 的场景,但 JDBC 连接会在查询结束后立即释放,所以使用 Arrow Flight SQL 时,Doris 默认的连接数限制太小,经常导致连接数超过 `arrow_flight_token_cache_size` 的限制后将仍在被使用的连接淘汰。 + +9. Q: Java Arrow Flight SQL 读取 Datatime 类型返回时间戳,而不是格式化时间。 + + A: Java Arrow Flight SQL 读取 Datatime 类型需要自行转换时间戳,参考 [Add java parsing datetime type in arrow flight sql sample #48578](https://github.com/apache/doris/pull/48578)。用 Python Arrow Flight SQL 读取 Datatime 类型返回结果为 `2025-03-03 17:23:28Z`,而 Java Arrow Flight SQL 返回 `1740993808`。 + +10. Q: Java Arrow Flight JDBC Client 读取 Array 嵌套类型报错 `Configuration does not provide a mapping for array column 2`。 + + A: 参考 [`sample/arrow-flight-sql`](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/src/main/java/doris/arrowflight/demo/FlightAdbcDriver.java) 使用 JAVA ADBC Client。 + + Python ADBC Client、JAVA ADBC Client、Java JDBC DriverManager 读取 Array 嵌套类型都没问题,只有 Java Arrow Flight JDBC Client 有问题,实际上 Arrow Flight JDBC Client 的兼容性不好保证,不是 Arrow 官方开发的,由一个第三方数据库公司 Dremio 开发,之前还发现过其他兼容性问题,所以建议优先用 JAVA ADBC Client。 + +## Release Note + +> 自 Doris 2.1 开始支持 Arrow Flight SQL 协议,截止 Doris 2.1.9,以 Doris 2.1 系列版本为准列举修复的问题,Doris 3.0 系列版本自行对照。 + +### v2.1.9 + +1. 修复 Doris 数据序列化到 Arrow 的问题。 + [Fix UT DataTypeSerDeArrowTest of Array/Map/Struct/Bitmap/HLL/Decimal256 types](https://github.com/apache/doris/pull/48944) + - 读取 `Decimal256` 类型失败; + - 读取 `DatetimeV2` 类型微妙部分错误; + - 读取 `DateV2` 类型结果不正确; + - 读取 `IPV4/IPV6` 类型结果为 NULL 时报错; + +2. 修复 Doris Arrow Flight SQL 查询失败返回空结果,没有返回真实的错误信息。 + [Fix query result is empty and not return query error message](https://github.com/apache/doris/pull/45023) + +### v2.1.8 + +1. 支持 DBeaver 等 BI 工具使用 `arrow-flight-sql` 协议连接 Doris,支持正确显示元数据树。 + [Support arrow-flight-sql protocol getStreamCatalogs, getStreamSchemas, getStreamTables #46217](https://github.com/apache/doris/pull/46217)。 + +2. 支持多 BE 共享同一个可供集群外部访问的 IP 时,查询结果可以正确转发后返回 ADBC Client。 + [Arrow flight server supports data forwarding when BE uses public vip](https://github.com/apache/doris/pull/43281) + +3. 
支持多个 Endpoint 并行读取。 + [Arrow Flight support multiple endpoints](https://github.com/apache/doris/pull/44286) + +4. 修复查询报错 `FE not found arrow flight schema`。 + [Fix FE not found arrow flight schema](https://github.com/apache/doris/pull/43960) + +5. 修复读取允许 NULL 的列报错 `BooleanBuilder::AppendValues`。 + [Fix Doris NULL column conversion to arrow batch](https://github.com/apache/doris/pull/43929) + +6. 修复 `show processlist` 显示重复的 Connection ID。 + [Fix arrow-flight-sql ConnectContext to use a unified ID #46284](https://github.com/apache/doris/pull/46284) + +### v2.1.7 + +1. 修复频繁打印日志 `Connection wait_timeout`。 + [Fix kill timeout FlightSqlConnection and FlightSqlConnectProcessor close](https://github.com/apache/doris/pull/41770) -问题原因:这是因为老版本 Linux 内核和 Arrow 存在兼容性问题,`cpp: arrow::RecordBatch::MakeEmpty()` 构造 Arrow Record Batch 时会卡住,导致 Doris BE 的 Arrow Flight Server 在 5000ms 内没有回应 Doris FE 的 Arrow Flight Server 的 RPC 请求,导致 FE 给 Client 返回 rpc timeout failed。Spark 和 Flink 读取 Doris 时也是将查询结果转换成 Arrow Record Batch 后返回,所以也存在同样的问题。 +2. 修复 Arrow Flight Bearer Token 过期后从 Cache 中淘汰。 + [Fix Arrow Flight bearer token cache evict after expired](https://github.com/apache/doris/pull/41754) -kylinv10 SP2 和 SP3 的 Linux 内核版本最高只有 4.19.90-24.4.v2101.ky10.aarch64,无法继续升级内核版本,只能在 kylinv10 上重新编译 Doris BE,如果使用新版本 ldb_toolchain 编译 Doris BE 后问题依然存在,可以尝试使用低版本 ldb_toolchain v0.17 编译,如果你的 ARM 环境无法连外网,华为云提供 ARM + kylinv10,阿里云提供 x86 + kylinv10 +### v2.1.6 -2. 目前 `jdbc:arrow-flight-sql` 和 Java ADBC/JDBCDriver 不支持 prepared statement 传递参数,类似`select * from xxx where id=?`,将报错 `parameter ordinal 1 out of range`,这是 Arrow Flight SQL 的一个 BUG([GitHub Issue](https://github.com/apache/arrow/issues/40118)) +1. 修复查询报错 `0.0.0.0:xxx, connection refused`。 + [Fix return result from FE Arrow Flight server error 0.0.0.0:xxx, connection refused](https://github.com/apache/doris/pull/40002) -3. 修改 `jdbc:arrow-flight-sql` 每次读取的批次大小,在某些场景下可以提升性能,通过修改`org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader`文件中`makeJdbcConfig`方法中的 `setTargetBatchSize`,默认是 1024,然后将修改后的文件保存到本地同名路径目录下,从而覆盖原文件生效。 +2. 修复查询报错 `Reach limit of connections`。 + [Fix exceed user property max connection cause Reach limit of connections #39127](https://github.com/apache/doris/pull/39127) + + 之前的版本执行 `SET PROPERTY FOR 'root' 'max_user_connections' = '1024';` 修改当前用户的当前最大连接数到 1024,可临时规避。 -4. ADBC v0.10,JDBC 和 Java ADBC/JDBCDriver 还不支持并行读取,没有实现`stmt.executePartitioned()`这个方法,只能使用原生的 FlightClient 实现并行读取多个 Endpoints, 使用方法`sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`,此外,ADBC V0.10 默认的 AdbcStatement 实际是 JdbcStatement,executeQuery 后将行存格式的 JDBC ResultSet 又重新转成的 Arrow 列存格式,预期到 ADBC 1.0.0 时 Java ADBC 将功能完善 [GitHub Issue](https://github.com/apache/arrow-adbc/issues/1490)。 + 因为之前的版本只限制 Arrow Flight 连接数小于 `qe_max_connection/2`,`qe_max_connection` 是 fe 所有用户的总连接数,默认 1024,没有限制单个用户的 Arrow Flight 连接数小于 `UserProperty` 中的 `max_user_connections`,默认 100,所以当 Arrow Flight 连接数超过当前用户连接数上限时将报错 `Reach limit of connections`,所以需调大当前用户的 `max_user_connections`。 + + 问题详情见:[Questions](https://ask.selectdb.com/questions/D18b1/2-1-4-ban-ben-python-shi-yong-arrow-flight-sql-lian-jie-bu-hui-duan-kai-lian-jie-shu-zhan-man-da-dao-100/E1ic1?commentId=10070000000005324) -5. 截止 Arrow v15.0,Arrow JDBC Connector 不支持在 URL 中指定 database name,比如 `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` 中指定连接`test` database 无效,只能手动执行 SQL `use database`。 +3. 
增加 Conf `arrow_flight_result_sink_buffer_size_rows`,支持修改单次返回的查询结果 ArrowBatch 大小,默认 4096 * 8。 + [Add config arrow_flight_result_sink_buffer_size_rows](https://github.com/apache/doris/pull/38221) -6. Doris 2.1.4 version 存在一个 Bug,读取大数据量时有几率报错,在 [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) 这个 pr 修复,升级 Doris 2.1.5 version 可以解决。问题详情见:[Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) +### v2.1.5 -7. `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant` 使用 Python 时忽略这个 Warning,这是 Python ADBC Client 的问题,这不会影响查询。 +1. 修复Arrow Flight SQL 查询结果为空。 + [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) -8. Python 报错 `grpc: received message larger than max (20748753 vs. 16777216)`,参考 [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) 在 Database Option 中增加 `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value`. + Doris v2.1.4 读取大数据量时有几率报错,问题详情见:[Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) -9. Doris version 2.1.7 版本之前,报错 `Reach limit of connections`,这是因为没有限制单个用户的 Arrow Flight 连接数小于 `UserProperty` 中的 `max_user_connections`,默认 100,可以通过 `SET PROPERTY FOR 'Billie' 'max_user_connections' = '1000';` 修改 Billie 用户的当前最大连接数到 100,或者在 `fe.conf` 中增加 `arrow_flight_token_cache_size=50` 来限制整体的 Arrow Flight 连接数。Doris version 2.1.7 版本之前 Arrow Flight 连接默认 3 天 超时断开,只强制连接数小于 `qe_max_connection/2`,超过时依据 lru 淘汰,`qe_max_connection` 是 fe 所有用户的总连接数,默认 1024。具体可以看 `arrow_flight_token_cache_size` 这个 co [...] +### v2.1.4 及之前的版本 Doris Arrow Flight 不够完善,建议升级后使用。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/db-connect/arrow-flight-sql-connect.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/db-connect/arrow-flight-sql-connect.md index df5765507da..19b5abaaba4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/db-connect/arrow-flight-sql-connect.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/db-connect/arrow-flight-sql-connect.md @@ -451,7 +451,9 @@ try ( ### JDBC 和 Java 连接方式的选择 -对比传统的 `jdbc:mysql` 连接方式,Jdbc 和 Java 的 Arrow Flight SQL 连接方式的性能测试见 [GitHub Issue](https://github.com/apache/doris/issues/25514),这里基于测试结论给出一些使用建议。 +[JDBC/Java Arrow Flight SQL Sample](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/README.md) 是 JDBC/Java 使用 Arrow FLight SQL 的 demo,你可以使用它来测试向 Arrow Flight Server 发送查询的多种连接方法,帮助你了解如何使用 Arrow FLight SQL 并测试性能。在 [Add Arrow Flight Sql demo for Java](https://github.com/apache/doris/pull/45306) 中实现。 + +对比传统的 `jdbc:mysql` 连接方式,Jdbc 和 Java 的 Arrow Flight SQL 连接方式的性能测试见 Section 6.2 of [GitHub Issue](https://github.com/apache/doris/issues/25514),这里基于测试结论给出一些使用建议。 1. 
上述三种 Java Arrow Flight SQL 连接方式的选择上,如果后续的数据分析将基于行存的数据格式,那么使用 jdbc:arrow-flight-sql,这将返回 JDBC ResultSet 格式的数据;如果后续的数据分析可以基于 Arrow 格式或其他列存数据格式,那么使用 Flight AdbcDriver 或 Flight JdbcDriver 直接返回 Arrow 格式的数据,这将避免行列转换,并可利用 Arrow 的特性加速数据解析。 @@ -471,26 +473,152 @@ Arrow Flight 官方目前没有支持 Spark 和 Flink 的计划(见 [GitHub Is 社区之前参考开源的 [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connector),在 Spark 中使用 FlightClient 连接 Doris 测试,发现 Arrow 与 Doris Block 之间数据格式转换的速度更快,是 CSV 格式与 Doris Block 之间转换速度的 10 倍,而且对 Map,Array 等复杂类型的支持更好,这是因为 Arrow 数据格式的压缩率高,传输时网络开销小。不过目前 Doris Arrow Flight 还没有实现多节点并行读取,仍是将查询结果汇总到一台 BE 节点后返回,对简单的批量导出数据而言,性能可能没有 Doris Spark Connector 快,后者支持 Tablet 级别的并行读取。如果你希望在 Spark 使用 Arrow Flight SQL 连接 Doris,可以参考开源的 [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connec [...] +### 支持 BI 工具 + +自 Doris v2.1.8 开始,支持 DBeaver 等 BI 工具使用 `arrow-flight-sql` 协议连接 Doris。DBeaver 使用 `arrow-flight-sql` Driver 连接 Doris 的方法参考: [how-to-use-jdbc-driver-with-dbeaver-client](https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/#h-how-to-use-jdbc-driver-with-dbeaver-client),[client-applications/clients/dbeaver/](https://docs.dremio.com/current/sonar/client-applications/clients/dbeaver/?_gl=1*1epgwh0*_gcl_au*MjUyNjE1ODM0LjE3MzQwMDExNDg)。 + +## 扩展应用 + +### 多 BE 并行返回结果 + +Doris 默认会将一个查询在所有 BE 节点上的结果汇总聚合到一个 BE 节点上,在 Mysql/JDBC 查询中 FE 会向这个汇总数据的节点请求查询结果,在 Arrow Flight SQL 查询中 FE 会将这个节点的 IP/Port 包装在 Endpoint 中返回给 ADBC Client,ADBC Client 会请求这个 Endpoint 对应的 BE 节点拉取数据。 + +如果查询只是简单的 Select 从 Doris 拉取数据,没有 Join、Sort、窗口函数等有数据 Shuffle 行为的算子,可以将查询按照 Tablet 粒度拆分,现在 Doris Spark/Flink Connector 就是用的这个方法实现并行读取数据,分为两个步骤: + 1. 执行 `explain sql`,FE 返回的查询计划中 ScanOperator 包含 Scan 的所有 Tablet ID List。 + 2. 依据上面的 Tablet ID List 将原始 SQL 拆分为多个 SQL,每个 SQL 只读取部分 Tablet,用法类似 `SELECT * FROM t1 TABLET(10001,10002) limit 1000;`,拆分后的多个 SQL 可以并行执行。参考 [Support select table sample](https://github.com/apache/doris/pull/10170)。 + +如果查询最外层是聚合,SQL 类似 `select k1, sum(k2) from xxx group by k1`,Doris v3.4 版本后,执行 `set enable_parallel_result_sink=true;` 后允许一个查询的每个 BE 节点独立返回查询结果,ADBC Client 收到 FE 返回的 Endpoint 列表后并行从多个 BE 节点拉取结果。不过注意当聚合结果很小时,多 BE 返回会增加 RPC 的压力。具体实现参考 [support parallel result sink](https://github.com/apache/doris/pull/36053)。理论上除了最外层是排序的查询,其他查询都可以支持每个 BE 节点并行返回结果,不过暂时没有这方便的需求,没有更进一步实现。 + +### 多 BE 共享同一个可供集群外部访问的 IP + +如果存在一个 Doris 集群,它的 FE 节点可以被集群外部访问,它的所有 BE 节点只可以被集群内部访问。这在使用 Mysql Client 和 JDBC 连接 Doris 执行查询是没问题的,查询结果将由 Doris FE 节点返回给。但使用 Arrow Flight SQL 连接 Doris 无法执行查询,因为 ADBC Client 需要连接 Doris BE 节点拉取查询结果,但 Doris BE 节点不允许被集群外部访问。 + +在生产环境中,很多时候不方便将 Doris BE 节点暴露到集群外。但可以为所有 Doris BE 节点增加了一层反向代理(比如 Nginx),集群外部的 Client 连接 Nginx 时会随机路由到一台 Doris BE 节点上。默认情况下,Arrow Flight SQL 查询结果会随机保存在一台 Doris BE 节点上,如果和 Nginx 随机路由的 Doris BE 节点不同,需要在 Doris BE 节点内部做一次数据转发。 + +自 Doris v2.1.8 开始,你可以在所有 Doris BE 节点的 `be.conf` 中将 `public_host` 和 `arrow_flight_sql_proxy_port` 配置成多 Doris BE 节点共享的可供集群外部访问的 IP 和端口,查询结果可以正确转发后返回 ADBC Client。 +```conf + public_host={nginx ip} + arrow_flight_sql_proxy_port={nginx port} +``` + ## FAQ -1. ARM 环境报错 `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`,如果 Linux 内核版本 <= 4.19.90,需要升级到 4.19.279 及以上,或者在低版本 Linux 内核的环境中重新编译 Doris BE,具体编译方法参考文档<docs/dev/install/source-install/compilation-arm> +1. 
Q: ARM 环境报错 `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`。 + + A: 如果 Linux 内核版本 <= 4.19.90,需要升级到 4.19.279 及以上,或者在低版本 Linux 内核的环境中重新编译 Doris BE,具体编译方法参考文档<docs/dev/install/source-install/compilation-arm> + + 问题原因:这是因为老版本 Linux 内核和 Arrow 存在兼容性问题,`cpp: arrow::RecordBatch::MakeEmpty()` 构造 Arrow Record Batch 时会卡住,导致 Doris BE 的 Arrow Flight Server 在 5000ms 内没有回应 Doris FE 的 Arrow Flight Server 的 RPC 请求,导致 FE 给 Client 返回 rpc timeout failed。Spark 和 Flink 读取 Doris 时也是将查询结果转换成 Arrow Record Batch 后返回,所以也存在同样的问题。 + + kylinv10 SP2 和 SP3 的 Linux 内核版本最高只有 4.19.90-24.4.v2101.ky10.aarch64,无法继续升级内核版本,只能在 kylinv10 上重新编译 Doris BE,如果使用新版本 ldb_toolchain 编译 Doris BE 后问题依然存在,可以尝试使用低版本 ldb_toolchain v0.17 编译,如果你的 ARM 环境无法连外网,华为云提供 ARM + kylinv10,阿里云提供 x86 + kylinv10 + +2. Q: prepared statement 传递参数报错。 + + A: 目前 `jdbc:arrow-flight-sql` 和 Java ADBC/JDBCDriver 不支持 prepared statement 传递参数,类似`select * from xxx where id=?`,将报错 `parameter ordinal 1 out of range`,这是 Arrow Flight SQL 的一个 BUG([GitHub Issue](https://github.com/apache/arrow/issues/40118))。 + +3. Q: 如何修改 `jdbc:arrow-flight-sql` 每次读取的批次大小,在某些场景下提升性能。 + + A: 通过修改`org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader`文件中`makeJdbcConfig`方法中的 `setTargetBatchSize`,默认是 1024,然后将修改后的文件保存到本地同名路径目录下,从而覆盖原文件生效。 + +4. Q: ADBC v0.10,JDBC 和 Java ADBC/JDBCDriver 不支持并行读取。 + + A: 没有实现`stmt.executePartitioned()`这个方法,只能使用原生的 FlightClient 实现并行读取多个 Endpoints, 使用方法`sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`,此外,ADBC V0.10 默认的 AdbcStatement 实际是 JdbcStatement,executeQuery 后将行存格式的 JDBC ResultSet 又重新转成的 Arrow 列存格式,预期到 ADBC 1.0.0 时 Java ADBC 将功能完善 [GitHub Issue](https://github.com/apache/arrow-adbc/issues/1490)。 + +5. Q: 在 URL 中指定 database name。 + + A: 截止 Arrow v15.0,Arrow JDBC Connector 不支持在 URL 中指定 database name,比如 `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` 中指定连接`test` database 无效,只能手动执行 SQL `use database`。Arrow v18.0 支持了在 URL 中指定 database name,但实测仍有 BUG。 + +6. Q: Python ADBC print `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant`。 + + A: 使用 Python 时忽略这个 Warning,这是 Python ADBC Client 的问题,不会影响查询。 + +7. Q: Python 报错 `grpc: received message larger than max (20748753 vs. 16777216)`。 + + A: 参考 [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) 在 Database Option 中增加 `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value`. + +8. Q: 报错 `invalid bearer token`。 + + A: 执行 `SET PROPERTY FOR 'root' 'max_user_connections' = '10000';` 修改当前用户的当前最大连接数到 10000;在 `fe.conf` 增加 qe_max_connection=30000 和 arrow_flight_token_cache_size=8000 并重启 FE。 + + ADBC Client 和 Arrow Flight Server 端之间的连接本质上是个长链接,需要在 Server 缓存 Auth Token、Connection、Session,连接创建后不会在单个查询结束时立即断开,需要 Client 发送 close() 请求后清理,但实际上 Client 经常不会发送 close 请求,所以 Auth Token、Connection、Session 会长时间在 Arrow Flight Server 上保存,默认会在 3 天后超时断开,或者在连接数超过 `arrow_flight_token_cache_size` 的限制后依据 LRU 淘汰。 + + 截止 Doris v2.1.8,Arrow Flight 连接和 Mysql/JDBC 连接使用相同的连接数限制,包括 FE 所有用户的总连接数 `qe_max_connection` 和单个用户的连接数 `UserProperty` 中的 `max_user_connections`。但默认的 `qe_max_connection` 和 `max_user_connections` 分别是 1024 和 100。Arrow Flight SQL 常用来取代使用 JDBC 的场景,但 JDBC 连接会在查询结束后立即释放,所以使用 Arrow Flight SQL 时,Doris 默认的连接数限制太小,经常导致连接数超过 `arrow_flight_token_cache_size` 的限制后将仍在被使用的连接淘汰。 + +9. 
Q: Java Arrow Flight SQL 读取 Datatime 类型返回时间戳,而不是格式化时间。 + + A: Java Arrow Flight SQL 读取 Datatime 类型需要自行转换时间戳,参考 [Add java parsing datetime type in arrow flight sql sample #48578](https://github.com/apache/doris/pull/48578)。用 Python Arrow Flight SQL 读取 Datatime 类型返回结果为 `2025-03-03 17:23:28Z`,而 Java Arrow Flight SQL 返回 `1740993808`。 + +10. Q: Java Arrow Flight JDBC Client 读取 Array 嵌套类型报错 `Configuration does not provide a mapping for array column 2`。 + + A: 参考 [`sample/arrow-flight-sql`](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/src/main/java/doris/arrowflight/demo/FlightAdbcDriver.java) 使用 JAVA ADBC Client。 + + Python ADBC Client、JAVA ADBC Client、Java JDBC DriverManager 读取 Array 嵌套类型都没问题,只有 Java Arrow Flight JDBC Client 有问题,实际上 Arrow Flight JDBC Client 的兼容性不好保证,不是 Arrow 官方开发的,由一个第三方数据库公司 Dremio 开发,之前还发现过其他兼容性问题,所以建议优先用 JAVA ADBC Client。 + +## Release Note + +> 自 Doris 2.1 开始支持 Arrow Flight SQL 协议,截止 Doris 2.1.9,以 Doris 2.1 系列版本为准列举修复的问题,Doris 3.0 系列版本自行对照。 + +### v2.1.9 + +1. 修复 Doris 数据序列化到 Arrow 的问题。 + [Fix UT DataTypeSerDeArrowTest of Array/Map/Struct/Bitmap/HLL/Decimal256 types](https://github.com/apache/doris/pull/48944) + - 读取 `Decimal256` 类型失败; + - 读取 `DatetimeV2` 类型微妙部分错误; + - 读取 `DateV2` 类型结果不正确; + - 读取 `IPV4/IPV6` 类型结果为 NULL 时报错; + +2. 修复 Doris Arrow Flight SQL 查询失败返回空结果,没有返回真实的错误信息。 + [Fix query result is empty and not return query error message](https://github.com/apache/doris/pull/45023) + +### v2.1.8 + +1. 支持 DBeaver 等 BI 工具使用 `arrow-flight-sql` 协议连接 Doris,支持正确显示元数据树。 + [Support arrow-flight-sql protocol getStreamCatalogs, getStreamSchemas, getStreamTables #46217](https://github.com/apache/doris/pull/46217)。 + +2. 支持多 BE 共享同一个可供集群外部访问的 IP 时,查询结果可以正确转发后返回 ADBC Client。 + [Arrow flight server supports data forwarding when BE uses public vip](https://github.com/apache/doris/pull/43281) + +3. 支持多个 Endpoint 并行读取。 + [Arrow Flight support multiple endpoints](https://github.com/apache/doris/pull/44286) + +4. 修复查询报错 `FE not found arrow flight schema`。 + [Fix FE not found arrow flight schema](https://github.com/apache/doris/pull/43960) + +5. 修复读取允许 NULL 的列报错 `BooleanBuilder::AppendValues`。 + [Fix Doris NULL column conversion to arrow batch](https://github.com/apache/doris/pull/43929) + +6. 修复 `show processlist` 显示重复的 Connection ID。 + [Fix arrow-flight-sql ConnectContext to use a unified ID #46284](https://github.com/apache/doris/pull/46284) + +### v2.1.7 + +1. 修复频繁打印日志 `Connection wait_timeout`。 + [Fix kill timeout FlightSqlConnection and FlightSqlConnectProcessor close](https://github.com/apache/doris/pull/41770) -问题原因:这是因为老版本 Linux 内核和 Arrow 存在兼容性问题,`cpp: arrow::RecordBatch::MakeEmpty()` 构造 Arrow Record Batch 时会卡住,导致 Doris BE 的 Arrow Flight Server 在 5000ms 内没有回应 Doris FE 的 Arrow Flight Server 的 RPC 请求,导致 FE 给 Client 返回 rpc timeout failed。Spark 和 Flink 读取 Doris 时也是将查询结果转换成 Arrow Record Batch 后返回,所以也存在同样的问题。 +2. 修复 Arrow Flight Bearer Token 过期后从 Cache 中淘汰。 + [Fix Arrow Flight bearer token cache evict after expired](https://github.com/apache/doris/pull/41754) -kylinv10 SP2 和 SP3 的 Linux 内核版本最高只有 4.19.90-24.4.v2101.ky10.aarch64,无法继续升级内核版本,只能在 kylinv10 上重新编译 Doris BE,如果使用新版本 ldb_toolchain 编译 Doris BE 后问题依然存在,可以尝试使用低版本 ldb_toolchain v0.17 编译,如果你的 ARM 环境无法连外网,华为云提供 ARM + kylinv10,阿里云提供 x86 + kylinv10 +### v2.1.6 -2. 目前 `jdbc:arrow-flight-sql` 和 Java ADBC/JDBCDriver 不支持 prepared statement 传递参数,类似`select * from xxx where id=?`,将报错 `parameter ordinal 1 out of range`,这是 Arrow Flight SQL 的一个 BUG([GitHub Issue](https://github.com/apache/arrow/issues/40118)) +1. 
修复查询报错 `0.0.0.0:xxx, connection refused`。 + [Fix return result from FE Arrow Flight server error 0.0.0.0:xxx, connection refused](https://github.com/apache/doris/pull/40002) -3. 修改 `jdbc:arrow-flight-sql` 每次读取的批次大小,在某些场景下可以提升性能,通过修改`org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader`文件中`makeJdbcConfig`方法中的 `setTargetBatchSize`,默认是 1024,然后将修改后的文件保存到本地同名路径目录下,从而覆盖原文件生效。 +2. 修复查询报错 `Reach limit of connections`。 + [Fix exceed user property max connection cause Reach limit of connections #39127](https://github.com/apache/doris/pull/39127) + + 之前的版本执行 `SET PROPERTY FOR 'root' 'max_user_connections' = '1024';` 修改当前用户的当前最大连接数到 1024,可临时规避。 -4. ADBC v0.10,JDBC 和 Java ADBC/JDBCDriver 还不支持并行读取,没有实现`stmt.executePartitioned()`这个方法,只能使用原生的 FlightClient 实现并行读取多个 Endpoints, 使用方法`sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`,此外,ADBC V0.10 默认的 AdbcStatement 实际是 JdbcStatement,executeQuery 后将行存格式的 JDBC ResultSet 又重新转成的 Arrow 列存格式,预期到 ADBC 1.0.0 时 Java ADBC 将功能完善 [GitHub Issue](https://github.com/apache/arrow-adbc/issues/1490)。 + 因为之前的版本只限制 Arrow Flight 连接数小于 `qe_max_connection/2`,`qe_max_connection` 是 fe 所有用户的总连接数,默认 1024,没有限制单个用户的 Arrow Flight 连接数小于 `UserProperty` 中的 `max_user_connections`,默认 100,所以当 Arrow Flight 连接数超过当前用户连接数上限时将报错 `Reach limit of connections`,所以需调大当前用户的 `max_user_connections`。 + + 问题详情见:[Questions](https://ask.selectdb.com/questions/D18b1/2-1-4-ban-ben-python-shi-yong-arrow-flight-sql-lian-jie-bu-hui-duan-kai-lian-jie-shu-zhan-man-da-dao-100/E1ic1?commentId=10070000000005324) -5. 截止 Arrow v15.0,Arrow JDBC Connector 不支持在 URL 中指定 database name,比如 `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` 中指定连接`test` database 无效,只能手动执行 SQL `use database`。 +3. 增加 Conf `arrow_flight_result_sink_buffer_size_rows`,支持修改单次返回的查询结果 ArrowBatch 大小,默认 4096 * 8。 + [Add config arrow_flight_result_sink_buffer_size_rows](https://github.com/apache/doris/pull/38221) -6. Doris 2.1.4 version 存在一个 Bug,读取大数据量时有几率报错,在 [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) 这个 pr 修复,升级 Doris 2.1.5 version 可以解决。问题详情见:[Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) +### v2.1.5 -7. `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant` 使用 Python 时忽略这个 Warning,这是 Python ADBC Client 的问题,这不会影响查询。 +1. 修复Arrow Flight SQL 查询结果为空。 + [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) -8. Python 报错 `grpc: received message larger than max (20748753 vs. 16777216)`,参考 [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) 在 Database Option 中增加 `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value`. + Doris v2.1.4 读取大数据量时有几率报错,问题详情见:[Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) -9. 
Doris version 2.1.7 版本之前,报错 `Reach limit of connections`,这是因为没有限制单个用户的 Arrow Flight 连接数小于 `UserProperty` 中的 `max_user_connections`,默认 100,可以通过 `SET PROPERTY FOR 'Billie' 'max_user_connections' = '1000';` 修改 Billie 用户的当前最大连接数到 100,或者在 `fe.conf` 中增加 `arrow_flight_token_cache_size=50` 来限制整体的 Arrow Flight 连接数。Doris version 2.1.7 版本之前 Arrow Flight 连接默认 3 天 超时断开,只强制连接数小于 `qe_max_connection/2`,超过时依据 lru 淘汰,`qe_max_connection` 是 fe 所有用户的总连接数,默认 1024。具体可以看 `arrow_flight_token_cache_size` 这个 co [...] +### v2.1.4 及之前的版本 Doris Arrow Flight 不够完善,建议升级后使用。 diff --git a/versioned_docs/version-2.1/db-connect/arrow-flight-sql-connect.md b/versioned_docs/version-2.1/db-connect/arrow-flight-sql-connect.md index a122ff16b8d..b2a10f63c73 100644 --- a/versioned_docs/version-2.1/db-connect/arrow-flight-sql-connect.md +++ b/versioned_docs/version-2.1/db-connect/arrow-flight-sql-connect.md @@ -451,7 +451,9 @@ try ( ### Choice of Jdbc and Java connection methods -Compared with the traditional `jdbc:mysql` connection method, the performance test of the Arrow Flight SQL connection method of Jdbc and Java can be found at [GitHub Issue](https://github.com/apache/doris/issues/25514). Here are some usage suggestions based on the test conclusions. +[JDBC/Java Arrow Flight SQL Sample](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/README.md) is a JDBC/Java demo using Arrow FLight SQL. You can use it to test various connection methods for sending queries to Arrow Flight Server, helping you understand how to use Arrow FLight SQL and test performance. Implemented in [Add Arrow Flight Sql demo for Java](https://github.com/apache/doris/pull/45306). + +Compared with the traditional `jdbc:mysql` connection method, the performance test of the Arrow Flight SQL connection method of Jdbc and Java can be found in Section 6.2 of [GitHub Issue](https://github.com/apache/doris/issues/25514). Here are some usage suggestions based on the test conclusions. 1. For the above three Java Arrow Flight SQL connection methods, if the subsequent data analysis will be based on the row-based data format, then use jdbc:arrow-flight-sql, which will return data in the JDBC ResultSet format; if the subsequent data analysis can be based on the Arrow format or other column-based data formats, then use Flight AdbcDriver or Flight JdbcDriver to directly return data in the Arrow format, which will avoid row-column conversion and use the characteristics of Ar [...] @@ -471,26 +473,152 @@ Arrow Flight currently has no official plan to support Spark and Flink ([GitHub The community previously referred to the open source [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connector) and used FlightClient in Spark to connect to Doris for testing. It was found that the data format conversion between Arrow and Doris Block is faster, which is 10 times the conversion speed between CSV format and Doris Block, and it has better support for complex types such as Map and Array. This is because the Arrow data format has a high compression rate and lo [...] +### Support BI tools + +Since Doris v2.1.8, BI tools such as DBeaver are supported to connect to Doris using the `arrow-flight-sql` protocol. 
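+In such tools this generally amounts to registering the Arrow Flight SQL JDBC driver (class `org.apache.arrow.driver.jdbc.ArrowFlightJdbcDriver`, shipped in the `flight-sql-jdbc-driver` artifact) and pointing it at FE's `arrow_flight_sql_port` with a URL of the form `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}?useEncryption=false`. Treat this as a sketch only; follow the linked guides below for the exact DBeaver steps.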
For DBeaver's method of connecting to Doris using the `arrow-flight-sql` Driver, refer to: [how-to-use-jdbc-driver-with-dbeaver-client](https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/#h-how-to-use-jdbc-driver-with-dbeaver-client), [client-applications/clients/dbeaver/](https://docs.dremio.com/current/sonar/client-applications/clients/dbeaver/?_gl=1*1epgwh0*_ [...] + +## Extended Application + +### Multiple BEs return results in parallel + +Doris will aggregate the results of a query on all BE nodes to one BE node by default. In Mysql/JDBC queries, FE will request query results from this aggregated data node. In Arrow Flight SQL queries, FE will wrap the IP/Port of this node in the Endpoint and return it to ADBC Client. ADBC Client will request the BE node corresponding to this Endpoint to pull data. + +If the query is just a simple Select to pull data from Doris, without Join, Sort, Window Function and other operators with data Shuffle behavior, the query can be split according to Tablet granularity. Now Doris Spark/Flink Connector uses this method to implement parallel data reading, which is divided into two steps: +1. Execute `explain sql`, and the ScanOperator in the query plan returned by FE contains all Tablet ID Lists of Scan. +2. Split the original SQL into multiple SQLs based on the Tablet ID List above. Each SQL only reads part of the Tablet. The usage is similar to `SELECT * FROM t1 TABLET(10001,10002) limit 1000;`. The multiple SQLs after splitting can be executed in parallel. Refer to [Support select table sample](https://github.com/apache/doris/pull/10170). + +If the outermost layer of the query is aggregation, the SQL is similar to `select k1, sum(k2) from xxx group by k1`. After Doris v3.4, execute `set enable_parallel_result_sink=true;` to allow each BE node of a query to return query results independently. After receiving the Endpoint list returned by FE, ADBC Client pulls results from multiple BE nodes in parallel. However, please note that when the aggregation result is very small, returning multiple BEs will increase the pressure on R [...] + +### Multiple BEs share the same IP accessible from outside the cluster + +If there is a Doris cluster, its FE node can be accessed from outside the cluster, and all its BE nodes can only be accessed from inside the cluster. This is fine when using Mysql Client and JDBC to connect to Doris to execute queries, and the query results will be returned by the Doris FE node. However, using Arrow Flight SQL to connect to Doris cannot execute queries, because ADBC Client needs to connect to the Doris BE node to pull query results, but the Doris BE node is not allowed [...] + +In a production environment, it is often inconvenient to expose the Doris BE node outside the cluster. However, you can add a reverse proxy (such as Nginx) to all Doris BE nodes. When the client outside the cluster connects to Nginx, it will be randomly routed to a Doris BE node. By default, the Arrow Flight SQL query results will be randomly saved on a Doris BE node. If it is different from the Doris BE node randomly routed by Nginx, data forwarding is required within the Doris BE node. + +Starting from Doris v2.1.8, you can configure `public_host` and `arrow_flight_sql_proxy_port` in `be.conf` of all Doris BE nodes to the IP and port shared by multiple Doris BE nodes and accessible outside the cluster. The query results can be correctly forwarded and returned to the ADBC Client. 
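+Note that the Arrow Flight SQL port speaks gRPC (HTTP/2), so the reverse proxy has to pass gRPC traffic through (for Nginx, for example, a `grpc_pass` location or a plain TCP `stream` proxy; the exact proxy setup is an assumption of this example and depends on your environment). With the proxy in place, the `be.conf` settings below point Doris at it: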
+```conf +public_host={nginx ip} +arrow_flight_sql_proxy_port={nginx port} +``` + ## FAQ -1. ARM environment reports an error `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`. If the Linux kernel version is <= 4.19.90, you need to upgrade to 4.19.279 or above, or recompile Doris BE in the environment of the lower version of the Linux kernel. For specific compilation methods, refer to the document <docs/dev/install/source-install/compilation-arm> +1. Q: ARM environment reports an error `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`. + +A: If the Linux kernel version is <= 4.19.90, you need to upgrade to 4.19.279 or above, or recompile Doris BE in the environment of the lower version of the Linux kernel. For specific compilation methods, refer to the document <docs/dev/install/source-install/compilation-arm> + +Cause: This is because there is a compatibility issue between the old version of the Linux kernel and Arrow. `cpp: arrow::RecordBatch::MakeEmpty()` will get stuck when constructing Arrow Record Batch, causing Doris BE's Arrow Flight Server to fail to respond to Doris FE's Arrow Flight Server's RPC request within 5000ms, causing FE to return rpc timeout failed to Client. When Spark and Flink read Doris, they also convert the query results into Arrow Record Batch and return them, so the sa [...] + +The Linux kernel version of kylinv10 SP2 and SP3 is only 4.19.90-24.4.v2101.ky10.aarch64 at most. The kernel version cannot be upgraded further. Doris BE can only be recompiled on kylinv10. If the problem still exists after compiling Doris BE with the new version of ldb_toolchain, you can try to compile it with the lower version of ldb_toolchain v0.17. If your ARM environment cannot connect to the external network, Huawei Cloud provides ARM + kylinv10, and Alibaba Cloud provides x86 + kylinv10 + +2. Q: Prepared statement passes parameters and reports errors. + +A: Currently, `jdbc:arrow-flight-sql` and Java ADBC/JDBCDriver do not support prepared statement parameter passing. For example, `select * from xxx where id=?` will report an error `parameter ordinal 1 out of range`. This is a bug in Arrow Flight SQL ([GitHub Issue](https://github.com/apache/arrow/issues/40118)). + +3. Q: How to modify the batch size read by `jdbc:arrow-flight-sql` each time to improve performance in some scenarios. + +A: By modifying `setTargetBatchSize` in the `makeJdbcConfig` method in the `org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader` file, the default is 1024, and then saving the modified file to the local directory with the same path name, so as to overwrite the original file and take effect. + +4. Q: ADBC v0.10, JDBC and Java ADBC/JDBCDriver do not support parallel reading. + +A: The `stmt.executePartitioned()` method is not implemented. You can only use the native FlightClient to implement parallel reading of multiple endpoints, using the method `sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`. In addition, the default AdbcStatement of ADBC V0.10 is actually JdbcStatement. After executeQuery, the row-format JDBC ResultSet is converted back to the Arrow column format. I [...] + +5. Q: Specify the database name in the URL. + +A: As of Arrow v15.0, Arrow JDBC Connector does not support specifying database name in URL. 
For example, specifying connection to `test` database in `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` is invalid, and you can only execute SQL `use database` manually. Arrow v18.0 supports specifying database name in URL, but there are still bugs in actual testing. + +6. Q: Python ADBC prints `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant`. + +A: Ignore this Warning when using Python. This is a problem with Python ADBC Client and will not affect queries. + +7. Q: Python reports an error `grpc: received message larger than max (20748753 vs. 16777216)`. + +A: Refer to [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) and add `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value` in Database Option. + +8. Q: Error `invalid bearer token` is reported. + +A: Execute `SET PROPERTY FOR 'root' 'max_user_connections' = '10000';` to change the current maximum number of connections for the current user to 10000; add qe_max_connection=30000 and arrow_flight_token_cache_size=8000 in `fe.conf` and restart FE. + +The connection between the ADBC Client and the Arrow Flight Server is essentially a long link, which requires Auth Token, Connection, and Session to be cached on the Server. After the connection is created, it will not be disconnected immediately at the end of a single query. The Client needs to send a close() request to clean it up, but in fact, the Client often does not send a close request, so the Auth Token, Connection, and Session will be saved on the Arrow Flight Server for a lon [...] + +As of Doris v2.1.8, Arrow Flight connections and Mysql/JDBC connections use the same connection limit, including the total number of connections of all FE users `qe_max_connection` and the number of connections of a single user `max_user_connections` in `UserProperty`. But the default `qe_max_connection` and `max_user_connections` are 1024 and 100 respectively. Arrow Flight SQL is often used to replace JDBC scenarios, but the JDBC connection will be released immediately after the query e [...] + +9. Q: Java Arrow Flight SQL reads the Datatime type and returns a timestamp instead of the formatted time. + +A: Java Arrow Flight SQL needs to convert the timestamp by itself when reading the Datatime type. Refer to [Add java parsing datetime type in arrow flight sql sample #48578](https://github.com/apache/doris/pull/48578). Using Python Arrow Flight SQL to read the Datatime type returns `2025-03-03 17:23:28Z`, while Java Arrow Flight SQL returns `1740993808`. + +10. Q: Java Arrow Flight JDBC Client reports an error `Configuration does not provide a mapping for array column 2` when reading Array nested types. + +A: Refer to [`sample/arrow-flight-sql`](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/src/main/java/doris/arrowflight/demo/FlightAdbcDriver.java) to use JAVA ADBC Client. + +Python ADBC Client, JAVA ADBC Client, and Java JDBC DriverManager all have no problem reading Array nested types. Only Java Arrow Flight JDBC Client has problems. In fact, the compatibility of Arrow Flight JDBC Client is not guaranteed. It is not officially developed by Arrow, but by a third-party database company Dremio. Other compatibility issues have been found before, so it is recommended to use JAVA ADBC Client first. + +## Release Note + +> Arrow Flight SQL protocol is supported since Doris 2.1. 
As of Doris 2.1.9, the issues fixed are listed based on Doris 2.1 series versions. Doris 3.0 series versions are self-checked. + +### v2.1.9 + +1. Fix the problem of Doris data serialization to Arrow. +[Fix UT DataTypeSerDeArrowTest of Array/Map/Struct/Bitmap/HLL/Decimal256 types](https://github.com/apache/doris/pull/48944) +- Failed to read `Decimal256` type; +- Subtle error in reading `DatetimeV2` type; +- Incorrect result in reading `DateV2` type; +- Error when reading `IPV4/IPV6` type result is NULL; + +2. Fix the problem that Doris Arrow Flight SQL query fails and returns empty result, without returning real error information. +[Fix query result is empty and not return query error message](https://github.com/apache/doris/pull/45023) + +### v2.1.8 + +1. Support BI tools such as DBeaver to connect to Doris using the `arrow-flight-sql` protocol, and support the correct display of metadata trees. +[Support arrow-flight-sql protocol getStreamCatalogs, getStreamSchemas, getStreamTables #46217](https://github.com/apache/doris/pull/46217). + +2. When multiple BEs share the same IP that is accessible to the outside of the cluster, the query results can be correctly forwarded and returned to the ADBC Client. +[Arrow flight server supports data forwarding when BE uses public vip](https://github.com/apache/doris/pull/43281) + +3. Support multiple endpoints to read in parallel. +[Arrow Flight support multiple endpoints](https://github.com/apache/doris/pull/44286) + +4. Fix query error `FE not found arrow flight schema`. +[Fix FE not found arrow flight schema](https://github.com/apache/doris/pull/43960) + +5. Fix error `BooleanBuilder::AppendValues` when reading columns that allow NULL. +[Fix Doris NULL column conversion to arrow batch](https://github.com/apache/doris/pull/43929) + +6. Fix `show processlist` displays duplicate Connection IDs. +[Fix arrow-flight-sql ConnectContext to use a unified ID #46284](https://github.com/apache/doris/pull/46284) + +### v2.1.7 + +1. Fix frequent log printing `Connection wait_timeout`. +[Fix kill timeout FlightSqlConnection and FlightSqlConnectProcessor close](https://github.com/apache/doris/pull/41770) + +2. Fix Arrow Flight Bearer Token expiration from Cache. +[Fix Arrow Flight bearer token cache evict after expired](https://github.com/apache/doris/pull/41754) + +### v2.1.6 -Problem cause: This is because there is a compatibility issue between the old version of the Linux kernel and Arrow. `cpp: arrow::RecordBatch::MakeEmpty()` will get stuck when constructing Arrow Record Batch, causing Doris BE's Arrow Flight Server to fail to respond to Doris FE's Arrow Flight Server RPC request within 5000ms, causing FE to return rpc timeout failed to Client. When Spark and Flink read Doris, they also convert the query results into Arrow Record Batch and return them, so [...] +1. Fix query error `0.0.0.0:xxx, connection refused`. +[Fix return result from FE Arrow Flight server error 0.0.0.0:xxx, connection refused](https://github.com/apache/doris/pull/40002) -The Linux kernel version of kylinv10 SP2 and SP3 is only up to 4.19.90-24.4.v2101.ky10.aarch64. You cannot continue to upgrade the kernel version. You can only recompile Doris BE on kylinv10. If the problem still exists after compiling Doris BE with the new version of ldb_toolchain, you can try to compile it with the lower version of ldb_toolchain v0.17. If your ARM environment cannot connect to the Internet, Huawei Cloud provides ARM + kylinv10, and Alibaba Cloud provides x86 + kylinv10 +2. 
Fix query error `Reach limit of connections`. +[Fix exceed user property max connection cause Reach limit of connections #39127](https://github.com/apache/doris/pull/39127) -2. Currently, `jdbc:arrow-flight-sql` and Java ADBC/JDBCDriver do not support prepared statement passing parameters. For example, `select * from xxx where id=?` will report an error `parameter ordinal 1 out of range`. This is a bug in Arrow Flight SQL ([GitHub Issue](https://github.com/apache/arrow/issues/40118)) +In previous versions, execute `SET PROPERTY FOR 'root' 'max_user_connections' = '1024';` to modify the current maximum number of connections for the current user to 1024, which can be temporarily circumvented. -3. Modification `jdbc:arrow-flight-sql` The batch size of each read can improve performance in some scenarios. By modifying the `setTargetBatchSize` in the `makeJdbcConfig` method in the `org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader` file, the default is 1024, and then saving the modified file to the local directory with the same name, it will overwrite the original file to take effect. +Because the previous version only limits the number of Arrow Flight connections to less than `qe_max_connection/2`, `qe_max_connection` is the total number of connections for all fe users, the default is 1024, and does not limit the number of Arrow Flight connections for a single user to less than `max_user_connections` in `UserProperty`, the default is 100, so when the number of Arrow Flight connections exceeds the upper limit of the current user's connection number, an error `Reach lim [...] -4. ADBC v0.10, JDBC and Java ADBC/JDBCDriver do not support parallel reading, and the `stmt.executePartitioned()` method is not implemented. You can only use the native FlightClient to implement parallel reading of multiple Endpoints, using the method `sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`. In addition, the default AdbcStatement of ADBC V0.10 is actually JdbcStatement. After executeQue [...] +For details of the problem, see: [Questions](https://ask.selectdb.com/questions/D18b1/2-1-4-ban-ben-python-shi-yong-arrow-flight-sql-lian-jie-bu-hui-duan-kai-lian-jie-shu-zhan-man-da-dao-100/E1ic1?commentId=10070000000005324) -5. As of Arrow v15.0, Arrow JDBC Connector does not support specifying the database name in the URL. For example, `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` specifies that the connection to the `test` database is invalid. You can only execute the SQL `use database` manually. +3. Add Conf `arrow_flight_result_sink_buffer_size_rows` to support modifying the ArrowBatch size of query results returned in a single time, the default is 4096 * 8. +[Add config arrow_flight_result_sink_buffer_size_rows](https://github.com/apache/doris/pull/38221) -6. There is a bug in Doris 2.1.4 version. There is a chance of error when reading large amounts of data. This bug is fixed in [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) PR. Upgrading Doris 2.1.5 version can solve this problem. For details of the problem, see: [Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) +### v2.1.5 -7. `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant` Ignore this warning when using Python. 
This is a problem with the Python ADBC Client and will not affect the query. +1. Fix the problem that Arrow Flight SQL query results are empty. +[Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) -8. Python reports an error `grpc: received message larger than max (20748753 vs. 16777216)`. Refer to [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) to add `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value` in Database Option. +Doris v2.1.4 may report an error when reading large amounts of data. For details, see: [Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) -9. Before Doris version 2.1.7, the error `Reach limit of connections` is reported. This is because there is no limit on the number of Arrow Flight connections for a single user, which is less than `max_user_connections` in `UserProperty`, which is 100 by default. You can modify the current maximum number of connections for Billie user to 100 by `SET PROPERTY FOR 'Billie' 'max_user_connections' = '1000';`, or add `arrow_flight_token_cache_size=50` in `fe.conf` to limit the overall number [...] +### Doris Arrow Flight v2.1.4 and earlier versions are not perfect. It is recommended to upgrade before use. diff --git a/versioned_docs/version-3.0/db-connect/arrow-flight-sql-connect.md b/versioned_docs/version-3.0/db-connect/arrow-flight-sql-connect.md index a122ff16b8d..b2a10f63c73 100644 --- a/versioned_docs/version-3.0/db-connect/arrow-flight-sql-connect.md +++ b/versioned_docs/version-3.0/db-connect/arrow-flight-sql-connect.md @@ -451,7 +451,9 @@ try ( ### Choice of Jdbc and Java connection methods -Compared with the traditional `jdbc:mysql` connection method, the performance test of the Arrow Flight SQL connection method of Jdbc and Java can be found at [GitHub Issue](https://github.com/apache/doris/issues/25514). Here are some usage suggestions based on the test conclusions. +[JDBC/Java Arrow Flight SQL Sample](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/README.md) is a JDBC/Java demo using Arrow FLight SQL. You can use it to test various connection methods for sending queries to Arrow Flight Server, helping you understand how to use Arrow FLight SQL and test performance. Implemented in [Add Arrow Flight Sql demo for Java](https://github.com/apache/doris/pull/45306). + +Compared with the traditional `jdbc:mysql` connection method, the performance test of the Arrow Flight SQL connection method of Jdbc and Java can be found in Section 6.2 of [GitHub Issue](https://github.com/apache/doris/issues/25514). Here are some usage suggestions based on the test conclusions. 1. For the above three Java Arrow Flight SQL connection methods, if the subsequent data analysis will be based on the row-based data format, then use jdbc:arrow-flight-sql, which will return data in the JDBC ResultSet format; if the subsequent data analysis can be based on the Arrow format or other column-based data formats, then use Flight AdbcDriver or Flight JdbcDriver to directly return data in the Arrow format, which will avoid row-column conversion and use the characteristics of Ar [...] 
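+
+The same row-versus-columnar consideration applies outside Java. As a minimal Python sketch (host, port, credentials and the table name are placeholders, not values from this guide), the ADBC Flight SQL driver can hand the result back directly as an Arrow table, skipping the row-by-row ResultSet path:
+
+```python
+import adbc_driver_manager
+import adbc_driver_flightsql
+import adbc_driver_flightsql.dbapi as flight_sql
+
+# Connect to the FE Arrow Flight SQL port; user/password are placeholders.
+conn = flight_sql.connect(
+    uri="grpc://{FE_HOST}:{fe.conf:arrow_flight_sql_port}",
+    db_kwargs={
+        adbc_driver_manager.DatabaseOptions.USERNAME.value: "root",
+        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "",
+        # Optional: raise the gRPC message size limit, see the FAQ below.
+        adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value: "104857600",
+    },
+)
+cursor = conn.cursor()
+cursor.execute("SELECT * FROM test_db.test_table LIMIT 1000;")  # hypothetical table
+table = cursor.fetch_arrow_table()  # columnar Arrow data, no row/column conversion
+print(table.num_rows, table.schema)
+cursor.close()
+conn.close()
+```
+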
@@ -471,26 +473,152 @@ Arrow Flight currently has no official plan to support Spark and Flink ([GitHub The community previously referred to the open source [Spark-Flight-Connector](https://github.com/qwshen/spark-flight-connector) and used FlightClient in Spark to connect to Doris for testing. It was found that the data format conversion between Arrow and Doris Block is faster, which is 10 times the conversion speed between CSV format and Doris Block, and it has better support for complex types such as Map and Array. This is because the Arrow data format has a high compression rate and lo [...] +### Support BI tools + +Since Doris v2.1.8, BI tools such as DBeaver are supported to connect to Doris using the `arrow-flight-sql` protocol. For DBeaver's method of connecting to Doris using the `arrow-flight-sql` Driver, refer to: [how-to-use-jdbc-driver-with-dbeaver-client](https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/#h-how-to-use-jdbc-driver-with-dbeaver-client), [client-applications/clients/dbeaver/](https://docs.dremio.com/current/sonar/client-applications/clients/dbeaver/?_gl=1*1epgwh0*_ [...] + +## Extended Application + +### Multiple BEs return results in parallel + +Doris will aggregate the results of a query on all BE nodes to one BE node by default. In Mysql/JDBC queries, FE will request query results from this aggregated data node. In Arrow Flight SQL queries, FE will wrap the IP/Port of this node in the Endpoint and return it to ADBC Client. ADBC Client will request the BE node corresponding to this Endpoint to pull data. + +If the query is just a simple Select to pull data from Doris, without Join, Sort, Window Function and other operators with data Shuffle behavior, the query can be split according to Tablet granularity. Now Doris Spark/Flink Connector uses this method to implement parallel data reading, which is divided into two steps: +1. Execute `explain sql`, and the ScanOperator in the query plan returned by FE contains all Tablet ID Lists of Scan. +2. Split the original SQL into multiple SQLs based on the Tablet ID List above. Each SQL only reads part of the Tablet. The usage is similar to `SELECT * FROM t1 TABLET(10001,10002) limit 1000;`. The multiple SQLs after splitting can be executed in parallel. Refer to [Support select table sample](https://github.com/apache/doris/pull/10170). + +If the outermost layer of the query is aggregation, the SQL is similar to `select k1, sum(k2) from xxx group by k1`. After Doris v3.4, execute `set enable_parallel_result_sink=true;` to allow each BE node of a query to return query results independently. After receiving the Endpoint list returned by FE, ADBC Client pulls results from multiple BE nodes in parallel. However, please note that when the aggregation result is very small, returning multiple BEs will increase the pressure on R [...] + +### Multiple BEs share the same IP accessible from outside the cluster + +If there is a Doris cluster, its FE node can be accessed from outside the cluster, and all its BE nodes can only be accessed from inside the cluster. This is fine when using Mysql Client and JDBC to connect to Doris to execute queries, and the query results will be returned by the Doris FE node. However, using Arrow Flight SQL to connect to Doris cannot execute queries, because ADBC Client needs to connect to the Doris BE node to pull query results, but the Doris BE node is not allowed [...] + +In a production environment, it is often inconvenient to expose the Doris BE node outside the cluster. 
However, you can add a reverse proxy (such as Nginx) to all Doris BE nodes. When the client outside the cluster connects to Nginx, it will be randomly routed to a Doris BE node. By default, the Arrow Flight SQL query results will be randomly saved on a Doris BE node. If it is different from the Doris BE node randomly routed by Nginx, data forwarding is required within the Doris BE node. + +Starting from Doris v2.1.8, you can configure `public_host` and `arrow_flight_sql_proxy_port` in `be.conf` of all Doris BE nodes to the IP and port shared by multiple Doris BE nodes and accessible outside the cluster. The query results can be correctly forwarded and returned to the ADBC Client. +```conf +public_host={nginx ip} +arrow_flight_sql_proxy_port={nginx port} +``` + ## FAQ -1. ARM environment reports an error `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`. If the Linux kernel version is <= 4.19.90, you need to upgrade to 4.19.279 or above, or recompile Doris BE in the environment of the lower version of the Linux kernel. For specific compilation methods, refer to the document <docs/dev/install/source-install/compilation-arm> +1. Q: ARM environment reports an error `get flight info statement failed, arrow flight schema timeout, TimeoutException: Waited 5000 milliseconds for io.grpc.stub.Client`. + +A: If the Linux kernel version is <= 4.19.90, you need to upgrade to 4.19.279 or above, or recompile Doris BE in the environment of the lower version of the Linux kernel. For specific compilation methods, refer to the document <docs/dev/install/source-install/compilation-arm> + +Cause: This is because there is a compatibility issue between the old version of the Linux kernel and Arrow. `cpp: arrow::RecordBatch::MakeEmpty()` will get stuck when constructing Arrow Record Batch, causing Doris BE's Arrow Flight Server to fail to respond to Doris FE's Arrow Flight Server's RPC request within 5000ms, causing FE to return rpc timeout failed to Client. When Spark and Flink read Doris, they also convert the query results into Arrow Record Batch and return them, so the sa [...] + +The Linux kernel version of kylinv10 SP2 and SP3 is only 4.19.90-24.4.v2101.ky10.aarch64 at most. The kernel version cannot be upgraded further. Doris BE can only be recompiled on kylinv10. If the problem still exists after compiling Doris BE with the new version of ldb_toolchain, you can try to compile it with the lower version of ldb_toolchain v0.17. If your ARM environment cannot connect to the external network, Huawei Cloud provides ARM + kylinv10, and Alibaba Cloud provides x86 + kylinv10 + +2. Q: Prepared statement passes parameters and reports errors. + +A: Currently, `jdbc:arrow-flight-sql` and Java ADBC/JDBCDriver do not support prepared statement parameter passing. For example, `select * from xxx where id=?` will report an error `parameter ordinal 1 out of range`. This is a bug in Arrow Flight SQL ([GitHub Issue](https://github.com/apache/arrow/issues/40118)). + +3. Q: How to modify the batch size read by `jdbc:arrow-flight-sql` each time to improve performance in some scenarios. + +A: By modifying `setTargetBatchSize` in the `makeJdbcConfig` method in the `org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader` file, the default is 1024, and then saving the modified file to the local directory with the same path name, so as to overwrite the original file and take effect. + +4. 
Q: ADBC v0.10, JDBC and Java ADBC/JDBCDriver do not support parallel reading. + +A: The `stmt.executePartitioned()` method is not implemented. You can only use the native FlightClient to implement parallel reading of multiple endpoints, using the method `sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`. In addition, the default AdbcStatement of ADBC V0.10 is actually JdbcStatement. After executeQuery, the row-format JDBC ResultSet is converted back to the Arrow column format. I [...] + +5. Q: Specify the database name in the URL. + +A: As of Arrow v15.0, Arrow JDBC Connector does not support specifying database name in URL. For example, specifying connection to `test` database in `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` is invalid, and you can only execute SQL `use database` manually. Arrow v18.0 supports specifying database name in URL, but there are still bugs in actual testing. + +6. Q: Python ADBC prints `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant`. + +A: Ignore this Warning when using Python. This is a problem with Python ADBC Client and will not affect queries. + +7. Q: Python reports an error `grpc: received message larger than max (20748753 vs. 16777216)`. + +A: Refer to [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) and add `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value` in Database Option. + +8. Q: Error `invalid bearer token` is reported. + +A: Execute `SET PROPERTY FOR 'root' 'max_user_connections' = '10000';` to change the current maximum number of connections for the current user to 10000; add qe_max_connection=30000 and arrow_flight_token_cache_size=8000 in `fe.conf` and restart FE. + +The connection between the ADBC Client and the Arrow Flight Server is essentially a long link, which requires Auth Token, Connection, and Session to be cached on the Server. After the connection is created, it will not be disconnected immediately at the end of a single query. The Client needs to send a close() request to clean it up, but in fact, the Client often does not send a close request, so the Auth Token, Connection, and Session will be saved on the Arrow Flight Server for a lon [...] + +As of Doris v2.1.8, Arrow Flight connections and Mysql/JDBC connections use the same connection limit, including the total number of connections of all FE users `qe_max_connection` and the number of connections of a single user `max_user_connections` in `UserProperty`. But the default `qe_max_connection` and `max_user_connections` are 1024 and 100 respectively. Arrow Flight SQL is often used to replace JDBC scenarios, but the JDBC connection will be released immediately after the query e [...] + +9. Q: Java Arrow Flight SQL reads the Datatime type and returns a timestamp instead of the formatted time. + +A: Java Arrow Flight SQL needs to convert the timestamp by itself when reading the Datatime type. Refer to [Add java parsing datetime type in arrow flight sql sample #48578](https://github.com/apache/doris/pull/48578). Using Python Arrow Flight SQL to read the Datatime type returns `2025-03-03 17:23:28Z`, while Java Arrow Flight SQL returns `1740993808`. + +10. Q: Java Arrow Flight JDBC Client reports an error `Configuration does not provide a mapping for array column 2` when reading Array nested types. 
+ +A: Refer to [`sample/arrow-flight-sql`](https://github.com/apache/doris/blob/master/samples/arrow-flight-sql/java/src/main/java/doris/arrowflight/demo/FlightAdbcDriver.java) to use JAVA ADBC Client. + +Python ADBC Client, JAVA ADBC Client, and Java JDBC DriverManager all have no problem reading Array nested types. Only Java Arrow Flight JDBC Client has problems. In fact, the compatibility of Arrow Flight JDBC Client is not guaranteed. It is not officially developed by Arrow, but by a third-party database company Dremio. Other compatibility issues have been found before, so it is recommended to use JAVA ADBC Client first. + +## Release Note + +> Arrow Flight SQL protocol is supported since Doris 2.1. As of Doris 2.1.9, the issues fixed are listed based on Doris 2.1 series versions. Doris 3.0 series versions are self-checked. + +### v2.1.9 + +1. Fix the problem of Doris data serialization to Arrow. +[Fix UT DataTypeSerDeArrowTest of Array/Map/Struct/Bitmap/HLL/Decimal256 types](https://github.com/apache/doris/pull/48944) +- Failed to read `Decimal256` type; +- Subtle error in reading `DatetimeV2` type; +- Incorrect result in reading `DateV2` type; +- Error when reading `IPV4/IPV6` type result is NULL; + +2. Fix the problem that Doris Arrow Flight SQL query fails and returns empty result, without returning real error information. +[Fix query result is empty and not return query error message](https://github.com/apache/doris/pull/45023) + +### v2.1.8 + +1. Support BI tools such as DBeaver to connect to Doris using the `arrow-flight-sql` protocol, and support the correct display of metadata trees. +[Support arrow-flight-sql protocol getStreamCatalogs, getStreamSchemas, getStreamTables #46217](https://github.com/apache/doris/pull/46217). + +2. When multiple BEs share the same IP that is accessible to the outside of the cluster, the query results can be correctly forwarded and returned to the ADBC Client. +[Arrow flight server supports data forwarding when BE uses public vip](https://github.com/apache/doris/pull/43281) + +3. Support multiple endpoints to read in parallel. +[Arrow Flight support multiple endpoints](https://github.com/apache/doris/pull/44286) + +4. Fix query error `FE not found arrow flight schema`. +[Fix FE not found arrow flight schema](https://github.com/apache/doris/pull/43960) + +5. Fix error `BooleanBuilder::AppendValues` when reading columns that allow NULL. +[Fix Doris NULL column conversion to arrow batch](https://github.com/apache/doris/pull/43929) + +6. Fix `show processlist` displays duplicate Connection IDs. +[Fix arrow-flight-sql ConnectContext to use a unified ID #46284](https://github.com/apache/doris/pull/46284) + +### v2.1.7 + +1. Fix frequent log printing `Connection wait_timeout`. +[Fix kill timeout FlightSqlConnection and FlightSqlConnectProcessor close](https://github.com/apache/doris/pull/41770) + +2. Fix Arrow Flight Bearer Token expiration from Cache. +[Fix Arrow Flight bearer token cache evict after expired](https://github.com/apache/doris/pull/41754) + +### v2.1.6 -Problem cause: This is because there is a compatibility issue between the old version of the Linux kernel and Arrow. `cpp: arrow::RecordBatch::MakeEmpty()` will get stuck when constructing Arrow Record Batch, causing Doris BE's Arrow Flight Server to fail to respond to Doris FE's Arrow Flight Server RPC request within 5000ms, causing FE to return rpc timeout failed to Client. 
When Spark and Flink read Doris, they also convert the query results into Arrow Record Batch and return them, so [...] +1. Fix query error `0.0.0.0:xxx, connection refused`. +[Fix return result from FE Arrow Flight server error 0.0.0.0:xxx, connection refused](https://github.com/apache/doris/pull/40002) -The Linux kernel version of kylinv10 SP2 and SP3 is only up to 4.19.90-24.4.v2101.ky10.aarch64. You cannot continue to upgrade the kernel version. You can only recompile Doris BE on kylinv10. If the problem still exists after compiling Doris BE with the new version of ldb_toolchain, you can try to compile it with the lower version of ldb_toolchain v0.17. If your ARM environment cannot connect to the Internet, Huawei Cloud provides ARM + kylinv10, and Alibaba Cloud provides x86 + kylinv10 +2. Fix query error `Reach limit of connections`. +[Fix exceed user property max connection cause Reach limit of connections #39127](https://github.com/apache/doris/pull/39127) -2. Currently, `jdbc:arrow-flight-sql` and Java ADBC/JDBCDriver do not support prepared statement passing parameters. For example, `select * from xxx where id=?` will report an error `parameter ordinal 1 out of range`. This is a bug in Arrow Flight SQL ([GitHub Issue](https://github.com/apache/arrow/issues/40118)) +In previous versions, execute `SET PROPERTY FOR 'root' 'max_user_connections' = '1024';` to modify the current maximum number of connections for the current user to 1024, which can be temporarily circumvented. -3. Modification `jdbc:arrow-flight-sql` The batch size of each read can improve performance in some scenarios. By modifying the `setTargetBatchSize` in the `makeJdbcConfig` method in the `org.apache.arrow.adbc.driver.jdbc.JdbcArrowReader` file, the default is 1024, and then saving the modified file to the local directory with the same name, it will overwrite the original file to take effect. +Because the previous version only limits the number of Arrow Flight connections to less than `qe_max_connection/2`, `qe_max_connection` is the total number of connections for all fe users, the default is 1024, and does not limit the number of Arrow Flight connections for a single user to less than `max_user_connections` in `UserProperty`, the default is 100, so when the number of Arrow Flight connections exceeds the upper limit of the current user's connection number, an error `Reach lim [...] -4. ADBC v0.10, JDBC and Java ADBC/JDBCDriver do not support parallel reading, and the `stmt.executePartitioned()` method is not implemented. You can only use the native FlightClient to implement parallel reading of multiple Endpoints, using the method `sqlClient=new FlightSqlClient, execute=sqlClient.execute(sql), endpoints=execute.getEndpoints(), for(FlightEndpoint endpoint: endpoints)`. In addition, the default AdbcStatement of ADBC V0.10 is actually JdbcStatement. After executeQue [...] +For details of the problem, see: [Questions](https://ask.selectdb.com/questions/D18b1/2-1-4-ban-ben-python-shi-yong-arrow-flight-sql-lian-jie-bu-hui-duan-kai-lian-jie-shu-zhan-man-da-dao-100/E1ic1?commentId=10070000000005324) -5. As of Arrow v15.0, Arrow JDBC Connector does not support specifying the database name in the URL. For example, `jdbc:arrow-flight-sql://{FE_HOST}:{fe.conf:arrow_flight_sql_port}/test?useServerPrepStmts=false` specifies that the connection to the `test` database is invalid. You can only execute the SQL `use database` manually. +3. 
Add Conf `arrow_flight_result_sink_buffer_size_rows` to support modifying the ArrowBatch size of query results returned in a single time, the default is 4096 * 8. +[Add config arrow_flight_result_sink_buffer_size_rows](https://github.com/apache/doris/pull/38221) -6. There is a bug in Doris 2.1.4 version. There is a chance of error when reading large amounts of data. This bug is fixed in [Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) PR. Upgrading Doris 2.1.5 version can solve this problem. For details of the problem, see: [Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) +### v2.1.5 -7. `Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant` Ignore this warning when using Python. This is a problem with the Python ADBC Client and will not affect the query. +1. Fix the problem that Arrow Flight SQL query results are empty. +[Fix arrow flight result sink #36827](https://github.com/apache/doris/pull/36827) -8. Python reports an error `grpc: received message larger than max (20748753 vs. 16777216)`. Refer to [Python: grpc: received message larger than max (20748753 vs. 16777216) #2078](https://github.com/apache/arrow-adbc/issues/2078) to add `adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value` in Database Option. +Doris v2.1.4 may report an error when reading large amounts of data. For details, see: [Questions](https://ask.selectdb.com/questions/D1Ia1/arrow-flight-sql-shi-yong-python-de-adbc-driver-lian-jie-doris-zhi-xing-cha-xun-sql-du-qu-bu-dao-shu-ju) -9. Before Doris version 2.1.7, the error `Reach limit of connections` is reported. This is because there is no limit on the number of Arrow Flight connections for a single user, which is less than `max_user_connections` in `UserProperty`, which is 100 by default. You can modify the current maximum number of connections for Billie user to 100 by `SET PROPERTY FOR 'Billie' 'max_user_connections' = '1000';`, or add `arrow_flight_token_cache_size=50` in `fe.conf` to limit the overall number [...] +### Doris Arrow Flight v2.1.4 and earlier versions are not perfect. It is recommended to upgrade before use. --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org