This is an automated email from the ASF dual-hosted git repository.

dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git

The following commit(s) were added to refs/heads/master by this push:
     new 2f452f82384  [doc](load) add bigquery datasource doc (#2318)
2f452f82384 is described below

commit 2f452f82384bee1330f8cf611b64662b6c30aea2
Author: wudi <w...@selectdb.com>
AuthorDate: Thu Apr 24 19:58:38 2025 +0800

    [doc](load) add bigquery datasource doc (#2318)

    ## Versions
    - [x] dev
    - [x] 3.0
    - [x] 2.1
    - [ ] 2.0

    ## Languages
    - [x] Chinese
    - [x] English

    ## Docs Checklist
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 docs/data-operate/import/data-source/bigquery.md   | 220 +++++++++++++++++++++
 .../data-operate/import/data-source/bigquery.md    | 218 ++++++++++++++++++++
 .../data-operate/import/data-source/bigquery.md    | 218 ++++++++++++++++++++
 .../data-operate/import/data-source/bigquery.md    | 218 ++++++++++++++++++++
 sidebars.json                                      |   1 +
 static/images/data-operate/gcs_export.png          | Bin 0 -> 159797 bytes
 .../data-operate/import/data-source/bigquery.md    | 220 +++++++++++++++++++++
 .../data-operate/import/data-source/bigquery.md    | 220 +++++++++++++++++++++
 versioned_sidebars/version-2.1-sidebars.json       |   1 +
 versioned_sidebars/version-3.0-sidebars.json       |   1 +
 10 files changed, 1317 insertions(+)

diff --git a/docs/data-operate/import/data-source/bigquery.md b/docs/data-operate/import/data-source/bigquery.md
new file mode 100644
index 00000000000..417535e3902
--- /dev/null
+++ b/docs/data-operate/import/data-source/bigquery.md
@@ -0,0 +1,220 @@
+---
+{
+    "title": "BigQuery",
+    "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+When migrating data from BigQuery, object storage is usually used as an intermediate medium. The core process is as follows: first, use BigQuery's [Export](https://cloud.google.com/bigquery/docs/exporting-data) statement to export the data to GCS (Google Cloud Storage); then, use Doris's S3 Load feature to read the data from object storage and load it into Doris. For details, please refer to [S3 Load](./amazon-s3.md).
+
+## Considerations
+
+1. Before the migration, choose the Doris [Data Model](../../../table-design/data-model/overview.md), as well as the [Partitioning](../../../table-design/data-partitioning/dynamic-partitioning.md) and [Bucketing](../../../table-design/data-partitioning/data-bucketing.md) strategies, based on the BigQuery table structure. For more table creation guidance, please refer to [Load Best Practices](../load-best-practices.md).
+2. BigQuery does not support exporting JSON columns to Parquet; export such tables in JSON format instead.
+3. When exporting TIME columns, cast them to STRING in the export query (see the sketch below).
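+
+For example, a table that contains a TIME column can still be exported to Parquet by casting that column in the export query. The following is a minimal sketch only: the `test.events` table and its `event_id`, `event_time`, and `event_date` columns are hypothetical and are used purely to illustrate points 2 and 3 above; adjust the GCS URI to your own bucket.
+
+```sql
+-- Hypothetical table: test.events(event_id INT64, event_time TIME, event_date DATE).
+-- The TIME column is cast to STRING so that the Parquet export succeeds;
+-- on the Doris side it maps to a STRING/VARCHAR column, as listed in the
+-- type mapping table below.
+EXPORT DATA
+  OPTIONS (
+    uri = 'gs://mybucket/export/events/*.parquet',
+    format = 'PARQUET')
+AS (
+  SELECT
+    event_id,
+    CAST(event_time AS STRING) AS event_time,
+    event_date
+  FROM test.events
+);
+```
+
+If the table also contains JSON columns, export it with `format = 'JSON'` (newline-delimited) instead, as noted in point 2.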
+ +## Data type mapping + +| BigQuery | Doris | Comment | +| ------------------ | -------------- | -------------------- | +| Array | Array | | +| BOOLEAN | BOOLEAN | | +| DATE | DATE | | +| DATETIME/TIMESTAMP | DATETIME | | +| JSON | JSON | | +| INT64 | BIGINT | | +| NUMERIC | DECIMAL | | +| FLOAT64 | DOUBLE | | +| STRING | VARCHAR/STRING | VARCHAR maximum length is 65535 | +| STRUCT | STRUCT | | +| TIME | STRING | | +| OTHER | UNSUPPORTED | | + +## 1. Create Table + +When migrating a BigQuery table to Doris, it is necessary to create a Doris table first. + +Suppose we already have the following table and data in BigQuery. + +```SQL +CREATE OR REPLACE TABLE test.sales_data ( + order_id INT64, + customer_name STRING, + order_date DATE, + amount NUMERIC(10,2), + country STRING +) +PARTITION BY order_date + + +INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES +(1, 'Alice', '2025-04-08', 99.99, 'USA'), +(2, 'Bob', '2025-04-08', 149.50, 'Canada'), +(3, 'Charlie', '2025-04-09', 75.00, 'UK'), +(4, 'Diana', '2025-04-10', 200.00, 'Australia'); +``` + +According to this table structure, a Doris primary key partitioned table can be created. The partition field should be the same as that in BigQuery, and the table should be partitioned on a daily basis. + +```sql +CREATE TABLE `sales_data` ( + order_id INT, + order_date DATE NOT NULL, + customer_name VARCHAR(128), + amount DECIMAL(10,2), + country VARCHAR(48) +) ENGINE=OLAP +UNIQUE KEY(`order_id`,`order_date`) +PARTITION BY RANGE(`order_date`) () +DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 +PROPERTIES ( + "dynamic_partition.enable" = "true", + "dynamic_partition.time_unit" = "DAY", + "dynamic_partition.start" = "-10", + "dynamic_partition.end" = "10", + "dynamic_partition.prefix" = "p", + "dynamic_partition.buckets" = "16", + "replication_num" = "1" +); +``` + +## 2. Export BigQuery Data + +1. **Export to GCS Parquet format file through Export method** + ```sql + EXPORT DATA + OPTIONS ( + uri = 'gs://mybucket/export/sales_data/*.parquet', + format = 'PARQUET') + AS ( + SELECT * + FROM test.sales_data + ); + ``` +2. **View the exported files on GCS** + The above command will export the data of sales_data to GCS, and each partition will generate one or more files with increasing file names. For details, please refer to [exporting-data](https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files), as follows: +  + + +## 3. Load Data to Doris + +S3 Load is an asynchronous data load method. After execution, Doris actively pulls data from the data source. The data source supports object storage compatible with the S3 protocol, including ([AWS S3](./amazon-s3.md),[GCS](./google-cloud-storage.md),[AZURE](./azure-storage.md),etc) + +This method is suitable for scenarios involving large volumes of data that require asynchronous processing in the background. For data imports that need to be handled synchronously, refer to [TVF Load](./amazon-s3.md#load-with-tvf)。 + +*Note: For **Parquet/ORC format files that contain complex types (Struct/Array/Map)**, TVF Load must be used.* + +1. 
**Loading data from a single file** + + ```sql + LOAD LABEL sales_data_2025_04_08 + ( + DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") + INTO TABLE sales_data + FORMAT AS "parquet" + (order_id, order_date, customer_name, amount, country) + ) + WITH S3 + ( + "provider" = "GCP", + "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", + "s3.region" = "asia-southeast1", + "s3.access_key" = "<ak>", + "s3.secret_key" = "<sk>" + ); + ``` + +2. **Check Load Status via SHOW LOAD** + + Since S3 Load import is submitted asynchronously, you can check the status of a specific label using SHOW LOAD: + + ```yaml + mysql> show load where label = "label_sales_data_2025_04_08"\G + *************************** 1. row *************************** + JobId: 17956078 + Label: label_sales_data_2025_04_08 + State: FINISHED + Progress: 100.00% (1/1) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: NULL + CreateTime: 2025-04-10 17:50:53 + EtlStartTime: 2025-04-10 17:50:54 + EtlFinishTime: 2025-04-10 17:50:54 + LoadStartTime: 2025-04-10 17:50:54 + LoadFinishTime: 2025-04-10 17:50:54 + URL: NULL + JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} + TransactionId: 766228 + ErrorTablets: {} + User: root + Comment: + 1 row in set (0.00 sec) + ``` + +3. **Handle Load Errors** + + When there are multiple load tasks, you can use the following statement to query the dates and reasons for data load failures. + + ```yaml + mysql> show load where state='CANCELLED' and label like "label_test%"\G + *************************** 1. row *************************** + JobId: 18312384 + Label: label_test123 + State: CANCELLED + Progress: 100.00% (3/3) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel + CreateTime: 2025-04-15 17:32:59 + EtlStartTime: 2025-04-15 17:33:02 + EtlFinishTime: 2025-04-15 17:33:02 + LoadStartTime: 2025-04-15 17:33:02 + LoadFinishTime: 2025-04-15 17:33:02 + URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 + JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} + TransactionId: 769213 + ErrorTablets: {} + User: root + Comment: + ``` + + As shown in the example above, the issue is a **data quality error**(ETL_QUALITY_UNSATISFIED). To view the detailed error, you need to visit the URL provided in the result. For example, the data exceeded the defined length of the country column in the table schema: + + ```python + [root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . 
src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line []; + ``` + + For data quality errors, if you want to allow skipping erroneous records, you can set a fault tolerance rate in the Properties section of the S3 Load task. For details, refer to [Load Configuration Parameters](../../import/import-way/broker-load-manual.md#related-configurations)。 + +1. **Loading data from multiple files** + + When migrating a large volume of historical data, it is recommended to use a batch load strategy. Each batch corresponds to one or a few partitions in Doris. It is recommended to keep the data size under 100GB per batch to reduce system load and lower the cost of retries in case of load failures. + + You can refer to the script [s3_load_file_demo.sh](https://github.com/apache/doris/blob/master/samples/load/shell/s3_load_file_demo.sh), which can split the file list under the specified directory on the object storage and submit multiple S3 Load tasks to Doris in batches to achieve the effect of batch load. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/data-source/bigquery.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/data-source/bigquery.md new file mode 100644 index 00000000000..853521633cf --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/data-source/bigquery.md @@ -0,0 +1,218 @@ +--- +{ + "title": "BigQuery", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +在迁移 BigQuery 的过程中,通常需要借助对象存储作为中间媒介。核心流程如下:首先通过 BigQuery 的 [Export](https://cloud.google.com/bigquery/docs/exporting-data) 语句将数据导出到 GCS(Google Cloud Storage);再利用 Doris 的 S3 Load 功能从对象存储中读取数据并导入到 Doris 中,具体可参考 [S3 导入](./amazon-s3.md)。 + +## 注意事项 + +1. 在迁移之前,需要根据 BigQuery 的表结构选择 Doris 的[数据模型](../../../table-design/data-model/overview.md),以及[分区](../../../table-design/data-partitioning/dynamic-partitioning.md)和[分桶](../../../table-design/data-partitioning/data-bucketing.md)的策略,更多创建表策略可参考[导入最佳实践](../load-best-practices.md)。 +2. BigQuery 导出 JSON 类型时,不支持 Parquet 格式导出,可使用 JSON 格式导出。 +3. 
BigQuery 导出 Time 类型时,需要 Cast String 类型导出。 + +## 数据类型映射 + +| BigQuery | Doris | 备注 | +| ------------------ | -------------- | -------------------- | +| Array | Array | | +| BOOLEAN | BOOLEAN | | +| DATE | DATE | | +| DATETIME/TIMESTAMP | DATETIME | | +| JSON | JSON | | +| INT64 | BIGINT | | +| NUMERIC | DECIMAL | | +| FLOAT64 | DOUBLE | | +| STRING | VARCHAR/STRING | VARCHAR 长度最大 65535 | +| STRUCT | STRUCT | | +| TIME | STRING | | +| OTHER | UNSUPPORTED | | + +## 1. 创建表 + +在迁移 BigQuery 表到 Doris 中的时候,需要先创建 Doris 表。 + +假设我们在 BigQuery 中已存在如下表和数据 + +```SQL +CREATE OR REPLACE TABLE test.sales_data ( + order_id INT64, + customer_name STRING, + order_date DATE, + amount NUMERIC(10,2), + country STRING +) +PARTITION BY order_date + + +INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES +(1, 'Alice', '2025-04-08', 99.99, 'USA'), +(2, 'Bob', '2025-04-08', 149.50, 'Canada'), +(3, 'Charlie', '2025-04-09', 75.00, 'UK'), +(4, 'Diana', '2025-04-10', 200.00, 'Australia'); +``` + +根据这个表结构,可以创建 Doris 主键分区表,分区字段和 Bigquery 的分区字段一致,同时按天分区 + +```sql +CREATE TABLE `sales_data` ( + order_id INT, + order_date DATE NOT NULL, + customer_name VARCHAR(128), + amount DECIMAL(10,2), + country VARCHAR(48) +) ENGINE=OLAP +UNIQUE KEY(`order_id`,`order_date`) +PARTITION BY RANGE(`order_date`) () +DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 +PROPERTIES ( + "dynamic_partition.enable" = "true", + "dynamic_partition.time_unit" = "DAY", + "dynamic_partition.start" = "-10", + "dynamic_partition.end" = "10", + "dynamic_partition.prefix" = "p", + "dynamic_partition.buckets" = "16", + "replication_num" = "1" +); +``` + +## 2. 导出 BigQuery 数据 + +1. **通过 Export 方式导出到 GCS Parquet 格式的文件** + ```sql + EXPORT DATA + OPTIONS ( + uri = 'gs://mybucket/export/sales_data/*.parquet', + format = 'PARQUET') + AS ( + SELECT * + FROM test.sales_data + ); + ``` +2. **查看 GCS 上的导出文件** + 以上命令会将 sales_data 的数据导出到 GCS 上,并且每个分区会产生一个或多个文件,文件名递增,具体可参考[exporting-data](https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files),如下 +  + +## 3. 导入数据到 Doris + +导入使用 S3 Load 进行导入,**S3 Load 是一种异步的数据导入方式,执行后 Doris 会主动从数据源拉取数据**,数据源支持兼容 S3 协议的对象存储,包括 ([AWS S3](./amazon-s3.md),[GCS](./google-cloud-storage.md),[AZURE](./azure-storage.md)等)。 + +该方式适用于数据量大、需要后台异步处理的场景。对于需要同步处理的数据导入,可以参考 [TVF 导入](./amazon-s3.md#load-with-tvf)。 + +*注意:对于含有复杂类型(Struct/Array/Map)的 Parquet/ORC 格式文件导入,目前必须使用 TVF 导入* + +1. **导入单个文件的数据** + + ```sql + LOAD LABEL sales_data_2025_04_08 + ( + DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") + INTO TABLE sales_data + FORMAT AS "parquet" + (order_id, order_date, customer_name, amount, country) + ) + WITH S3 + ( + "provider" = "GCP", + "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", + "s3.region" = "asia-southeast1", + "s3.access_key" = "<ak>", + "s3.secret_key" = "<sk>" + ); + ``` + +2. **通过 Show Load 查看任务运行情况** + + 由于 S3Load 导入是异步提交的,所以需要通过 show load 可以查看指定 label 的导入情况: + + ```yaml + mysql> show load where label = "label_sales_data_2025_04_08"\G + *************************** 1. 
row *************************** + JobId: 17956078 + Label: label_sales_data_2025_04_08 + State: FINISHED + Progress: 100.00% (1/1) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: NULL + CreateTime: 2025-04-10 17:50:53 + EtlStartTime: 2025-04-10 17:50:54 + EtlFinishTime: 2025-04-10 17:50:54 + LoadStartTime: 2025-04-10 17:50:54 + LoadFinishTime: 2025-04-10 17:50:54 + URL: NULL + JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} + TransactionId: 766228 + ErrorTablets: {} + User: root + Comment: + 1 row in set (0.00 sec) + ``` + +3. **处理导入过程中的错误** + + 当有多个导入任务时,可以通过以下语句,查询数据导入失败的日期和原因。 + + ```yaml + mysql> show load where state='CANCELLED' and label like "label_test%"\G + *************************** 1. row *************************** + JobId: 18312384 + Label: label_test123 + State: CANCELLED + Progress: 100.00% (3/3) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel + CreateTime: 2025-04-15 17:32:59 + EtlStartTime: 2025-04-15 17:33:02 + EtlFinishTime: 2025-04-15 17:33:02 + LoadStartTime: 2025-04-15 17:33:02 + LoadFinishTime: 2025-04-15 17:33:02 + URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 + JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} + TransactionId: 769213 + ErrorTablets: {} + User: root + Comment: + ``` + + 如上面的例子是**数据质量错误**(ETL_QUALITY_UNSATISFIED),具体错误需要通过访问返回的 URL 的链接进行查看,如下是数据超过了表中的 Schema 中 country 列的实际长度: + + ```python + [root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line []; + ``` + + 同时对于数据质量的错误,如果可以允许错误数据跳过的,可以通过在 S3 Load 任务中 Properties 设置容错率,具体可参考[导入配置参数](../../import/import-way/broker-load-manual.md#related-configurations)。 + +4. 
**导入多个文件的数据** + + 当需要迁移大数据量的存量数据时,建议使用分批导入的策略。每批数据对应 Doris 的一个分区或少量几个分区,数据量建议不超过 100GB,以减轻系统压力并降低导入失败后的重试成本。 + + 可参考脚本 [s3_load_file_demo.sh](https://github.com/apache/doris/blob/master/samples/load/shell/s3_load_file_demo.sh),该脚本可以对对象存储上指定目录下的文件列表进行拆分,分批提交多个 S3 Load 任务到 Doris 中,实现批量导入的效果。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/data-source/bigquery.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/data-source/bigquery.md new file mode 100644 index 00000000000..853521633cf --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/data-source/bigquery.md @@ -0,0 +1,218 @@ +--- +{ + "title": "BigQuery", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +在迁移 BigQuery 的过程中,通常需要借助对象存储作为中间媒介。核心流程如下:首先通过 BigQuery 的 [Export](https://cloud.google.com/bigquery/docs/exporting-data) 语句将数据导出到 GCS(Google Cloud Storage);再利用 Doris 的 S3 Load 功能从对象存储中读取数据并导入到 Doris 中,具体可参考 [S3 导入](./amazon-s3.md)。 + +## 注意事项 + +1. 在迁移之前,需要根据 BigQuery 的表结构选择 Doris 的[数据模型](../../../table-design/data-model/overview.md),以及[分区](../../../table-design/data-partitioning/dynamic-partitioning.md)和[分桶](../../../table-design/data-partitioning/data-bucketing.md)的策略,更多创建表策略可参考[导入最佳实践](../load-best-practices.md)。 +2. BigQuery 导出 JSON 类型时,不支持 Parquet 格式导出,可使用 JSON 格式导出。 +3. BigQuery 导出 Time 类型时,需要 Cast String 类型导出。 + +## 数据类型映射 + +| BigQuery | Doris | 备注 | +| ------------------ | -------------- | -------------------- | +| Array | Array | | +| BOOLEAN | BOOLEAN | | +| DATE | DATE | | +| DATETIME/TIMESTAMP | DATETIME | | +| JSON | JSON | | +| INT64 | BIGINT | | +| NUMERIC | DECIMAL | | +| FLOAT64 | DOUBLE | | +| STRING | VARCHAR/STRING | VARCHAR 长度最大 65535 | +| STRUCT | STRUCT | | +| TIME | STRING | | +| OTHER | UNSUPPORTED | | + +## 1. 
创建表 + +在迁移 BigQuery 表到 Doris 中的时候,需要先创建 Doris 表。 + +假设我们在 BigQuery 中已存在如下表和数据 + +```SQL +CREATE OR REPLACE TABLE test.sales_data ( + order_id INT64, + customer_name STRING, + order_date DATE, + amount NUMERIC(10,2), + country STRING +) +PARTITION BY order_date + + +INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES +(1, 'Alice', '2025-04-08', 99.99, 'USA'), +(2, 'Bob', '2025-04-08', 149.50, 'Canada'), +(3, 'Charlie', '2025-04-09', 75.00, 'UK'), +(4, 'Diana', '2025-04-10', 200.00, 'Australia'); +``` + +根据这个表结构,可以创建 Doris 主键分区表,分区字段和 Bigquery 的分区字段一致,同时按天分区 + +```sql +CREATE TABLE `sales_data` ( + order_id INT, + order_date DATE NOT NULL, + customer_name VARCHAR(128), + amount DECIMAL(10,2), + country VARCHAR(48) +) ENGINE=OLAP +UNIQUE KEY(`order_id`,`order_date`) +PARTITION BY RANGE(`order_date`) () +DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 +PROPERTIES ( + "dynamic_partition.enable" = "true", + "dynamic_partition.time_unit" = "DAY", + "dynamic_partition.start" = "-10", + "dynamic_partition.end" = "10", + "dynamic_partition.prefix" = "p", + "dynamic_partition.buckets" = "16", + "replication_num" = "1" +); +``` + +## 2. 导出 BigQuery 数据 + +1. **通过 Export 方式导出到 GCS Parquet 格式的文件** + ```sql + EXPORT DATA + OPTIONS ( + uri = 'gs://mybucket/export/sales_data/*.parquet', + format = 'PARQUET') + AS ( + SELECT * + FROM test.sales_data + ); + ``` +2. **查看 GCS 上的导出文件** + 以上命令会将 sales_data 的数据导出到 GCS 上,并且每个分区会产生一个或多个文件,文件名递增,具体可参考[exporting-data](https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files),如下 +  + +## 3. 导入数据到 Doris + +导入使用 S3 Load 进行导入,**S3 Load 是一种异步的数据导入方式,执行后 Doris 会主动从数据源拉取数据**,数据源支持兼容 S3 协议的对象存储,包括 ([AWS S3](./amazon-s3.md),[GCS](./google-cloud-storage.md),[AZURE](./azure-storage.md)等)。 + +该方式适用于数据量大、需要后台异步处理的场景。对于需要同步处理的数据导入,可以参考 [TVF 导入](./amazon-s3.md#load-with-tvf)。 + +*注意:对于含有复杂类型(Struct/Array/Map)的 Parquet/ORC 格式文件导入,目前必须使用 TVF 导入* + +1. **导入单个文件的数据** + + ```sql + LOAD LABEL sales_data_2025_04_08 + ( + DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") + INTO TABLE sales_data + FORMAT AS "parquet" + (order_id, order_date, customer_name, amount, country) + ) + WITH S3 + ( + "provider" = "GCP", + "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", + "s3.region" = "asia-southeast1", + "s3.access_key" = "<ak>", + "s3.secret_key" = "<sk>" + ); + ``` + +2. **通过 Show Load 查看任务运行情况** + + 由于 S3Load 导入是异步提交的,所以需要通过 show load 可以查看指定 label 的导入情况: + + ```yaml + mysql> show load where label = "label_sales_data_2025_04_08"\G + *************************** 1. row *************************** + JobId: 17956078 + Label: label_sales_data_2025_04_08 + State: FINISHED + Progress: 100.00% (1/1) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: NULL + CreateTime: 2025-04-10 17:50:53 + EtlStartTime: 2025-04-10 17:50:54 + EtlFinishTime: 2025-04-10 17:50:54 + LoadStartTime: 2025-04-10 17:50:54 + LoadFinishTime: 2025-04-10 17:50:54 + URL: NULL + JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} + TransactionId: 766228 + ErrorTablets: {} + User: root + Comment: + 1 row in set (0.00 sec) + ``` + +3. 
**处理导入过程中的错误** + + 当有多个导入任务时,可以通过以下语句,查询数据导入失败的日期和原因。 + + ```yaml + mysql> show load where state='CANCELLED' and label like "label_test%"\G + *************************** 1. row *************************** + JobId: 18312384 + Label: label_test123 + State: CANCELLED + Progress: 100.00% (3/3) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel + CreateTime: 2025-04-15 17:32:59 + EtlStartTime: 2025-04-15 17:33:02 + EtlFinishTime: 2025-04-15 17:33:02 + LoadStartTime: 2025-04-15 17:33:02 + LoadFinishTime: 2025-04-15 17:33:02 + URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 + JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} + TransactionId: 769213 + ErrorTablets: {} + User: root + Comment: + ``` + + 如上面的例子是**数据质量错误**(ETL_QUALITY_UNSATISFIED),具体错误需要通过访问返回的 URL 的链接进行查看,如下是数据超过了表中的 Schema 中 country 列的实际长度: + + ```python + [root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line []; + ``` + + 同时对于数据质量的错误,如果可以允许错误数据跳过的,可以通过在 S3 Load 任务中 Properties 设置容错率,具体可参考[导入配置参数](../../import/import-way/broker-load-manual.md#related-configurations)。 + +4. **导入多个文件的数据** + + 当需要迁移大数据量的存量数据时,建议使用分批导入的策略。每批数据对应 Doris 的一个分区或少量几个分区,数据量建议不超过 100GB,以减轻系统压力并降低导入失败后的重试成本。 + + 可参考脚本 [s3_load_file_demo.sh](https://github.com/apache/doris/blob/master/samples/load/shell/s3_load_file_demo.sh),该脚本可以对对象存储上指定目录下的文件列表进行拆分,分批提交多个 S3 Load 任务到 Doris 中,实现批量导入的效果。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/data-source/bigquery.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/data-source/bigquery.md new file mode 100644 index 00000000000..853521633cf --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/data-source/bigquery.md @@ -0,0 +1,218 @@ +--- +{ + "title": "BigQuery", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. 
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +在迁移 BigQuery 的过程中,通常需要借助对象存储作为中间媒介。核心流程如下:首先通过 BigQuery 的 [Export](https://cloud.google.com/bigquery/docs/exporting-data) 语句将数据导出到 GCS(Google Cloud Storage);再利用 Doris 的 S3 Load 功能从对象存储中读取数据并导入到 Doris 中,具体可参考 [S3 导入](./amazon-s3.md)。 + +## 注意事项 + +1. 在迁移之前,需要根据 BigQuery 的表结构选择 Doris 的[数据模型](../../../table-design/data-model/overview.md),以及[分区](../../../table-design/data-partitioning/dynamic-partitioning.md)和[分桶](../../../table-design/data-partitioning/data-bucketing.md)的策略,更多创建表策略可参考[导入最佳实践](../load-best-practices.md)。 +2. BigQuery 导出 JSON 类型时,不支持 Parquet 格式导出,可使用 JSON 格式导出。 +3. BigQuery 导出 Time 类型时,需要 Cast String 类型导出。 + +## 数据类型映射 + +| BigQuery | Doris | 备注 | +| ------------------ | -------------- | -------------------- | +| Array | Array | | +| BOOLEAN | BOOLEAN | | +| DATE | DATE | | +| DATETIME/TIMESTAMP | DATETIME | | +| JSON | JSON | | +| INT64 | BIGINT | | +| NUMERIC | DECIMAL | | +| FLOAT64 | DOUBLE | | +| STRING | VARCHAR/STRING | VARCHAR 长度最大 65535 | +| STRUCT | STRUCT | | +| TIME | STRING | | +| OTHER | UNSUPPORTED | | + +## 1. 创建表 + +在迁移 BigQuery 表到 Doris 中的时候,需要先创建 Doris 表。 + +假设我们在 BigQuery 中已存在如下表和数据 + +```SQL +CREATE OR REPLACE TABLE test.sales_data ( + order_id INT64, + customer_name STRING, + order_date DATE, + amount NUMERIC(10,2), + country STRING +) +PARTITION BY order_date + + +INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES +(1, 'Alice', '2025-04-08', 99.99, 'USA'), +(2, 'Bob', '2025-04-08', 149.50, 'Canada'), +(3, 'Charlie', '2025-04-09', 75.00, 'UK'), +(4, 'Diana', '2025-04-10', 200.00, 'Australia'); +``` + +根据这个表结构,可以创建 Doris 主键分区表,分区字段和 Bigquery 的分区字段一致,同时按天分区 + +```sql +CREATE TABLE `sales_data` ( + order_id INT, + order_date DATE NOT NULL, + customer_name VARCHAR(128), + amount DECIMAL(10,2), + country VARCHAR(48) +) ENGINE=OLAP +UNIQUE KEY(`order_id`,`order_date`) +PARTITION BY RANGE(`order_date`) () +DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 +PROPERTIES ( + "dynamic_partition.enable" = "true", + "dynamic_partition.time_unit" = "DAY", + "dynamic_partition.start" = "-10", + "dynamic_partition.end" = "10", + "dynamic_partition.prefix" = "p", + "dynamic_partition.buckets" = "16", + "replication_num" = "1" +); +``` + +## 2. 导出 BigQuery 数据 + +1. **通过 Export 方式导出到 GCS Parquet 格式的文件** + ```sql + EXPORT DATA + OPTIONS ( + uri = 'gs://mybucket/export/sales_data/*.parquet', + format = 'PARQUET') + AS ( + SELECT * + FROM test.sales_data + ); + ``` +2. **查看 GCS 上的导出文件** + 以上命令会将 sales_data 的数据导出到 GCS 上,并且每个分区会产生一个或多个文件,文件名递增,具体可参考[exporting-data](https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files),如下 +  + +## 3. 导入数据到 Doris + +导入使用 S3 Load 进行导入,**S3 Load 是一种异步的数据导入方式,执行后 Doris 会主动从数据源拉取数据**,数据源支持兼容 S3 协议的对象存储,包括 ([AWS S3](./amazon-s3.md),[GCS](./google-cloud-storage.md),[AZURE](./azure-storage.md)等)。 + +该方式适用于数据量大、需要后台异步处理的场景。对于需要同步处理的数据导入,可以参考 [TVF 导入](./amazon-s3.md#load-with-tvf)。 + +*注意:对于含有复杂类型(Struct/Array/Map)的 Parquet/ORC 格式文件导入,目前必须使用 TVF 导入* + +1. 
**导入单个文件的数据** + + ```sql + LOAD LABEL sales_data_2025_04_08 + ( + DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") + INTO TABLE sales_data + FORMAT AS "parquet" + (order_id, order_date, customer_name, amount, country) + ) + WITH S3 + ( + "provider" = "GCP", + "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", + "s3.region" = "asia-southeast1", + "s3.access_key" = "<ak>", + "s3.secret_key" = "<sk>" + ); + ``` + +2. **通过 Show Load 查看任务运行情况** + + 由于 S3Load 导入是异步提交的,所以需要通过 show load 可以查看指定 label 的导入情况: + + ```yaml + mysql> show load where label = "label_sales_data_2025_04_08"\G + *************************** 1. row *************************** + JobId: 17956078 + Label: label_sales_data_2025_04_08 + State: FINISHED + Progress: 100.00% (1/1) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: NULL + CreateTime: 2025-04-10 17:50:53 + EtlStartTime: 2025-04-10 17:50:54 + EtlFinishTime: 2025-04-10 17:50:54 + LoadStartTime: 2025-04-10 17:50:54 + LoadFinishTime: 2025-04-10 17:50:54 + URL: NULL + JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} + TransactionId: 766228 + ErrorTablets: {} + User: root + Comment: + 1 row in set (0.00 sec) + ``` + +3. **处理导入过程中的错误** + + 当有多个导入任务时,可以通过以下语句,查询数据导入失败的日期和原因。 + + ```yaml + mysql> show load where state='CANCELLED' and label like "label_test%"\G + *************************** 1. row *************************** + JobId: 18312384 + Label: label_test123 + State: CANCELLED + Progress: 100.00% (3/3) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel + CreateTime: 2025-04-15 17:32:59 + EtlStartTime: 2025-04-15 17:33:02 + EtlFinishTime: 2025-04-15 17:33:02 + LoadStartTime: 2025-04-15 17:33:02 + LoadFinishTime: 2025-04-15 17:33:02 + URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 + JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} + TransactionId: 769213 + ErrorTablets: {} + User: root + Comment: + ``` + + 如上面的例子是**数据质量错误**(ETL_QUALITY_UNSATISFIED),具体错误需要通过访问返回的 URL 的链接进行查看,如下是数据超过了表中的 Schema 中 country 列的实际长度: + + ```python + [root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . 
src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line []; + ``` + + 同时对于数据质量的错误,如果可以允许错误数据跳过的,可以通过在 S3 Load 任务中 Properties 设置容错率,具体可参考[导入配置参数](../../import/import-way/broker-load-manual.md#related-configurations)。 + +4. **导入多个文件的数据** + + 当需要迁移大数据量的存量数据时,建议使用分批导入的策略。每批数据对应 Doris 的一个分区或少量几个分区,数据量建议不超过 100GB,以减轻系统压力并降低导入失败后的重试成本。 + + 可参考脚本 [s3_load_file_demo.sh](https://github.com/apache/doris/blob/master/samples/load/shell/s3_load_file_demo.sh),该脚本可以对对象存储上指定目录下的文件列表进行拆分,分批提交多个 S3 Load 任务到 Doris 中,实现批量导入的效果。 \ No newline at end of file diff --git a/sidebars.json b/sidebars.json index 827b1e838e8..97ff0f06744 100644 --- a/sidebars.json +++ b/sidebars.json @@ -158,6 +158,7 @@ "data-operate/import/data-source/minio", "data-operate/import/data-source/s3-compatible", "data-operate/import/data-source/snowflake", + "data-operate/import/data-source/bigquery", "data-operate/import/data-source/migrate-data-from-other-olap", "data-operate/import/data-source/migrate-data-from-other-oltp" ] diff --git a/static/images/data-operate/gcs_export.png b/static/images/data-operate/gcs_export.png new file mode 100644 index 00000000000..56cb84fe06a Binary files /dev/null and b/static/images/data-operate/gcs_export.png differ diff --git a/versioned_docs/version-2.1/data-operate/import/data-source/bigquery.md b/versioned_docs/version-2.1/data-operate/import/data-source/bigquery.md new file mode 100644 index 00000000000..417535e3902 --- /dev/null +++ b/versioned_docs/version-2.1/data-operate/import/data-source/bigquery.md @@ -0,0 +1,220 @@ +--- +{ + "title": "BigQuery", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +During the process of migrating BigQuery, it is usually necessary to use object storage as an intermediate medium. The core process is as follows: First, use BigQuery's [Export](https://cloud.google.com/bigquery/docs/exporting-data) statement to export data to GCS (Google Cloud Storage); then, use Doris's S3 Load function to read data from the object storage and load it into Doris. For details, please refer to [S3 Load](./amazon-s3.md). + + +## Considerations + +1. Before the migration, it is necessary to select Doris' [Data Model](../../../table-design/data-model/overview.md), as well as the strategies for [Partitioning](../../../table-design/data-partitioning/dynamic-partitioning.md) and [Bucketing](../../../table-design/data-partitioning/data-bucketing.md) according to the table structure of BigQuery. For more table creation strategies, please refer to [Load Best Practices](../load-best-practices.md). +2. When BigQuery exports data in JSON type, it does not support exporting in Parquet format. You can export it in JSON format instead. 
+3. When BigQuery exports data of the Time type, it is necessary to export it after casting it to the String type. + +## Data type mapping + +| BigQuery | Doris | Comment | +| ------------------ | -------------- | -------------------- | +| Array | Array | | +| BOOLEAN | BOOLEAN | | +| DATE | DATE | | +| DATETIME/TIMESTAMP | DATETIME | | +| JSON | JSON | | +| INT64 | BIGINT | | +| NUMERIC | DECIMAL | | +| FLOAT64 | DOUBLE | | +| STRING | VARCHAR/STRING | VARCHAR maximum length is 65535 | +| STRUCT | STRUCT | | +| TIME | STRING | | +| OTHER | UNSUPPORTED | | + +## 1. Create Table + +When migrating a BigQuery table to Doris, it is necessary to create a Doris table first. + +Suppose we already have the following table and data in BigQuery. + +```SQL +CREATE OR REPLACE TABLE test.sales_data ( + order_id INT64, + customer_name STRING, + order_date DATE, + amount NUMERIC(10,2), + country STRING +) +PARTITION BY order_date + + +INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES +(1, 'Alice', '2025-04-08', 99.99, 'USA'), +(2, 'Bob', '2025-04-08', 149.50, 'Canada'), +(3, 'Charlie', '2025-04-09', 75.00, 'UK'), +(4, 'Diana', '2025-04-10', 200.00, 'Australia'); +``` + +According to this table structure, a Doris primary key partitioned table can be created. The partition field should be the same as that in BigQuery, and the table should be partitioned on a daily basis. + +```sql +CREATE TABLE `sales_data` ( + order_id INT, + order_date DATE NOT NULL, + customer_name VARCHAR(128), + amount DECIMAL(10,2), + country VARCHAR(48) +) ENGINE=OLAP +UNIQUE KEY(`order_id`,`order_date`) +PARTITION BY RANGE(`order_date`) () +DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 +PROPERTIES ( + "dynamic_partition.enable" = "true", + "dynamic_partition.time_unit" = "DAY", + "dynamic_partition.start" = "-10", + "dynamic_partition.end" = "10", + "dynamic_partition.prefix" = "p", + "dynamic_partition.buckets" = "16", + "replication_num" = "1" +); +``` + +## 2. Export BigQuery Data + +1. **Export to GCS Parquet format file through Export method** + ```sql + EXPORT DATA + OPTIONS ( + uri = 'gs://mybucket/export/sales_data/*.parquet', + format = 'PARQUET') + AS ( + SELECT * + FROM test.sales_data + ); + ``` +2. **View the exported files on GCS** + The above command will export the data of sales_data to GCS, and each partition will generate one or more files with increasing file names. For details, please refer to [exporting-data](https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files), as follows: +  + + +## 3. Load Data to Doris + +S3 Load is an asynchronous data load method. After execution, Doris actively pulls data from the data source. The data source supports object storage compatible with the S3 protocol, including ([AWS S3](./amazon-s3.md),[GCS](./google-cloud-storage.md),[AZURE](./azure-storage.md),etc) + +This method is suitable for scenarios involving large volumes of data that require asynchronous processing in the background. For data imports that need to be handled synchronously, refer to [TVF Load](./amazon-s3.md#load-with-tvf)。 + +*Note: For **Parquet/ORC format files that contain complex types (Struct/Array/Map)**, TVF Load must be used.* + +1. 
**Loading data from a single file** + + ```sql + LOAD LABEL sales_data_2025_04_08 + ( + DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") + INTO TABLE sales_data + FORMAT AS "parquet" + (order_id, order_date, customer_name, amount, country) + ) + WITH S3 + ( + "provider" = "GCP", + "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", + "s3.region" = "asia-southeast1", + "s3.access_key" = "<ak>", + "s3.secret_key" = "<sk>" + ); + ``` + +2. **Check Load Status via SHOW LOAD** + + Since S3 Load import is submitted asynchronously, you can check the status of a specific label using SHOW LOAD: + + ```yaml + mysql> show load where label = "label_sales_data_2025_04_08"\G + *************************** 1. row *************************** + JobId: 17956078 + Label: label_sales_data_2025_04_08 + State: FINISHED + Progress: 100.00% (1/1) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: NULL + CreateTime: 2025-04-10 17:50:53 + EtlStartTime: 2025-04-10 17:50:54 + EtlFinishTime: 2025-04-10 17:50:54 + LoadStartTime: 2025-04-10 17:50:54 + LoadFinishTime: 2025-04-10 17:50:54 + URL: NULL + JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} + TransactionId: 766228 + ErrorTablets: {} + User: root + Comment: + 1 row in set (0.00 sec) + ``` + +3. **Handle Load Errors** + + When there are multiple load tasks, you can use the following statement to query the dates and reasons for data load failures. + + ```yaml + mysql> show load where state='CANCELLED' and label like "label_test%"\G + *************************** 1. row *************************** + JobId: 18312384 + Label: label_test123 + State: CANCELLED + Progress: 100.00% (3/3) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel + CreateTime: 2025-04-15 17:32:59 + EtlStartTime: 2025-04-15 17:33:02 + EtlFinishTime: 2025-04-15 17:33:02 + LoadStartTime: 2025-04-15 17:33:02 + LoadFinishTime: 2025-04-15 17:33:02 + URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 + JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} + TransactionId: 769213 + ErrorTablets: {} + User: root + Comment: + ``` + + As shown in the example above, the issue is a **data quality error**(ETL_QUALITY_UNSATISFIED). To view the detailed error, you need to visit the URL provided in the result. For example, the data exceeded the defined length of the country column in the table schema: + + ```python + [root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . 
src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line []; + ``` + + For data quality errors, if you want to allow skipping erroneous records, you can set a fault tolerance rate in the Properties section of the S3 Load task. For details, refer to [Load Configuration Parameters](../../import/import-way/broker-load-manual.md#related-configurations)。 + +1. **Loading data from multiple files** + + When migrating a large volume of historical data, it is recommended to use a batch load strategy. Each batch corresponds to one or a few partitions in Doris. It is recommended to keep the data size under 100GB per batch to reduce system load and lower the cost of retries in case of load failures. + + You can refer to the script [s3_load_file_demo.sh](https://github.com/apache/doris/blob/master/samples/load/shell/s3_load_file_demo.sh), which can split the file list under the specified directory on the object storage and submit multiple S3 Load tasks to Doris in batches to achieve the effect of batch load. diff --git a/versioned_docs/version-3.0/data-operate/import/data-source/bigquery.md b/versioned_docs/version-3.0/data-operate/import/data-source/bigquery.md new file mode 100644 index 00000000000..417535e3902 --- /dev/null +++ b/versioned_docs/version-3.0/data-operate/import/data-source/bigquery.md @@ -0,0 +1,220 @@ +--- +{ + "title": "BigQuery", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +During the process of migrating BigQuery, it is usually necessary to use object storage as an intermediate medium. The core process is as follows: First, use BigQuery's [Export](https://cloud.google.com/bigquery/docs/exporting-data) statement to export data to GCS (Google Cloud Storage); then, use Doris's S3 Load function to read data from the object storage and load it into Doris. For details, please refer to [S3 Load](./amazon-s3.md). + + +## Considerations + +1. Before the migration, it is necessary to select Doris' [Data Model](../../../table-design/data-model/overview.md), as well as the strategies for [Partitioning](../../../table-design/data-partitioning/dynamic-partitioning.md) and [Bucketing](../../../table-design/data-partitioning/data-bucketing.md) according to the table structure of BigQuery. For more table creation strategies, please refer to [Load Best Practices](../load-best-practices.md). +2. 
When BigQuery exports data in JSON type, it does not support exporting in Parquet format. You can export it in JSON format instead. +3. When BigQuery exports data of the Time type, it is necessary to export it after casting it to the String type. + +## Data type mapping + +| BigQuery | Doris | Comment | +| ------------------ | -------------- | -------------------- | +| Array | Array | | +| BOOLEAN | BOOLEAN | | +| DATE | DATE | | +| DATETIME/TIMESTAMP | DATETIME | | +| JSON | JSON | | +| INT64 | BIGINT | | +| NUMERIC | DECIMAL | | +| FLOAT64 | DOUBLE | | +| STRING | VARCHAR/STRING | VARCHAR maximum length is 65535 | +| STRUCT | STRUCT | | +| TIME | STRING | | +| OTHER | UNSUPPORTED | | + +## 1. Create Table + +When migrating a BigQuery table to Doris, it is necessary to create a Doris table first. + +Suppose we already have the following table and data in BigQuery. + +```SQL +CREATE OR REPLACE TABLE test.sales_data ( + order_id INT64, + customer_name STRING, + order_date DATE, + amount NUMERIC(10,2), + country STRING +) +PARTITION BY order_date + + +INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES +(1, 'Alice', '2025-04-08', 99.99, 'USA'), +(2, 'Bob', '2025-04-08', 149.50, 'Canada'), +(3, 'Charlie', '2025-04-09', 75.00, 'UK'), +(4, 'Diana', '2025-04-10', 200.00, 'Australia'); +``` + +According to this table structure, a Doris primary key partitioned table can be created. The partition field should be the same as that in BigQuery, and the table should be partitioned on a daily basis. + +```sql +CREATE TABLE `sales_data` ( + order_id INT, + order_date DATE NOT NULL, + customer_name VARCHAR(128), + amount DECIMAL(10,2), + country VARCHAR(48) +) ENGINE=OLAP +UNIQUE KEY(`order_id`,`order_date`) +PARTITION BY RANGE(`order_date`) () +DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 +PROPERTIES ( + "dynamic_partition.enable" = "true", + "dynamic_partition.time_unit" = "DAY", + "dynamic_partition.start" = "-10", + "dynamic_partition.end" = "10", + "dynamic_partition.prefix" = "p", + "dynamic_partition.buckets" = "16", + "replication_num" = "1" +); +``` + +## 2. Export BigQuery Data + +1. **Export to GCS Parquet format file through Export method** + ```sql + EXPORT DATA + OPTIONS ( + uri = 'gs://mybucket/export/sales_data/*.parquet', + format = 'PARQUET') + AS ( + SELECT * + FROM test.sales_data + ); + ``` +2. **View the exported files on GCS** + The above command will export the data of sales_data to GCS, and each partition will generate one or more files with increasing file names. For details, please refer to [exporting-data](https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files), as follows: +  + + +## 3. Load Data to Doris + +S3 Load is an asynchronous data load method. After execution, Doris actively pulls data from the data source. The data source supports object storage compatible with the S3 protocol, including ([AWS S3](./amazon-s3.md),[GCS](./google-cloud-storage.md),[AZURE](./azure-storage.md),etc) + +This method is suitable for scenarios involving large volumes of data that require asynchronous processing in the background. For data imports that need to be handled synchronously, refer to [TVF Load](./amazon-s3.md#load-with-tvf)。 + +*Note: For **Parquet/ORC format files that contain complex types (Struct/Array/Map)**, TVF Load must be used.* + +1. 
**Loading data from a single file** + + ```sql + LOAD LABEL sales_data_2025_04_08 + ( + DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") + INTO TABLE sales_data + FORMAT AS "parquet" + (order_id, order_date, customer_name, amount, country) + ) + WITH S3 + ( + "provider" = "GCP", + "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", + "s3.region" = "asia-southeast1", + "s3.access_key" = "<ak>", + "s3.secret_key" = "<sk>" + ); + ``` + +2. **Check Load Status via SHOW LOAD** + + Since S3 Load import is submitted asynchronously, you can check the status of a specific label using SHOW LOAD: + + ```yaml + mysql> show load where label = "label_sales_data_2025_04_08"\G + *************************** 1. row *************************** + JobId: 17956078 + Label: label_sales_data_2025_04_08 + State: FINISHED + Progress: 100.00% (1/1) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: NULL + CreateTime: 2025-04-10 17:50:53 + EtlStartTime: 2025-04-10 17:50:54 + EtlFinishTime: 2025-04-10 17:50:54 + LoadStartTime: 2025-04-10 17:50:54 + LoadFinishTime: 2025-04-10 17:50:54 + URL: NULL + JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} + TransactionId: 766228 + ErrorTablets: {} + User: root + Comment: + 1 row in set (0.00 sec) + ``` + +3. **Handle Load Errors** + + When there are multiple load tasks, you can use the following statement to query the dates and reasons for data load failures. + + ```yaml + mysql> show load where state='CANCELLED' and label like "label_test%"\G + *************************** 1. row *************************** + JobId: 18312384 + Label: label_test123 + State: CANCELLED + Progress: 100.00% (3/3) + Type: BROKER + EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 + TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL + ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel + CreateTime: 2025-04-15 17:32:59 + EtlStartTime: 2025-04-15 17:33:02 + EtlFinishTime: 2025-04-15 17:33:02 + LoadStartTime: 2025-04-15 17:33:02 + LoadFinishTime: 2025-04-15 17:33:02 + URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 + JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} + TransactionId: 769213 + ErrorTablets: {} + User: root + Comment: + ``` + + As shown in the example above, the issue is a **data quality error**(ETL_QUALITY_UNSATISFIED). To view the detailed error, you need to visit the URL provided in the result. For example, the data exceeded the defined length of the country column in the table schema: + + ```python + [root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . 
src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; + Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line []; + ``` + + For data quality errors, if you want to allow skipping erroneous records, you can set a fault tolerance rate in the Properties section of the S3 Load task. For details, refer to [Load Configuration Parameters](../../import/import-way/broker-load-manual.md#related-configurations)。 + +1. **Loading data from multiple files** + + When migrating a large volume of historical data, it is recommended to use a batch load strategy. Each batch corresponds to one or a few partitions in Doris. It is recommended to keep the data size under 100GB per batch to reduce system load and lower the cost of retries in case of load failures. + + You can refer to the script [s3_load_file_demo.sh](https://github.com/apache/doris/blob/master/samples/load/shell/s3_load_file_demo.sh), which can split the file list under the specified directory on the object storage and submit multiple S3 Load tasks to Doris in batches to achieve the effect of batch load. diff --git a/versioned_sidebars/version-2.1-sidebars.json b/versioned_sidebars/version-2.1-sidebars.json index 6069ea8f7f1..c8453959e32 100644 --- a/versioned_sidebars/version-2.1-sidebars.json +++ b/versioned_sidebars/version-2.1-sidebars.json @@ -134,6 +134,7 @@ "data-operate/import/data-source/minio", "data-operate/import/data-source/s3-compatible", "data-operate/import/data-source/snowflake", + "data-operate/import/data-source/bigquery", "data-operate/import/data-source/migrate-data-from-other-olap", "data-operate/import/data-source/migrate-data-from-other-oltp" ] diff --git a/versioned_sidebars/version-3.0-sidebars.json b/versioned_sidebars/version-3.0-sidebars.json index a17d533c196..2fd6f21b618 100644 --- a/versioned_sidebars/version-3.0-sidebars.json +++ b/versioned_sidebars/version-3.0-sidebars.json @@ -158,6 +158,7 @@ "data-operate/import/data-source/minio", "data-operate/import/data-source/s3-compatible", "data-operate/import/data-source/snowflake", + "data-operate/import/data-source/bigquery", "data-operate/import/data-source/migrate-data-from-other-olap", "data-operate/import/data-source/migrate-data-from-other-oltp" ] --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org