This is an automated email from the ASF dual-hosted git repository. morningman pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 4ae5a7f921c [iceberg] add pyiceberg doc for other branch (#1585)
4ae5a7f921c is described below

commit 4ae5a7f921c356ee2a1549735c001d6705732839
Author: Mingyu Chen (Rayner) <morning...@163.com>
AuthorDate: Tue Dec 24 21:04:28 2024 +0800

    [iceberg] add pyiceberg doc for other branch (#1585)

    ## Versions

    - [x] dev
    - [x] 3.0
    - [x] 2.1
    - [x] 2.0

    ## Languages

    - [x] Chinese
    - [x] English

    ## Docs Checklist

    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 .../tutorials/building-lakehouse/doris-iceberg.md  |   2 +
 .../tutorials/building-lakehouse/doris-iceberg.md  | 167 +++++++++++++++++++++
 .../tutorials/building-lakehouse/doris-iceberg.md  | 167 +++++++++++++++++++++
 .../tutorials/building-lakehouse/doris-iceberg.md  | 167 +++++++++++++++++++++
 .../tutorials/building-lakehouse/doris-iceberg.md  | 167 +++++++++++++++++++++
 .../tutorials/building-lakehouse/doris-iceberg.md  | 166 ++++++++++++++++++++
 .../tutorials/building-lakehouse/doris-iceberg.md  | 166 ++++++++++++++++++++
 .../tutorials/building-lakehouse/doris-iceberg.md  | 166 ++++++++++++++++++++
 8 files changed, 1168 insertions(+)

diff --git a/docs/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/docs/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index af851851f8b..c4f3d3438fd 100644
--- a/docs/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/docs/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -305,6 +305,8 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 
 ### 07 Interacting with PyIceberg
 
+> Please use Doris 2.1.8/3.0.4 or above.
+
 Load an Iceberg table:
 
 ```python
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index 3cc43ab17e4..16fc1aa20ec 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -304,3 +304,170 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 +-----------+---------+---------------+-------------+--------------------+----------------------------+
 4 rows in set (0.05 sec)
 ```
+
+### 07 Interacting with PyIceberg
+
+> Please use Doris 2.1.8/3.0.4 or above.
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Read as an Arrow Table:
+
+```python
+print(table.scan().to_arrow())
+
+pyarrow.Table
+vendor_id: int64
+trip_id: int64
+trip_distance: float
+fare_amount: double
+store_and_fwd_flag: large_string
+ts: timestamp[us]
+----
+vendor_id: [[1],[1],[2],[2]]
+trip_id: [[1000371],[1000374],[1000373],[1000372]]
+trip_distance: [[1.8],[8.4],[0.9],[2.5]]
+fare_amount: [[15.32],[42.13],[9.01],[22.15]]
+store_and_fwd_flag: [["N"],["Y"],["N"],["N"]]
+ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]]
+```
+
+Read as a Pandas DataFrame:
+
+```python
+print(table.scan().to_pandas())
+
+   vendor_id  trip_id  trip_distance  fare_amount  store_and_fwd_flag                  ts
+0          1  1000371            1.8        15.32                   N 2024-01-01 09:15:23
+1          1  1000374            8.4        42.13                   Y 2024-01-03 07:12:33
+2          2  1000373            0.9         9.01                   N 2024-01-01 03:25:15
+3          2  1000372            2.5        22.15                   N 2024-01-02 12:10:11
+```
+
+Read as a Polars DataFrame:
+
+```python
+import polars as pl
+
+print(pl.scan_iceberg(table).collect())
+
+shape: (4, 6)
+┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐
+│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts                  │
+│ ---       ┆ ---     ┆ ---           ┆ ---         ┆ ---                ┆ ---                 │
+│ i64       ┆ i64     ┆ f32           ┆ f64         ┆ str                ┆ datetime[μs]        │
+╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡
+│ 1         ┆ 1000371 ┆ 1.8           ┆ 15.32       ┆ N                  ┆ 2024-01-01 09:15:23 │
+│ 1         ┆ 1000374 ┆ 8.4           ┆ 42.13       ┆ Y                  ┆ 2024-01-03 07:12:33 │
+│ 2         ┆ 1000373 ┆ 0.9           ┆ 9.01        ┆ N                  ┆ 2024-01-01 03:25:15 │
+│ 2         ┆ 1000372 ┆ 2.5           ┆ 22.15       ┆ N                  ┆ 2024-01-02 12:10:11 │
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘
+```
+
+> To write Iceberg data with PyIceberg, see [the steps](#write-data-with-pyiceberg)
+
+### 08 Appendix
+
+#### Write data with PyIceberg
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Write an Arrow Table to Iceberg:
+
+```python
+import pyarrow as pa
+
+df = pa.Table.from_pydict(
+    {
+        "vendor_id": pa.array([1, 2, 2, 1], pa.int64()),
+        "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()),
+        "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()),
+        "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()),
+        "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()),
+        "ts": pa.compute.strptime(
+            ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+            "%Y-%m-%d %H:%M:%S",
+            "us",
+        ),
+    }
+)
+table.append(df)
+```
+
+Write a Pandas DataFrame to Iceberg:
+
+```python
+import pyarrow as pa
+import pandas as pd
+
+df = pd.DataFrame(
+    {
+        "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"),
+        "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"),
+        "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"),
+        "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"),
+        "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"),
+        "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"),
+    }
+)
+table.append(pa.Table.from_pandas(df))
+```
+
+Write a Polars DataFrame to Iceberg:
+
+```python
+import polars as pl
+
+df = pl.DataFrame(
+    {
+        "vendor_id": [1, 2, 2, 1],
+        "trip_id": [1000371, 1000372, 1000373, 1000374],
+        "trip_distance": [1.8, 2.5, 0.9, 8.4],
+        "fare_amount": [15.32, 22.15, 9.01, 42.13],
+        "store_and_fwd_flag": ["N", "N", "N", "Y"],
+        "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+    },
+    {
+        "vendor_id": pl.Int64,
+        "trip_id": pl.Int64,
+        "trip_distance": pl.Float32,
+        "fare_amount": pl.Float64,
+        "store_and_fwd_flag": pl.String,
+        "ts": pl.String,
+    },
+).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
+table.append(df.to_arrow())
+```
+
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index 3cc43ab17e4..16fc1aa20ec 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -304,3 +304,170 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 +-----------+---------+---------------+-------------+--------------------+----------------------------+
 4 rows in set (0.05 sec)
 ```
+
+### 07 Interacting with PyIceberg
+
+> Please use Doris 2.1.8/3.0.4 or above.
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Read as an Arrow Table:
+
+```python
+print(table.scan().to_arrow())
+
+pyarrow.Table
+vendor_id: int64
+trip_id: int64
+trip_distance: float
+fare_amount: double
+store_and_fwd_flag: large_string
+ts: timestamp[us]
+----
+vendor_id: [[1],[1],[2],[2]]
+trip_id: [[1000371],[1000374],[1000373],[1000372]]
+trip_distance: [[1.8],[8.4],[0.9],[2.5]]
+fare_amount: [[15.32],[42.13],[9.01],[22.15]]
+store_and_fwd_flag: [["N"],["Y"],["N"],["N"]]
+ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]]
+```
+
+Read as a Pandas DataFrame:
+
+```python
+print(table.scan().to_pandas())
+
+   vendor_id  trip_id  trip_distance  fare_amount  store_and_fwd_flag                  ts
+0          1  1000371            1.8        15.32                   N 2024-01-01 09:15:23
+1          1  1000374            8.4        42.13                   Y 2024-01-03 07:12:33
+2          2  1000373            0.9         9.01                   N 2024-01-01 03:25:15
+3          2  1000372            2.5        22.15                   N 2024-01-02 12:10:11
+```
+
+Read as a Polars DataFrame:
+
+```python
+import polars as pl
+
+print(pl.scan_iceberg(table).collect())
+
+shape: (4, 6)
+┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐
+│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts                  │
+│ ---       ┆ ---     ┆ ---           ┆ ---         ┆ ---                ┆ ---                 │
+│ i64       ┆ i64     ┆ f32           ┆ f64         ┆ str                ┆ datetime[μs]        │
+╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡
+│ 1         ┆ 1000371 ┆ 1.8           ┆ 15.32       ┆ N                  ┆ 2024-01-01 09:15:23 │
+│ 1         ┆ 1000374 ┆ 8.4           ┆ 42.13       ┆ Y                  ┆ 2024-01-03 07:12:33 │
+│ 2         ┆ 1000373 ┆ 0.9           ┆ 9.01        ┆ N                  ┆ 2024-01-01 03:25:15 │
+│ 2         ┆ 1000372 ┆ 2.5           ┆ 22.15       ┆ N                  ┆ 2024-01-02 12:10:11 │
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘
+```
+
+> To write Iceberg data with PyIceberg, see [the steps](#write-data-with-pyiceberg)
+
+### 08 Appendix
+
+#### Write data with PyIceberg
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Write an Arrow Table to Iceberg:
+
+```python
+import pyarrow as pa
+
+df = pa.Table.from_pydict(
+    {
+        "vendor_id": pa.array([1, 2, 2, 1], pa.int64()),
+        "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()),
+        "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()),
+        "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()),
+        "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()),
+        "ts": pa.compute.strptime(
+            ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+            "%Y-%m-%d %H:%M:%S",
+            "us",
+        ),
+    }
+)
+table.append(df)
+```
+
+Write a Pandas DataFrame to Iceberg:
+
+```python
+import pyarrow as pa
+import pandas as pd
+
+df = pd.DataFrame(
+    {
+        "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"),
+        "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"),
+        "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"),
+        "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"),
+        "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"),
+        "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"),
+    }
+)
+table.append(pa.Table.from_pandas(df))
+```
+
+Write a Polars DataFrame to Iceberg:
+
+```python
+import polars as pl
+
+df = pl.DataFrame(
+    {
+        "vendor_id": [1, 2, 2, 1],
+        "trip_id": [1000371, 1000372, 1000373, 1000374],
+        "trip_distance": [1.8, 2.5, 0.9, 8.4],
+        "fare_amount": [15.32, 22.15, 9.01, 42.13],
+        "store_and_fwd_flag": ["N", "N", "N", "Y"],
+        "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+    },
+    {
+        "vendor_id": pl.Int64,
+        "trip_id": pl.Int64,
+        "trip_distance": pl.Float32,
+        "fare_amount": pl.Float64,
+        "store_and_fwd_flag": pl.String,
+        "ts": pl.String,
+    },
+).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
+table.append(df.to_arrow())
+```
+
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index 3cc43ab17e4..16fc1aa20ec 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -304,3 +304,170 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 +-----------+---------+---------------+-------------+--------------------+----------------------------+
 4 rows in set (0.05 sec)
 ```
+
+### 07 Interacting with PyIceberg
+
+> Please use Doris 2.1.8/3.0.4 or above.
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Read as an Arrow Table:
+
+```python
+print(table.scan().to_arrow())
+
+pyarrow.Table
+vendor_id: int64
+trip_id: int64
+trip_distance: float
+fare_amount: double
+store_and_fwd_flag: large_string
+ts: timestamp[us]
+----
+vendor_id: [[1],[1],[2],[2]]
+trip_id: [[1000371],[1000374],[1000373],[1000372]]
+trip_distance: [[1.8],[8.4],[0.9],[2.5]]
+fare_amount: [[15.32],[42.13],[9.01],[22.15]]
+store_and_fwd_flag: [["N"],["Y"],["N"],["N"]]
+ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]]
+```
+
+Read as a Pandas DataFrame:
+
+```python
+print(table.scan().to_pandas())
+
+   vendor_id  trip_id  trip_distance  fare_amount  store_and_fwd_flag                  ts
+0          1  1000371            1.8        15.32                   N 2024-01-01 09:15:23
+1          1  1000374            8.4        42.13                   Y 2024-01-03 07:12:33
+2          2  1000373            0.9         9.01                   N 2024-01-01 03:25:15
+3          2  1000372            2.5        22.15                   N 2024-01-02 12:10:11
+```
+
+Read as a Polars DataFrame:
+
+```python
+import polars as pl
+
+print(pl.scan_iceberg(table).collect())
+
+shape: (4, 6)
+┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐
+│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts                  │
+│ ---       ┆ ---     ┆ ---           ┆ ---         ┆ ---                ┆ ---                 │
+│ i64       ┆ i64     ┆ f32           ┆ f64         ┆ str                ┆ datetime[μs]        │
+╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡
+│ 1         ┆ 1000371 ┆ 1.8           ┆ 15.32       ┆ N                  ┆ 2024-01-01 09:15:23 │
+│ 1         ┆ 1000374 ┆ 8.4           ┆ 42.13       ┆ Y                  ┆ 2024-01-03 07:12:33 │
+│ 2         ┆ 1000373 ┆ 0.9           ┆ 9.01        ┆ N                  ┆ 2024-01-01 03:25:15 │
+│ 2         ┆ 1000372 ┆ 2.5           ┆ 22.15       ┆ N                  ┆ 2024-01-02 12:10:11 │
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘
+```
+
+> To write Iceberg data with PyIceberg, see [the steps](#write-data-with-pyiceberg)
+
+### 08 Appendix
+
+#### Write data with PyIceberg
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Write an Arrow Table to Iceberg:
+
+```python
+import pyarrow as pa
+
+df = pa.Table.from_pydict(
+    {
+        "vendor_id": pa.array([1, 2, 2, 1], pa.int64()),
+        "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()),
+        "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()),
+        "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()),
+        "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()),
+        "ts": pa.compute.strptime(
+            ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+            "%Y-%m-%d %H:%M:%S",
+            "us",
+        ),
+    }
+)
+table.append(df)
+```
+
+Write a Pandas DataFrame to Iceberg:
+
+```python
+import pyarrow as pa
+import pandas as pd
+
+df = pd.DataFrame(
+    {
+        "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"),
+        "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"),
+        "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"),
+        "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"),
+        "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"),
+        "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"),
+    }
+)
+table.append(pa.Table.from_pandas(df))
+```
+
+Write a Polars DataFrame to Iceberg:
+
+```python
+import polars as pl
+
+df = pl.DataFrame(
+    {
+        "vendor_id": [1, 2, 2, 1],
+        "trip_id": [1000371, 1000372, 1000373, 1000374],
+        "trip_distance": [1.8, 2.5, 0.9, 8.4],
+        "fare_amount": [15.32, 22.15, 9.01, 42.13],
+        "store_and_fwd_flag": ["N", "N", "N", "Y"],
+        "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+    },
+    {
+        "vendor_id": pl.Int64,
+        "trip_id": pl.Int64,
+        "trip_distance": pl.Float32,
+        "fare_amount": pl.Float64,
+        "store_and_fwd_flag": pl.String,
+        "ts": pl.String,
+    },
+).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
+table.append(df.to_arrow())
+```
+
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index 3cc43ab17e4..16fc1aa20ec 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -304,3 +304,170 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 +-----------+---------+---------------+-------------+--------------------+----------------------------+
 4 rows in set (0.05 sec)
 ```
+
+### 07 Interacting with PyIceberg
+
+> Please use Doris 2.1.8/3.0.4 or above.
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Read as an Arrow Table:
+
+```python
+print(table.scan().to_arrow())
+
+pyarrow.Table
+vendor_id: int64
+trip_id: int64
+trip_distance: float
+fare_amount: double
+store_and_fwd_flag: large_string
+ts: timestamp[us]
+----
+vendor_id: [[1],[1],[2],[2]]
+trip_id: [[1000371],[1000374],[1000373],[1000372]]
+trip_distance: [[1.8],[8.4],[0.9],[2.5]]
+fare_amount: [[15.32],[42.13],[9.01],[22.15]]
+store_and_fwd_flag: [["N"],["Y"],["N"],["N"]]
+ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]]
+```
+
+Read as a Pandas DataFrame:
+
+```python
+print(table.scan().to_pandas())
+
+   vendor_id  trip_id  trip_distance  fare_amount  store_and_fwd_flag                  ts
+0          1  1000371            1.8        15.32                   N 2024-01-01 09:15:23
+1          1  1000374            8.4        42.13                   Y 2024-01-03 07:12:33
+2          2  1000373            0.9         9.01                   N 2024-01-01 03:25:15
+3          2  1000372            2.5        22.15                   N 2024-01-02 12:10:11
+```
+
+Read as a Polars DataFrame:
+
+```python
+import polars as pl
+
+print(pl.scan_iceberg(table).collect())
+
+shape: (4, 6)
+┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐
+│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts                  │
+│ ---       ┆ ---     ┆ ---           ┆ ---         ┆ ---                ┆ ---                 │
+│ i64       ┆ i64     ┆ f32           ┆ f64         ┆ str                ┆ datetime[μs]        │
+╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡
+│ 1         ┆ 1000371 ┆ 1.8           ┆ 15.32       ┆ N                  ┆ 2024-01-01 09:15:23 │
+│ 1         ┆ 1000374 ┆ 8.4           ┆ 42.13       ┆ Y                  ┆ 2024-01-03 07:12:33 │
+│ 2         ┆ 1000373 ┆ 0.9           ┆ 9.01        ┆ N                  ┆ 2024-01-01 03:25:15 │
+│ 2         ┆ 1000372 ┆ 2.5           ┆ 22.15       ┆ N                  ┆ 2024-01-02 12:10:11 │
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘
+```
+
+> To write Iceberg data with PyIceberg, see [the steps](#write-data-with-pyiceberg)
+
+### 08 Appendix
+
+#### Write data with PyIceberg
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Write an Arrow Table to Iceberg:
+
+```python
+import pyarrow as pa
+
+df = pa.Table.from_pydict(
+    {
+        "vendor_id": pa.array([1, 2, 2, 1], pa.int64()),
+        "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()),
+        "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()),
+        "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()),
+        "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()),
+        "ts": pa.compute.strptime(
+            ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+            "%Y-%m-%d %H:%M:%S",
+            "us",
+        ),
+    }
+)
+table.append(df)
+```
+
+Write a Pandas DataFrame to Iceberg:
+
+```python
+import pyarrow as pa
+import pandas as pd
+
+df = pd.DataFrame(
+    {
+        "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"),
+        "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"),
+        "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"),
+        "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"),
+        "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"),
+        "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"),
+    }
+)
+table.append(pa.Table.from_pandas(df))
+```
+
+Write a Polars DataFrame to Iceberg:
+
+```python
+import polars as pl
+
+df = pl.DataFrame(
+    {
+        "vendor_id": [1, 2, 2, 1],
+        "trip_id": [1000371, 1000372, 1000373, 1000374],
+        "trip_distance": [1.8, 2.5, 0.9, 8.4],
+        "fare_amount": [15.32, 22.15, 9.01, 42.13],
+        "store_and_fwd_flag": ["N", "N", "N", "Y"],
+        "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+    },
+    {
+        "vendor_id": pl.Int64,
+        "trip_id": pl.Int64,
+        "trip_distance": pl.Float32,
+        "fare_amount": pl.Float64,
+        "store_and_fwd_flag": pl.String,
+        "ts": pl.String,
+    },
+).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
+table.append(df.to_arrow())
+```
+
diff --git a/versioned_docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/versioned_docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index 3a6159407bc..c4f3d3438fd 100644
--- a/versioned_docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/versioned_docs/version-2.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -302,3 +302,169 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 +-----------+---------+---------------+-------------+--------------------+----------------------------+
 4 rows in set (0.05 sec)
 ```
+
+### 07 Interacting with PyIceberg
+
+> Please use Doris 2.1.8/3.0.4 or above.
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Read table as `Arrow Table`:
+
+```python
+print(table.scan().to_arrow())
+
+pyarrow.Table
+vendor_id: int64
+trip_id: int64
+trip_distance: float
+fare_amount: double
+store_and_fwd_flag: large_string
+ts: timestamp[us]
+----
+vendor_id: [[1],[1],[2],[2]]
+trip_id: [[1000371],[1000374],[1000373],[1000372]]
+trip_distance: [[1.8],[8.4],[0.9],[2.5]]
+fare_amount: [[15.32],[42.13],[9.01],[22.15]]
+store_and_fwd_flag: [["N"],["Y"],["N"],["N"]]
+ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]]
+```
+
+Read table as `Pandas DataFrame`:
+
+```python
+print(table.scan().to_pandas())
+
+   vendor_id  trip_id  trip_distance  fare_amount  store_and_fwd_flag                  ts
+0          1  1000371            1.8        15.32                   N 2024-01-01 09:15:23
+1          1  1000374            8.4        42.13                   Y 2024-01-03 07:12:33
+2          2  1000373            0.9         9.01                   N 2024-01-01 03:25:15
+3          2  1000372            2.5        22.15                   N 2024-01-02 12:10:11
+```
+
+Read table as `Polars DataFrame`:
+
+```python
+import polars as pl
+
+print(pl.scan_iceberg(table).collect())
+
+shape: (4, 6)
+┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐
+│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts                  │
+│ ---       ┆ ---     ┆ ---           ┆ ---         ┆ ---                ┆ ---                 │
+│ i64       ┆ i64     ┆ f32           ┆ f64         ┆ str                ┆ datetime[μs]        │
+╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡
+│ 1         ┆ 1000371 ┆ 1.8           ┆ 15.32       ┆ N                  ┆ 2024-01-01 09:15:23 │
+│ 1         ┆ 1000374 ┆ 8.4           ┆ 42.13       ┆ Y                  ┆ 2024-01-03 07:12:33 │
+│ 2         ┆ 1000373 ┆ 0.9           ┆ 9.01        ┆ N                  ┆ 2024-01-01 03:25:15 │
+│ 2         ┆ 1000372 ┆ 2.5           ┆ 22.15       ┆ N                  ┆ 2024-01-02 12:10:11 │
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘
+```
+
+> To write to an Iceberg table with PyIceberg, see [the steps in the appendix](#write-iceberg-table-by-pyiceberg).
+
+### 08 Appendix
+
+#### Write Iceberg table by PyIceberg
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Write table with `Arrow Table`:
+
+```python
+import pyarrow as pa
+
+df = pa.Table.from_pydict(
+    {
+        "vendor_id": pa.array([1, 2, 2, 1], pa.int64()),
+        "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()),
+        "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()),
+        "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()),
+        "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()),
+        "ts": pa.compute.strptime(
+            ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+            "%Y-%m-%d %H:%M:%S",
+            "us",
+        ),
+    }
+)
+table.append(df)
+```
+
+Write table with `Pandas DataFrame`:
+
+```python
+import pyarrow as pa
+import pandas as pd
+
+df = pd.DataFrame(
+    {
+        "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"),
+        "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"),
+        "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"),
+        "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"),
+        "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"),
+        "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"),
+    }
+)
+table.append(pa.Table.from_pandas(df))
+```
+
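The timestamp strings in the write examples use a non-padded hour such as `9:15:23` together with the `%H` directive. As a quick sanity check, the sketch below parses the same strings with the standard library's `datetime.strptime`, not with the Arrow, Pandas, or Polars parsers the tutorial uses. The names `RAW_TS`, `TS_FORMAT`, and `parse_timestamps` are hypothetical helpers introduced only for this illustration, and whether each library's own parser is equally lenient is an assumption worth verifying in your environment.

```python
from datetime import datetime

# Timestamp strings copied from the tutorial's write examples.
RAW_TS = [
    "2024-01-01 9:15:23",
    "2024-01-02 12:10:11",
    "2024-01-01 3:25:15",
    "2024-01-03 7:12:33",
]
# Format string used by the Arrow and Polars write examples.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamps(values, fmt=TS_FORMAT):
    """Parse each string with the stdlib parser (illustrative helper only)."""
    return [datetime.strptime(value, fmt) for value in values]


parsed = parse_timestamps(RAW_TS)
for ts in parsed:
    # Print in the zero-padded form that the read examples display.
    print(ts.isoformat(sep=" "))
```

If a parser in your stack rejects the non-padded hours, zero-padding the input strings (for example `09:15:23`) is the simplest workaround.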
+Write table with `Polars DataFrame`:
+
+```python
+import polars as pl
+
+df = pl.DataFrame(
+    {
+        "vendor_id": [1, 2, 2, 1],
+        "trip_id": [1000371, 1000372, 1000373, 1000374],
+        "trip_distance": [1.8, 2.5, 0.9, 8.4],
+        "fare_amount": [15.32, 22.15, 9.01, 42.13],
+        "store_and_fwd_flag": ["N", "N", "N", "Y"],
+        "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+    },
+    {
+        "vendor_id": pl.Int64,
+        "trip_id": pl.Int64,
+        "trip_distance": pl.Float32,
+        "fare_amount": pl.Float64,
+        "store_and_fwd_flag": pl.String,
+        "ts": pl.String,
+    },
+).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
+table.append(df.to_arrow())
+```
diff --git a/versioned_docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/versioned_docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
index 3a6159407bc..c4f3d3438fd 100644
--- a/versioned_docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
+++ b/versioned_docs/version-2.1/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md
@@ -302,3 +302,169 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22";
 +-----------+---------+---------------+-------------+--------------------+----------------------------+
 4 rows in set (0.05 sec)
 ```
+
+### 07 Interacting with PyIceberg
+
+> Please use Doris 2.1.8/3.0.4 or above.
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Read table as `Arrow Table`:
+
+```python
+print(table.scan().to_arrow())
+
+pyarrow.Table
+vendor_id: int64
+trip_id: int64
+trip_distance: float
+fare_amount: double
+store_and_fwd_flag: large_string
+ts: timestamp[us]
+----
+vendor_id: [[1],[1],[2],[2]]
+trip_id: [[1000371],[1000374],[1000373],[1000372]]
+trip_distance: [[1.8],[8.4],[0.9],[2.5]]
+fare_amount: [[15.32],[42.13],[9.01],[22.15]]
+store_and_fwd_flag: [["N"],["Y"],["N"],["N"]]
+ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]]
+```
+
+Read table as `Pandas DataFrame`:
+
+```python
+print(table.scan().to_pandas())
+
+   vendor_id  trip_id  trip_distance  fare_amount  store_and_fwd_flag                  ts
+0          1  1000371            1.8        15.32                   N 2024-01-01 09:15:23
+1          1  1000374            8.4        42.13                   Y 2024-01-03 07:12:33
+2          2  1000373            0.9         9.01                   N 2024-01-01 03:25:15
+3          2  1000372            2.5        22.15                   N 2024-01-02 12:10:11
+```
+
+Read table as `Polars DataFrame`:
+
+```python
+import polars as pl
+
+print(pl.scan_iceberg(table).collect())
+
+shape: (4, 6)
+┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐
+│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts                  │
+│ ---       ┆ ---     ┆ ---           ┆ ---         ┆ ---                ┆ ---                 │
+│ i64       ┆ i64     ┆ f32           ┆ f64         ┆ str                ┆ datetime[μs]        │
+╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡
+│ 1         ┆ 1000371 ┆ 1.8           ┆ 15.32       ┆ N                  ┆ 2024-01-01 09:15:23 │
+│ 1         ┆ 1000374 ┆ 8.4           ┆ 42.13       ┆ Y                  ┆ 2024-01-03 07:12:33 │
+│ 2         ┆ 1000373 ┆ 0.9           ┆ 9.01        ┆ N                  ┆ 2024-01-01 03:25:15 │
+│ 2         ┆ 1000372 ┆ 2.5           ┆ 22.15       ┆ N                  ┆ 2024-01-02 12:10:11 │
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘
+```
+
+> To write to an Iceberg table with PyIceberg, see [the steps in the appendix](#write-iceberg-table-by-pyiceberg).
+
+### 08 Appendix
+
+#### Write Iceberg table by PyIceberg
+
+Load an Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog(
+    "iceberg",
+    **{
+        "warehouse": "warehouse",
+        "uri": "http://rest:8181",
+        "s3.access-key-id": "admin",
+        "s3.secret-access-key": "password",
+        "s3.endpoint": "http://minio:9000"
+    },
+)
+table = catalog.load_table("nyc.taxis")
+```
+
+Write table with `Arrow Table`:
+
+```python
+import pyarrow as pa
+
+df = pa.Table.from_pydict(
+    {
+        "vendor_id": pa.array([1, 2, 2, 1], pa.int64()),
+        "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()),
+        "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()),
+        "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()),
+        "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()),
+        "ts": pa.compute.strptime(
+            ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"],
+            "%Y-%m-%d %H:%M:%S",
+            "us",
+        ),
+    }
+)
+table.append(df)
+```
+
+Write table with `Pandas DataFrame`:
+
+```python
+import pyarrow as pa
+import pandas as pd
+
+df = pd.DataFrame(
+    {
+        "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"),
+        "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"),
+        "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"),
+        "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"),
+        "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"),
+        "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"),
+    }
+)
+table.append(pa.Table.from_pandas(df))
+```
+
+Write table with `Polars DataFrame`:
+
+```python
+import polars as pl + +df = pl.DataFrame( + { + "vendor_id": [1, 2, 2, 1], + "trip_id": [1000371, 1000372, 1000373, 1000374], + "trip_distance": [1.8, 2.5, 0.9, 8.4], + "fare_amount": [15.32, 22.15, 9.01, 42.13], + "store_and_fwd_flag": ["N", "N", "N", "Y"], + "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"], + }, + { + "vendor_id": pl.Int64, + "trip_id": pl.Int64, + "trip_distance": pl.Float32, + "fare_amount": pl.Float64, + "store_and_fwd_flag": pl.String, + "ts": pl.String, + }, +).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")) +table.append(df.to_arrow()) +``` diff --git a/versioned_docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md b/versioned_docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md index 3a6159407bc..c4f3d3438fd 100644 --- a/versioned_docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md +++ b/versioned_docs/version-3.0/gettingStarted/tutorials/building-lakehouse/doris-iceberg.md @@ -302,3 +302,169 @@ mysql> SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-07-29 03:40:22"; +-----------+---------+---------------+-------------+--------------------+----------------------------+ 4 rows in set (0.05 sec) ``` + +### 07 Interacting with PyIceberg + +> Please use Doris 2.1.8/3.0.4 or above. 
+ +Load an Iceberg table: + +```python +from pyiceberg.catalog import load_catalog + +catalog = load_catalog( + "iceberg", + **{ + "warehouse": "warehouse", + "uri": "http://rest:8181", + "s3.access-key-id": "admin", + "s3.secret-access-key": "password", + "s3.endpoint": "http://minio:9000" + }, +) +table = catalog.load_table("nyc.taxis") +``` + +Read table as `Arrow Table`: + +```python +print(table.scan().to_arrow()) + +pyarrow.Table +vendor_id: int64 +trip_id: int64 +trip_distance: float +fare_amount: double +store_and_fwd_flag: large_string +ts: timestamp[us] +---- +vendor_id: [[1],[1],[2],[2]] +trip_id: [[1000371],[1000374],[1000373],[1000372]] +trip_distance: [[1.8],[8.4],[0.9],[2.5]] +fare_amount: [[15.32],[42.13],[9.01],[22.15]] +store_and_fwd_flag: [["N"],["Y"],["N"],["N"]] +ts: [[2024-01-01 09:15:23.000000],[2024-01-03 07:12:33.000000],[2024-01-01 03:25:15.000000],[2024-01-02 12:10:11.000000]] +``` + +Read table as `Pandas DataFrame`: + +```python +print(table.scan().to_pandas()) + +vendor_id trip_id trip_distance fare_amount store_and_fwd_flag ts +0 1 1000371 1.8 15.32 N 2024-01-01 09:15:23 +1 1 1000374 8.4 42.13 Y 2024-01-03 07:12:33 +2 2 1000373 0.9 9.01 N 2024-01-01 03:25:15 +3 2 1000372 2.5 22.15 N 2024-01-02 12:10:11 +``` + +Read table as `Polars DataFrame`: + +```python +import polars as pl + +print(pl.scan_iceberg(table).collect()) + +shape: (4, 6) +┌───────────┬─────────┬───────────────┬─────────────┬────────────────────┬─────────────────────┐ +│ vendor_id ┆ trip_id ┆ trip_distance ┆ fare_amount ┆ store_and_fwd_flag ┆ ts │ +│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ +│ i64 ┆ i64 ┆ f32 ┆ f64 ┆ str ┆ datetime[μs] │ +╞═══════════╪═════════╪═══════════════╪═════════════╪════════════════════╪═════════════════════╡ +│ 1 ┆ 1000371 ┆ 1.8 ┆ 15.32 ┆ N ┆ 2024-01-01 09:15:23 │ +│ 1 ┆ 1000374 ┆ 8.4 ┆ 42.13 ┆ Y ┆ 2024-01-03 07:12:33 │ +│ 2 ┆ 1000373 ┆ 0.9 ┆ 9.01 ┆ N ┆ 2024-01-01 03:25:15 │ +│ 2 ┆ 1000372 ┆ 2.5 ┆ 22.15 ┆ N ┆ 2024-01-02 12:10:11 │ 
+└───────────┴─────────┴───────────────┴─────────────┴────────────────────┴─────────────────────┘ +``` + +> To write to an Iceberg table with PyIceberg, see [Write iceberg table by PyIceberg](#write-iceberg-table-by-pyiceberg). + +### 08 Appendix + +#### Write iceberg table by PyIceberg + +Load an Iceberg table: + +```python +from pyiceberg.catalog import load_catalog + +catalog = load_catalog( + "iceberg", + **{ + "warehouse": "warehouse", + "uri": "http://rest:8181", + "s3.access-key-id": "admin", + "s3.secret-access-key": "password", + "s3.endpoint": "http://minio:9000" + }, +) +table = catalog.load_table("nyc.taxis") +``` + +Write table with `Arrow Table`: + +```python +import pyarrow as pa +import pyarrow.compute as pc + +df = pa.Table.from_pydict( + { + "vendor_id": pa.array([1, 2, 2, 1], pa.int64()), + "trip_id": pa.array([1000371, 1000372, 1000373, 1000374], pa.int64()), + "trip_distance": pa.array([1.8, 2.5, 0.9, 8.4], pa.float32()), + "fare_amount": pa.array([15.32, 22.15, 9.01, 42.13], pa.float64()), + "store_and_fwd_flag": pa.array(["N", "N", "N", "Y"], pa.string()), + "ts": pc.strptime( + ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"], + "%Y-%m-%d %H:%M:%S", + "us", + ), + } +) +table.append(df) +``` + +Write table with `Pandas DataFrame`: + +```python +import pyarrow as pa +import pandas as pd + +df = pd.DataFrame( + { + "vendor_id": pd.Series([1, 2, 2, 1]).astype("int64[pyarrow]"), + "trip_id": pd.Series([1000371, 1000372, 1000373, 1000374]).astype("int64[pyarrow]"), + "trip_distance": pd.Series([1.8, 2.5, 0.9, 8.4]).astype("float32[pyarrow]"), + "fare_amount": pd.Series([15.32, 22.15, 9.01, 42.13]).astype("float64[pyarrow]"), + "store_and_fwd_flag": pd.Series(["N", "N", "N", "Y"]).astype("string[pyarrow]"), + "ts": pd.Series(["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"]).astype("timestamp[us][pyarrow]"), + } +) +table.append(pa.Table.from_pandas(df)) +``` + +Write table with `Polars DataFrame`: + +```python 
+import polars as pl + +df = pl.DataFrame( + { + "vendor_id": [1, 2, 2, 1], + "trip_id": [1000371, 1000372, 1000373, 1000374], + "trip_distance": [1.8, 2.5, 0.9, 8.4], + "fare_amount": [15.32, 22.15, 9.01, 42.13], + "store_and_fwd_flag": ["N", "N", "N", "Y"], + "ts": ["2024-01-01 9:15:23", "2024-01-02 12:10:11", "2024-01-01 3:25:15", "2024-01-03 7:12:33"], + }, + { + "vendor_id": pl.Int64, + "trip_id": pl.Int64, + "trip_distance": pl.Float32, + "fare_amount": pl.Float64, + "store_and_fwd_flag": pl.String, + "ts": pl.String, + }, +).with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")) +table.append(df.to_arrow()) +```
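The catalog settings passed to `load_catalog` in the snippets of this commit are an ordinary Python dict, so each entry has to be written as `key: value`. The following is a minimal sketch of that configuration, assuming the same endpoints and credentials as the tutorial environment (`rest`, `minio`, `admin`/`password`). The helper names `catalog_properties` and `load_taxis_table` are hypothetical and not part of the committed docs or of PyIceberg.

```python
def catalog_properties() -> dict:
    """Return the REST catalog and object-store settings used in the tutorial."""
    # Values are copied from the tutorial snippets; treat them as placeholders
    # and adjust them for any environment other than the demo setup.
    return {
        "warehouse": "warehouse",
        "uri": "http://rest:8181",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.endpoint": "http://minio:9000",
    }


def load_taxis_table(name: str = "nyc.taxis"):
    """Load an Iceberg table through PyIceberg (hypothetical helper)."""
    # Imported inside the function so that this module can be imported and
    # inspected on a machine where PyIceberg is not installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("iceberg", **catalog_properties())
    return catalog.load_table(name)
```

With the catalog and object store from the tutorial running, `load_taxis_table().scan().to_arrow()` would be expected to return the same Arrow table as the read example in the docs. Keeping the properties in one function is only a convenience for reusing them between the read and write snippets.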