bryanck commented on code in PR #8171:
URL: https://github.com/apache/iceberg/pull/8171#discussion_r1285324824
##########
docs/gcs.md:
##########
@@ -0,0 +1,139 @@
+---
+title: "GCS"
+url: gcs
+menu:
+    main:
+        parent: Integrations
+        identifier: gcs_integration
+        weight: 0
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg GCS Integration
+
+Google Cloud Storage (GCS) is a scalable object storage service known for its durability, throughput, and availability. It is designed to handle large amounts of unstructured data, making it a strong fit for large-scale data workloads. Combining GCS's storage capabilities with Apache Iceberg's handling of large tabular datasets provides a powerful platform for data management: vast amounts of data can be stored in GCS, and complex data operations can be run directly on it.
+
+## Configuring Apache Iceberg to Use GCS with Spark
+
+To configure Apache Iceberg to use GCS with Spark, follow these steps:
+
+- **Add a catalog**: Iceberg supports multiple catalog back-ends for tracking tables. In this case, we will use the SparkSessionCatalog together with a path-based SparkCatalog named `local`. Configure the catalogs by setting the following properties:
+  - `spark.sql.catalog.spark_catalog`: Set this property to `org.apache.iceberg.spark.SparkSessionCatalog` to use the SparkSessionCatalog.
+  - `spark.sql.catalog.spark_catalog.type`: Set this property to `hive` to use the Hive catalog type.
+  - `spark.sql.catalog.local`: Set this property to `org.apache.iceberg.spark.SparkCatalog` to use the SparkCatalog.
+  - `spark.sql.catalog.local.type`: Set this property to `hadoop` to use the Hadoop catalog type.
+  - `spark.sql.catalog.local.warehouse`: Set this property to the GCS path where you want to store the Iceberg tables.
+
+Here is an example of how to configure the catalogs using the Spark SQL command line:
+
+```bash
+spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
+    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
+    --conf spark.sql.catalog.spark_catalog.type=hive \
+    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.local.type=hadoop \
+    --conf spark.sql.catalog.local.warehouse=<GCS_PATH>
+```
+
+- **Add Iceberg to Spark**: If you already have a Spark environment, you can add Iceberg by specifying the `--packages` option when starting Spark. This downloads the required Iceberg package and makes it available in your Spark session. Here is an example:
+
+```bash
+spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1

Review Comment:
   This currently isn't enough.
   The runtime doesn't currently contain `iceberg-gcp`, nor does it contain the Google Java client libraries. See https://github.com/apache/iceberg/pull/8231 for a proposal around this. You might also want to cover configuring authentication, e.g. setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable or setting the GCSFileIO properties for that.
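   As an illustration of this suggestion, here is a minimal sketch of what the doc's example might look like once `iceberg-gcp`, the Google Cloud Storage client library, and authentication are accounted for. The extra Maven coordinates and versions, the choice of a Hive-backed catalog, and the `gcs.project-id` property are assumptions for illustration, not settings confirmed in this thread:

```bash
# Sketch only: package versions and the Hive catalog choice are illustrative assumptions.
# Authentication here uses Application Default Credentials via a service account key file.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

# iceberg-gcp and the GCS client library are added explicitly, since (per the comment
# above) the Spark runtime jar does not bundle them. Hive metastore settings are omitted.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,org.apache.iceberg:iceberg-gcp:1.3.1,com.google.cloud:google-cloud-storage:2.22.6 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.gcs_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.gcs_catalog.type=hive \
    --conf spark.sql.catalog.gcs_catalog.warehouse=gs://<BUCKET>/warehouse \
    --conf spark.sql.catalog.gcs_catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO \
    --conf spark.sql.catalog.gcs_catalog.gcs.project-id=<PROJECT_ID>
```

   Since the proposal in the PR linked above is about bundling these GCS dependencies, the explicit coordinates here are only a stop-gap and would need to be revisited against whatever that proposal settles on.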
