roryqi commented on code in PR #10778:
URL: https://github.com/apache/gravitino/pull/10778#discussion_r3084373715


##########
docs/iceberg-rest-engine/spark.md:
##########
@@ -0,0 +1,227 @@
+---
+title: Connect Spark via Iceberg REST
+sidebar_label: Spark
+---
+
+# Connecting Apache Spark via Iceberg REST
+
+Apache Gravitino exposes an [Iceberg REST catalog](../iceberg-rest-service.md) endpoint that any
+Iceberg-compatible engine can connect to directly, without installing a Gravitino-specific
+connector plugin. This page describes how to configure Apache Spark to use Gravitino's Iceberg REST
+(IRC) endpoint.
+
+:::note
+This integration uses the standard Apache Iceberg REST catalog specification. Gravitino enforces
+its full access-control model on all IRC requests. Per-user identity propagation from the engine is
+planned for a future release; current requests are authorized using the credentials supplied in the
+Spark configuration.
+:::
+
+## Prerequisites
+
+- Apache Gravitino running with the Iceberg REST service enabled. See
+  [Iceberg REST catalog service](../iceberg-rest-service.md) for setup instructions.
+- The Gravitino IRC endpoint is accessible from your Spark environment. The default port is `9001`.
+- The following JAR files available in your Spark environment:
+  - `iceberg-spark-runtime-3.5_2.12-1.7.1.jar`
+  - `hadoop-aws-3.3.4.jar`
+  - `aws-bundle-2.29.38.jar`
+
+This page uses **Spark 3.5.3** with **Iceberg 1.7.1**. For other versions, ensure compatibility
+between Spark, Scala, and Iceberg runtime versions.
+
+## Configuration
+
+`spark-defaults.conf` is Spark's persistent configuration file. Properties set here are
+automatically applied to every Spark session, with no command-line flags needed. The file lives at:
+
+```
+$SPARK_HOME/conf/spark-defaults.conf
+```
+
+If the file doesn't exist yet, copy the template:
+
+```bash
+cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
+```
+
+### Simple authentication
+
+Add the following to `$SPARK_HOME/conf/spark-defaults.conf`:
+
+```properties
+# Iceberg extensions
+spark.sql.extensions                                    org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+
+# Gravitino IRC catalog
+spark.sql.catalog.gravitino_irc                         org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.gravitino_irc.type                    rest
+spark.sql.catalog.gravitino_irc.uri                     http://<gravitino-host>:9001/iceberg
+
+# S3 FileIO
+spark.sql.catalog.gravitino_irc.io-impl                 org.apache.iceberg.aws.s3.S3FileIO
+spark.sql.catalog.gravitino_irc.s3.region               us-east-1
+spark.sql.catalog.gravitino_irc.s3.access-key-id        <access-key>
+spark.sql.catalog.gravitino_irc.s3.secret-access-key    <secret-key>
+
+# Hadoop S3A (for s3a:// paths)
+spark.hadoop.fs.s3a.impl                                org.apache.hadoop.fs.s3a.S3AFileSystem
+
+# Set as default catalog (optional)
+spark.sql.defaultCatalog                                gravitino_irc
+```
+
+:::note
+`gravitino_irc` is the catalog identifier used within Spark. It maps to the Gravitino IRC endpoint
+via the `uri` property. You may use any identifier you prefer. S3 credentials can alternatively
+be supplied via environment variables (`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`) or an IAM
+instance profile, in which case the explicit credential lines can be omitted.
+:::
+
+### With OAuth2 authentication
+
+If Gravitino is configured with OAuth2, add the auth properties to the same
+`$SPARK_HOME/conf/spark-defaults.conf` file:
+
+```properties
+# Iceberg extensions
+spark.sql.extensions                                    org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+
+# Gravitino IRC catalog
+spark.sql.catalog.gravitino_irc                         org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.gravitino_irc.type                    rest
+spark.sql.catalog.gravitino_irc.uri                     http://<gravitino-host>:9001/iceberg
+
+# OAuth2 authentication
+spark.sql.catalog.gravitino_irc.rest.auth.type          oauth2
+spark.sql.catalog.gravitino_irc.rest.auth.oauth2.token  <your-token>
+
+# S3 FileIO
+spark.sql.catalog.gravitino_irc.io-impl                 org.apache.iceberg.aws.s3.S3FileIO
+spark.sql.catalog.gravitino_irc.s3.region               us-east-1
+spark.sql.catalog.gravitino_irc.s3.access-key-id        <access-key>
+spark.sql.catalog.gravitino_irc.s3.secret-access-key    <secret-key>
+
+# Hadoop S3A (for s3a:// paths)
+spark.hadoop.fs.s3a.impl                                org.apache.hadoop.fs.s3a.S3AFileSystem
+
+# Set as default catalog (optional)
+spark.sql.defaultCatalog                                gravitino_irc
+```
+
+See [How to authenticate](../security/how-to-authenticate.md) for Gravitino authentication
+configuration options.
+
+:::tip Local development
+For local development, [MinIO](https://min.io) can be used as an S3-compatible storage backend.
+Replace the S3 FileIO section with:
+
+```properties
+spark.sql.catalog.gravitino_irc.io-impl                 org.apache.iceberg.aws.s3.S3FileIO
+spark.sql.catalog.gravitino_irc.s3.endpoint             http://<minio-host>:9000
+spark.sql.catalog.gravitino_irc.s3.path-style-access    true
+spark.sql.catalog.gravitino_irc.s3.access-key-id        <minio-access-key>
+spark.sql.catalog.gravitino_irc.s3.secret-access-key    <minio-secret-key>
+spark.hadoop.fs.s3a.impl                                org.apache.hadoop.fs.s3a.S3AFileSystem
+spark.hadoop.fs.s3a.endpoint                            http://<minio-host>:9000
+spark.hadoop.fs.s3a.path.style.access                   true
+spark.hadoop.fs.s3a.connection.ssl.enabled              false
+```
+
+See [gravitino-irc-quickstart](https://github.com/markhoerth/gravitino-irc-quickstart) for a
+complete local development environment using MinIO.
+:::
+
+### With credential vending
+
+If Gravitino is configured with credential vending, add the following to enable it on the client side:
+
+```properties
+spark.sql.catalog.gravitino_irc.header.X-Iceberg-Access-Delegation    vended-credentials
+```
+
+See [Credential vending](../iceberg-rest-service.md#credential-vending) for server-side configuration.
+
+:::note
+For storage not managed by Gravitino, properties are not automatically transferred from the server
+to the client. Pass custom properties to initialize FileIO explicitly:
+
+```properties
+spark.sql.catalog.gravitino_irc.<configuration-key>    <property-value>
+```
+:::
+
+## Starting Spark
+
+Once `spark-defaults.conf` is in place, start your Spark session normally. The Gravitino IRC
+catalog is available immediately without any additional flags.
+
+### spark-shell (Scala)
+
+```bash
+$SPARK_HOME/bin/spark-shell
+```
+
+### spark-sql
+
+```bash
+$SPARK_HOME/bin/spark-sql
+```
+
+### pyspark
+
+```bash
+$SPARK_HOME/bin/pyspark
+```
+
+## Usage examples
+
+### List namespaces
+
+```sql
+SHOW NAMESPACES IN gravitino_irc;
+```
+
+### List tables
+
+```sql
+SHOW TABLES IN gravitino_irc.<namespace>;
+```
+
+### Query a table
+
+```sql
+SELECT * FROM gravitino_irc.<namespace>.<table> LIMIT 10;
+```
+
+### Create a table
+
+```sql
+CREATE TABLE gravitino_irc.<namespace>.new_table (
+  id INT,
+  name STRING,
+  created_at TIMESTAMP
+) USING iceberg;
+```
+
+### Insert data
+
+```sql
+INSERT INTO gravitino_irc.<namespace>.new_table VALUES (1, 'example', current_timestamp());
+```
+
+## Gravitino connector vs Iceberg REST
+
+| Feature | Gravitino Engine Connector | Iceberg REST |
+|:---|:---|:---|
+| Engine plugin required | Yes | No |

Review Comment:
   Could you align the table columns?
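
   If hand-padding is tedious, a small script can do it. The sketch below is only an illustration and not part of the PR: `align_table` is a hypothetical helper, and it assumes a simple pipe table where every row has the same number of cells, no cell contains an escaped `|`, and no data cell consists solely of `-` or `:`.

```python
def align_table(lines):
    """Pad the cells of a Markdown pipe table so the column pipes line up."""
    # Split each row into stripped cells, dropping the outer pipes.
    rows = [[cell.strip() for cell in line.strip().strip("|").split("|")]
            for line in lines]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

    aligned = []
    for row in rows:
        cells = []
        for cell, width in zip(row, widths):
            if cell and set(cell) <= {":", "-"}:
                # Delimiter row: keep the alignment colons, fill with dashes.
                left = ":" if cell.startswith(":") else ""
                right = ":" if len(cell) > 1 and cell.endswith(":") else ""
                cell = left + "-" * (width - len(left) - len(right)) + right
            cells.append(cell.ljust(width))
        aligned.append("| " + " | ".join(cells) + " |")
    return aligned


# The three table rows from the hunk above, without the leading "+".
table = [
    "| Feature | Gravitino Engine Connector | Iceberg REST |",
    "|:---|:---|:---|",
    "| Engine plugin required | Yes | No |",
]
print("\n".join(align_table(table)))
```

   This pads every cell to the width of the longest entry in its column, so the pipes line up in the raw Markdown while the rendered table stays the same.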



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
