kevinjqliu commented on code in PR #11845:
URL: https://github.com/apache/iceberg/pull/11845#discussion_r1895040536
##########
docs/docs/spark-getting-started.md:
##########

Review Comment:
   Note: there are two "getting started" docs, this one and `site/docs/spark-quickstart.md`.

##########
site/docs/spark-quickstart.md:
##########

@@ -26,7 +26,11 @@ highlight some powerful features. You can learn more about Iceberg's Spark runti

 - [Writing Data to a Table](#writing-data-to-a-table)
 - [Reading Data from a Table](#reading-data-from-a-table)
 - [Adding A Catalog](#adding-a-catalog)
-- [Next Steps](#next-steps)
+  - [Configuring JDBC Catalog](#configuring-jdbc-catalog)
+  - [Configuring REST Catalog](#configuring-rest-catalog)
+- [Next steps](#next-steps)
+  - [Adding Iceberg to Spark](#adding-iceberg-to-spark)
+  - [Learn More](#learn-more)

Review Comment:
   This renders the subsections correctly.

##########
site/docs/spark-quickstart.md:
##########

@@ -269,42 +273,104 @@ To read a table, simply use the Iceberg table's name.

 ### Adding A Catalog

-Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
-Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In this guide,
-we use JDBC, but you can follow these instructions to configure other catalog types. To learn more, check out

Review Comment:
   Weird that the guide already mentions JDBC here, while the example still uses the Hadoop catalog.

##########
site/docs/spark-quickstart.md:
##########

@@ -267,44 +271,109 @@ To read a table, simply use the Iceberg table's name.

 df = spark.table("demo.nyc.taxis").show()
 ```

-### Adding A Catalog
+### Adding catalogs

-Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
-Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In this guide,
-we use JDBC, but you can follow these instructions to configure other catalog types. To learn more, check out
-the [Catalog](docs/latest/spark-configuration.md#catalogs) page in the Spark section.
+Apache Iceberg provides several catalog implementations to manage tables and enable SQL operations.
+Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`.
+You can configure different catalog types, such as JDBC, Hive Metastore, Glue, and REST, to manage Iceberg tables in Spark.

-This configuration creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
+This guide covers the configuration of two popular catalog types:
+
+* JDBC Catalog
+* REST Catalog
+
+To learn more, check out the [Catalog](docs/latest/spark-configuration.md#catalogs) page in the Spark section.
+
+#### Configuring JDBC Catalog
+
+The JDBC catalog stores Iceberg table metadata in a relational database.
+
+This configuration creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
+
+The JDBC catalog uses a file-based SQLite database as the backend.

 === "CLI"

     ```sh
-    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
+    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \
         --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
         --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
         --conf spark.sql.catalog.spark_catalog.type=hive \
         --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
-        --conf spark.sql.catalog.local.type=hadoop \
+        --conf spark.sql.catalog.local.type=jdbc \
+        --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
         --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
         --conf spark.sql.defaultCatalog=local
     ```

 === "spark-defaults.conf"

     ```sh
-    spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
+    spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3
     spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
     spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
     spark.sql.catalog.spark_catalog.type hive
     spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
-    spark.sql.catalog.local.type hadoop
-    spark.sql.catalog.local.warehouse $PWD/warehouse

Review Comment:
   `$PWD` does not expand in `spark-defaults.conf`; keeping it here will create a literal folder named `$PWD`.
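   To make that concrete, here is a minimal sketch of what the replacement `spark-defaults.conf` entries could look like with the path spelled out literally; `/home/me/iceberg` is a hypothetical directory, not a value from this PR:

   ```sh
   # spark-defaults.conf is read as plain key/value pairs, not by a shell,
   # so $PWD would stay a literal string. Use an absolute path instead.
   spark.sql.catalog.local.uri        jdbc:sqlite:/home/me/iceberg/iceberg_catalog_db.sqlite
   spark.sql.catalog.local.warehouse  /home/me/iceberg/warehouse
   ```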
=== "CLI" ```sh - spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\ + spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ --conf spark.sql.catalog.spark_catalog.type=hive \ --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ - --conf spark.sql.catalog.local.type=hadoop \ + --conf spark.sql.catalog.local.type=jdbc \ + --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \ --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \ --conf spark.sql.defaultCatalog=local ``` === "spark-defaults.conf" ```sh - spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }} + spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog spark.sql.catalog.spark_catalog.type hive spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog - spark.sql.catalog.local.type hadoop - spark.sql.catalog.local.warehouse $PWD/warehouse Review Comment: `$PWD` does not expand in `spark-defaults.conf`. keeping this here will create a folder named `$PWD` ########## docs/docs/spark-getting-started.md: ########## @@ -41,20 +41,26 @@ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceb ### Adding catalogs -Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. +Apache Iceberg provides several [catalog](spark-configuration.md#catalogs) implementations to manage tables and enable SQL operations. +Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. -This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog: +This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog. + +The JDBC catalog uses file-based SQLite database as the backend. ```sh -spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\ +spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ --conf spark.sql.catalog.spark_catalog.type=hive \ --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ - --conf spark.sql.catalog.local.type=hadoop \ + --conf spark.sql.catalog.local.type=jdbc \ + --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \ --conf spark.sql.catalog.local.warehouse=$PWD/warehouse ``` +For example configuring a REST-based catalog, see [Configuring REST Catalog](/spark-quickstart#configuring-rest-catalog) Review Comment: instead of repeating here for configuring REST catalog, just link to `site/docs/spark-quickstart.md`. 
##########
docs/docs/spark-getting-started.md:
##########

@@ -41,20 +41,27 @@ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceb

 ### Adding catalogs

-Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`.
+Apache Iceberg provides several [catalog](spark-configuration.md#catalogs) implementations to manage tables and enable SQL operations.
+Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`.

-This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:
+This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
+
+The JDBC catalog uses a file-based SQLite database as the backend.

 ```sh
-spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
+spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \
     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
     --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
     --conf spark.sql.catalog.spark_catalog.type=hive \
     --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
-    --conf spark.sql.catalog.local.type=hadoop \
-    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
+    --conf spark.sql.catalog.local.type=jdbc \
+    --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
+    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
+    --conf spark.sql.defaultCatalog=local

Review Comment:
   Add `spark.sql.defaultCatalog` to match the other pages.
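   To illustrate what `spark.sql.defaultCatalog` changes, assuming the configuration above is in `spark-defaults.conf`, unqualified table names resolve against the `local` catalog; `nyc.taxis` is the quickstart's running example table:

   ```sh
   # With spark.sql.defaultCatalog=local, both queries resolve to the same table.
   spark-sql -e "SELECT COUNT(*) FROM nyc.taxis"        # implicitly local.nyc.taxis
   spark-sql -e "SELECT COUNT(*) FROM local.nyc.taxis"  # explicit catalog prefix
   ```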