[I] Inconsistent initial field ids with REST catalog [iceberg]

via GitHub Wed, 03 Apr 2024 14:38:48 -0700


mrcnc opened a new issue, #10084:
URL: https://github.com/apache/iceberg/issues/10084


   ### Query engine
   
   Using Spark 3.4.0 with Iceberg 1.4.3
   
   ### Question
   
   To reproduce this behavior, start a Spark shell configured with two catalogs:
   
   ```
   SPARK_VERSION=3.4
   ICEBERG_VERSION=1.4.3
   
   $SPARK_HOME/bin/spark-shell \
     --packages="org.apache.iceberg:iceberg-spark-runtime-${SPARK_VERSION}_2.12:${ICEBERG_VERSION}" \
     --conf spark.driver.host=127.0.0.1 \
     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
     --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
     --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.rest.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
     --conf spark.sql.catalog.rest.uri=http://127.0.0.1:8080/catalog/ \
     --conf spark.sql.catalog.rest.warehouse=/tmp/warehouse/rest \
     --conf spark.sql.defaultCatalog=rest \
     --conf spark.sql.catalog.hadoop=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.hadoop.type=hadoop \
     --conf spark.sql.catalog.hadoop.warehouse=/tmp/warehouse/hadoop \
     --conf spark.sql.warehouse.dir=/tmp/warehouse/
   ```
   
   First, create a table in the rest catalog:
   
   ```
   spark.sql("CREATE TABLE rest.test.table1(id bigint, data string)").show(false)
   ```
   
   The request body sent to the createTable endpoint looks like this:
   
   ```json
   {
     "name": "table1",
     "location": null,
     "schema": {
       "type": "struct",
       "schema-id": 0,
       "fields": [
         {
           "id": 0,
           "name": "id",
           "required": false,
           "type": "long"
         },
         {
           "id": 1,
           "name": "data",
           "required": false,
           "type": "string"
         }
       ]
     },
     "partition-spec": {
       "spec-id": 0,
       "fields": []
     },
     "write-order": null,
     "properties": {
       "owner": "your.name"
     },
     "stage-create": false
   }
   ```
   
   And if you create the same table in the hadoop catalog with
   
   ```
   spark.sql("CREATE TABLE hadoop.test.table1(id bigint, data string)").show(false)
   ```
   
   it writes the metadata file `/tmp/warehouse/hadoop/test/table1/metadata/v1.metadata.json`
   with contents like this:
   
   ```json
   {
     "format-version": 2,
     "table-uuid": "d1768dd2-cacd-45c0-b6ae-2481292e7682",
     "location": "/tmp/warehouse/hadoop/test/table1",
     "last-sequence-number": 0,
     "last-updated-ms": 1712168876272,
     "last-column-id": 2,
     "current-schema-id": 0,
     "schemas": [
       {
         "type": "struct",
         "schema-id": 0,
         "fields": [
           {
             "id": 1,
             "name": "id",
             "required": false,
             "type": "long"
           },
           {
             "id": 2,
             "name": "data",
             "required": false,
             "type": "string"
           }
         ]
       }
     ],
     "default-spec-id": 0,
     "partition-specs": [
       {
         "spec-id": 0,
         "fields": []
       }
     ],
     "last-partition-id": 999,
     "default-sort-order-id": 0,
     "sort-orders": [
       {
         "order-id": 0,
         "fields": []
       }
     ],
     "properties": {
       "owner": "your.name",
       "write.parquet.compression-codec": "zstd"
     },
     "current-snapshot-id": -1,
     "refs": {},
     "snapshots": [],
     "statistics": [],
     "snapshot-log": [],
     "metadata-log": []
   }
   ```
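
To read the ids out of a metadata file like the one above without scanning it by eye, a minimal Python sketch follows. It assumes only the keys visible in that file (`current-schema-id`, `schemas`, `fields`, `last-column-id`), inlines a trimmed copy of the JSON instead of reading it from disk, and looks at top-level columns only. The helper name `current_schema_field_ids` is made up for this example and is not part of Iceberg.

```python
import json

# Trimmed copy of the v1.metadata.json shown above; only the keys used here are kept.
METADATA_JSON = """
{
  "format-version": 2,
  "last-column-id": 2,
  "current-schema-id": 0,
  "schemas": [
    {
      "type": "struct",
      "schema-id": 0,
      "fields": [
        {"id": 1, "name": "id", "required": false, "type": "long"},
        {"id": 2, "name": "data", "required": false, "type": "string"}
      ]
    }
  ]
}
"""


def current_schema_field_ids(metadata):
    """Return {column name: field id} for the schema named by current-schema-id."""
    current_id = metadata["current-schema-id"]
    for schema in metadata["schemas"]:
        if schema["schema-id"] == current_id:
            # Top-level fields only; nested struct/list/map fields are not walked.
            return {field["name"]: field["id"] for field in schema["fields"]}
    raise ValueError(f"no schema with schema-id {current_id}")


metadata = json.loads(METADATA_JSON)
field_ids = current_schema_field_ids(metadata)
highest_id_matches = max(field_ids.values()) == metadata["last-column-id"]

print(field_ids)
print(highest_id_matches)
```

For the real file, the inline string could be replaced with the contents of `/tmp/warehouse/hadoop/test/table1/metadata/v1.metadata.json`.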
   
   The field ids in the schema differ between the two catalogs: the REST
   createTable request starts at 0, while the hadoop metadata file starts at 1.
   It seems odd that the initial ids would differ across catalogs. Is this
   expected behavior?
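
The difference can be made explicit with a small Python sketch that compares the two schemas column by column. It inlines trimmed copies of the `schema` object from the REST request body and of the schema from the hadoop metadata file, both taken from the JSON above. The helper names `top_level_ids` and `id_offsets` are hypothetical, and the sketch only describes the observed offset; it does not say which side is correct or why they differ.

```python
import json

# "schema" object from the REST createTable request body shown above.
REST_REQUEST_SCHEMA = """
{
  "type": "struct",
  "schema-id": 0,
  "fields": [
    {"id": 0, "name": "id", "required": false, "type": "long"},
    {"id": 1, "name": "data", "required": false, "type": "string"}
  ]
}
"""

# Schema from the hadoop catalog's v1.metadata.json shown above.
HADOOP_METADATA_SCHEMA = """
{
  "type": "struct",
  "schema-id": 0,
  "fields": [
    {"id": 1, "name": "id", "required": false, "type": "long"},
    {"id": 2, "name": "data", "required": false, "type": "string"}
  ]
}
"""


def top_level_ids(schema):
    """Map each top-level column name to its field id."""
    return {field["name"]: field["id"] for field in schema["fields"]}


def id_offsets(left, right):
    """For columns present in both mappings, return right id minus left id."""
    return {name: right[name] - left[name] for name in left if name in right}


rest_ids = top_level_ids(json.loads(REST_REQUEST_SCHEMA))
hadoop_ids = top_level_ids(json.loads(HADOOP_METADATA_SCHEMA))
offsets = id_offsets(rest_ids, hadoop_ids)

for name, offset in offsets.items():
    print(f"{name}: rest={rest_ids[name]} hadoop={hadoop_ids[name]} offset={offset}")
```

Every column in this example is shifted by the same amount, so the column order and types match and only the starting id differs.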
   


