[I] Iceberg table not keeping track of snapshots [iceberg-python]

via GitHub Mon, 26 Aug 2024 19:05:52 -0700


PetrasTYR opened a new issue, #1105:
URL: https://github.com/apache/iceberg-python/issues/1105


   ### Question
   
   Hello, I have a question regarding Iceberg table snapshots. I used PyIceberg 
to create a namespace and a table, then inserted a dataframe like so:
   ```
   from pyiceberg.catalog import load_rest
   from pyiceberg.schema import Schema
   from pyiceberg.types import StringType, NestedField, DoubleType
   from pyiceberg.partitioning import PartitionSpec, PartitionField
   import numpy as np
   import pandas as pd
   
   rows = 10**1
   ncols = 10
   countries = ["US", "CA", "UK"]
   attr2 = ["OPEN", "CLOSE", "LOW", "HIGH"]
   dates = pd.date_range("2020-01-01", "2020-12-31")
   data_orig = pd.DataFrame(
       [
           {
               "countries": countries[i % len(countries)],
               "status": attr2[i % len(attr2)],
               "return_index": np.random.rand(),
           }
           for i in range(rows)
       ]
   )
   
   schema = Schema(
       NestedField(field_id=1, name="countries", field_type=StringType(), required=False),
       NestedField(field_id=2, name="status", field_type=StringType(), required=False),
       NestedField(
           field_id=3, name="return_index", field_type=DoubleType(), required=False
       ),
   )
   partition_spec = PartitionSpec(
       fields=[
           PartitionField(
               source_id=1, field_id=1000, name="countries", transform="identity"
           ),
           PartitionField(source_id=2, field_id=1001, name="status", transform="identity"),
       ]
   )
   catalog = load_rest(
       "rest",
       conf={
           "uri": "http://localhost:19120/iceberg",
       },
   )
   catalog.create_namespace_if_not_exists("rpmd")
   tables = catalog.list_tables(namespace="rpmd")
   
   table = catalog.create_table_if_not_exists(
       identifier="rpmd.performance", schema=schema, partition_spec=partition_spec
   )
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   tbl = pa.Table.from_pandas(data_orig)
   table.append(tbl)
   
   ```
   When I run this script a second time, my understanding is that the 
`append` method creates a new commit and, in turn, a new snapshot of the 
`rpmd.performance` Iceberg table, so I should be able to see a list of snapshots 
in the latest metadata.json file. However, I only see the latest snapshot in 
the array, and running `table.inspect.snapshots()` shows only 1 snapshot_id, 
even though I see all the relevant .avro files.
   
   May I know if there is some configuration I need to set to ensure that I can 
see all snapshots?
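
For reference, a minimal sketch of how the check against metadata.json described above could be scripted with only the standard library. It assumes the file follows the Iceberg table metadata layout (a top-level `snapshots` array plus `current-snapshot-id`). The helper names `load_metadata` and `summarize_snapshots` and the sample values are made up for illustration and are not part of PyIceberg.

```python
import json
from pathlib import Path


def load_metadata(path: str) -> dict:
    """Read a table metadata.json file from local disk (hypothetical helper)."""
    return json.loads(Path(path).read_text())


def summarize_snapshots(metadata: dict) -> list[dict]:
    """Return one summary row per entry in the metadata's `snapshots` array."""
    current_id = metadata.get("current-snapshot-id")
    rows = []
    for snap in metadata.get("snapshots", []):
        rows.append(
            {
                "snapshot_id": snap["snapshot-id"],
                # The first snapshot of a table has no parent.
                "parent_id": snap.get("parent-snapshot-id"),
                "operation": snap.get("summary", {}).get("operation"),
                "is_current": snap["snapshot-id"] == current_id,
            }
        )
    return rows


# Made-up metadata standing in for the result of two append commits.
example = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "summary": {"operation": "append"}},
        {"snapshot-id": 2, "parent-snapshot-id": 1, "summary": {"operation": "append"}},
    ],
}

for row in summarize_snapshots(example):
    print(row)
```

If the second run behaved as expected, the real file would yield two rows like the example; in the situation described here it yields a single row.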


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

