PetrasTYR opened a new issue, #1105:
URL: https://github.com/apache/iceberg-python/issues/1105
### Question
Hello, I have a question regarding Iceberg table snapshots. I used pyiceberg
to create a namespace and a table, and then inserted a DataFrame like so:
```
from pyiceberg.catalog import load_rest
from pyiceberg.schema import Schema
from pyiceberg.types import StringType, NestedField, DoubleType
from pyiceberg.partitioning import PartitionSpec, PartitionField
import numpy as np
import pandas as pd
rows = 10**1
ncols = 10
countries = ["US", "CA", "UK"]
attr2 = ["OPEN", "CLOSE", "LOW", "HIGH"]
dates = pd.date_range("2020-01-01", "2020-12-31")
data_orig = pd.DataFrame(
    [
        {
            "countries": countries[i % len(countries)],
            "status": attr2[i % len(attr2)],
            "return_index": np.random.rand(),
        }
        for i in range(rows)
    ]
)
schema = Schema(
    NestedField(field_id=1, name="countries", field_type=StringType(),
                required=False),
    NestedField(field_id=2, name="status", field_type=StringType(),
                required=False),
    NestedField(
        field_id=3, name="return_index", field_type=DoubleType(),
        required=False
    ),
)
partition_spec = PartitionSpec(
    fields=[
        PartitionField(
            source_id=1, field_id=1000, name="countries",
            transform="identity"
        ),
        PartitionField(source_id=2, field_id=1001, name="status",
                       transform="identity"),
    ]
)
catalog = load_rest(
    "rest",
    conf={
        "uri": "http://localhost:19120/iceberg",
    },
)
catalog.create_namespace_if_not_exists("rpmd")
tables = catalog.list_tables(namespace="rpmd")
table = catalog.create_table_if_not_exists(
    identifier="rpmd.performance", schema=schema,
    partition_spec=partition_spec
)
import pyarrow as pa
import pyarrow.parquet as pq
tbl = pa.Table.from_pandas(data_orig)
table.append(tbl)
```
When I run this script a second time, my understanding is that the
`append` method should create a new commit and, in turn, a new snapshot of the
`rpmd.performance` Iceberg table, so I should be able to see a list of snapshots
in the latest metadata.json file. However, I see only the latest snapshot in
the array, and running `table.inspect.snapshots()` shows only one snapshot_id,
even though I can see all the relevant .avro files.
Is there some configuration I need to set to ensure that I can see all
snapshots?
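
For reference, here is a minimal sketch of the check I am describing: reading a metadata.json file and listing what its `snapshots` array contains. The key names (`snapshots`, `snapshot-id`, `parent-snapshot-id`, `current-snapshot-id`, `summary.operation`) are the ones from the Iceberg table spec. The helper names `load_metadata` and `list_snapshots` are hypothetical, and the `example_metadata` dict is hand-written for illustration, not output from my table.

```python
import json
from pathlib import Path


def load_metadata(path: str) -> dict:
    """Read an Iceberg metadata.json file from a local path (hypothetical helper)."""
    return json.loads(Path(path).read_text())


def list_snapshots(metadata: dict) -> list[dict]:
    """Return one summary row per entry in the metadata's "snapshots" array."""
    current_id = metadata.get("current-snapshot-id")
    rows = []
    for snap in metadata.get("snapshots", []):
        rows.append(
            {
                "snapshot_id": snap["snapshot-id"],
                "parent_id": snap.get("parent-snapshot-id"),
                "operation": snap.get("summary", {}).get("operation"),
                "is_current": snap["snapshot-id"] == current_id,
            }
        )
    return rows


# Hand-written illustration of what I would expect after two appends:
# two snapshots, the second pointing at the first as its parent.
example_metadata = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "summary": {"operation": "append"}},
        {
            "snapshot-id": 2,
            "parent-snapshot-id": 1,
            "summary": {"operation": "append"},
        },
    ],
}

for row in list_snapshots(example_metadata):
    print(row)
```

In my case the equivalent listing of the real metadata.json contains a single entry, which is what prompted the question.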