amogh-jahagirdar commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1502013528


##########
tests/catalog/test_hive.py:
##########
@@ -277,7 +277,7 @@ def test_create_table(table_schema_simple: Schema, hive_database: HiveDatabase,
             )
         ],
         current_schema_id=0,
-        last_partition_id=1000,
+        last_partition_id=999,

Review Comment:
   @Fokko I took a look and integrated the changes, and after going back and forth I'd like to keep the changes as is.
   My rationale is that even after those changes, there are more workarounds needed to make sure new IDs start at 1000 (taking the changes directly, field IDs for non-REST catalogs would start at 1001 without further changes).
   
   Now, technically it does not seem to be a hard requirement that IDs start at 1000 for v2 tables. Even for v1, starting at 1000 does not seem to be a requirement, just how the Java library was implemented.
   
   ```
   In v1, partition field IDs were not tracked, but were assigned sequentially 
starting at 1000 in the reference implementation.
   ```
   I read this as "this was how we originally implemented it in Java, but it's not really required so long as IDs are unique and increasing for new fields".
   
   All that said, I'd advocate for just following the practice of starting at 
1000 for both v1 and v2.
   
   On returning 999 for the unpartitioned spec:
   
   As @HonahX alluded to, I think that makes for a logical API: considering we want to start field IDs at 1000, the last assigned field ID of the unpartitioned spec (no fields) should be one less than that. I don't think we want to return 1000 in that case. This is also what Spark sets for the unpartitioned spec, and it is spec compliant (since the spec doesn't mandate any particular IDs).
   
   I could see the argument for returning None from last_assigned_partition_id for this API, but that just shifts the responsibility to other places to make sure IDs are set correctly according to our scheme, which ends up being messier imo compared to just defining the API to return 999 when the only spec is the unpartitioned spec (which is the value that would be set anyway).
   
   I also think it makes sense to set last-assigned-partition-id for both v1 and v2. Even though it's optional for v1, we already set the other optional fields for v1 metadata, so it seems odd to make this particular case an exception.
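   To make the scheme concrete, here's a rough sketch of the intended behavior (the `FakeSpec`/`FakeField` classes are illustrative stand-ins, not the actual PyIceberg types):

   ```python
   from dataclasses import dataclass
   from typing import Tuple

   PARTITION_FIELD_ID_START = 1000  # convention from the Java reference implementation


   @dataclass(frozen=True)
   class FakeField:
       field_id: int


   @dataclass(frozen=True)
   class FakeSpec:
       fields: Tuple[FakeField, ...] = ()


   def last_assigned_partition_id(specs) -> int:
       # Highest field ID across all specs; 999 (= 1000 - 1) when only the
       # unpartitioned (empty) spec exists, so the next assigned ID is 1000.
       return max(
           (f.field_id for spec in specs for f in spec.fields),
           default=PARTITION_FIELD_ID_START - 1,
       )


   print(last_assigned_partition_id([FakeSpec()]))  # 999
   print(last_assigned_partition_id([FakeSpec(fields=(FakeField(1000), FakeField(1001)))]))  # 1001
   ```

   Returning `PARTITION_FIELD_ID_START - 1` as the default is exactly the "999 for the unpartitioned spec" behavior discussed above: the next field always lands at 1000 without any special-casing at the call sites.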



##########
tests/integration/test_partition_evolution.py:
##########
@@ -0,0 +1,490 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint:disable=redefined-outer-name
+
+import pytest
+
+from pyiceberg.catalog import Catalog, load_catalog
+from pyiceberg.exceptions import NoSuchTableError
+from pyiceberg.partitioning import PartitionField, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.table import Table
+from pyiceberg.transforms import (
+    BucketTransform,
+    DayTransform,
+    HourTransform,
+    IdentityTransform,
+    MonthTransform,
+    TruncateTransform,
+    VoidTransform,
+    YearTransform,
+)
+from pyiceberg.types import (
+    LongType,
+    NestedField,
+    StringType,
+    TimestampType,
+)
+
+
+@pytest.fixture()
+def catalog_rest() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "rest",
+            "uri": "http://localhost:8181";,
+            "s3.endpoint": "http://localhost:9000";,
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+@pytest.fixture()
+def catalog_hive() -> Catalog:
+    return load_catalog(
+        "local",
+        **{
+            "type": "hive",
+            "uri": "http://localhost:9083";,
+            "s3.endpoint": "http://localhost:9000";,
+            "s3.access-key-id": "admin",
+            "s3.secret-access-key": "password",
+        },
+    )
+
+
+def _simple_table(catalog: Catalog, table_schema_simple: Schema) -> Table:
+    return _create_table_with_schema(catalog, table_schema_simple, "1")
+
+
+def _table(catalog: Catalog) -> Table:
+    schema_with_timestamp = Schema(
+        NestedField(1, "id", LongType(), required=False),
+        NestedField(2, "event_ts", TimestampType(), required=False),
+        NestedField(3, "str", StringType(), required=False),
+    )
+    return _create_table_with_schema(catalog, schema_with_timestamp, "1")
+
+
+def _table_v2(catalog: Catalog) -> Table:
+    schema_with_timestamp = Schema(
+        NestedField(1, "id", LongType(), required=False),
+        NestedField(2, "event_ts", TimestampType(), required=False),
+        NestedField(3, "str", StringType(), required=False),
+    )
+    return _create_table_with_schema(catalog, schema_with_timestamp, "2")
+
+
+def _create_table_with_schema(catalog: Catalog, schema: Schema, format_version: str) -> Table:
+    tbl_name = "default.test_schema_evolution"
+    try:
+        catalog.drop_table(tbl_name)
+    except NoSuchTableError:
+        pass
+    return catalog.create_table(identifier=tbl_name, schema=schema, properties={"format-version": format_version})
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_identity_partition(catalog: Catalog, table_schema_simple: Schema) -> None:
+    simple_table = _simple_table(catalog, table_schema_simple)
+    simple_table.update_spec().add_identity("foo").commit()
+    specs = simple_table.specs()
+    assert len(specs) == 2
+    spec = simple_table.spec()
+    assert spec.spec_id == 1
+    assert spec.last_assigned_field_id == 1000
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_year(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_field("event_ts", YearTransform(), "year_transform").commit()
+    _validate_new_partition_fields(table, 1000, 1, 1000, PartitionField(2, 1000, YearTransform(), "year_transform"))
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_month(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_field("event_ts", MonthTransform(), "month_transform").commit()
+    _validate_new_partition_fields(table, 1000, 1, 1000, PartitionField(2, 1000, MonthTransform(), "month_transform"))
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_day(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_field("event_ts", DayTransform(), "day_transform").commit()
+    _validate_new_partition_fields(table, 1000, 1, 1000, PartitionField(2, 1000, DayTransform(), "day_transform"))
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_hour(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_field("event_ts", HourTransform(), "hour_transform").commit()
+    _validate_new_partition_fields(table, 1000, 1, 1000, PartitionField(2, 1000, HourTransform(), "hour_transform"))
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_bucket(catalog: Catalog, table_schema_simple: Schema) -> None:
+    simple_table = _create_table_with_schema(catalog, table_schema_simple, "1")
+    simple_table.update_spec().add_field("foo", BucketTransform(12), "bucket_transform").commit()
+    _validate_new_partition_fields(simple_table, 1000, 1, 1000, PartitionField(1, 1000, BucketTransform(12), "bucket_transform"))
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_truncate(catalog: Catalog, table_schema_simple: Schema) -> None:
+    simple_table = _create_table_with_schema(catalog, table_schema_simple, "1")
+    simple_table.update_spec().add_field("foo", TruncateTransform(1), "truncate_transform").commit()
+    _validate_new_partition_fields(
+        simple_table, 1000, 1, 1000, PartitionField(1, 1000, TruncateTransform(1), "truncate_transform")
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_multiple_adds(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_identity("id").add_field("event_ts", HourTransform(), "hourly_partitioned").add_field(
+        "str", TruncateTransform(2), "truncate_str"
+    ).commit()
+    _validate_new_partition_fields(
+        table,
+        1002,
+        1,
+        1002,
+        PartitionField(1, 1000, IdentityTransform(), "id"),
+        PartitionField(2, 1001, HourTransform(), "hourly_partitioned"),
+        PartitionField(3, 1002, TruncateTransform(2), "truncate_str"),
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_hour_to_day(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_field("event_ts", DayTransform(), "daily_partitioned").commit()
+    table.update_spec().add_field("event_ts", HourTransform(), "hourly_partitioned").commit()
+    _validate_new_partition_fields(
+        table,
+        1001,
+        2,
+        1001,
+        PartitionField(2, 1000, DayTransform(), "daily_partitioned"),
+        PartitionField(2, 1001, HourTransform(), "hourly_partitioned"),
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_add_multiple_buckets(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_field("id", BucketTransform(16)).add_field("id", BucketTransform(4)).commit()
+    _validate_new_partition_fields(
+        table,
+        1001,
+        1,
+        1001,
+        PartitionField(1, 1000, BucketTransform(16), "id_bucket_16"),
+        PartitionField(1, 1001, BucketTransform(4), "id_bucket_4"),
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_remove_identity(catalog: Catalog) -> None:
+    table = _table(catalog)
+    table.update_spec().add_identity("id").commit()
+    table.update_spec().remove_field("id").commit()
+    assert len(table.specs()) == 3
+    assert table.spec().spec_id == 2
+    assert table.spec() == PartitionSpec(
+        PartitionField(source_id=1, field_id=1000, transform=VoidTransform(), name='id'), spec_id=2
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_remove_identity_v2(catalog: Catalog) -> None:
+    table_v2 = _table_v2(catalog)
+    table_v2.update_spec().add_identity("id").commit()
+    table_v2.update_spec().remove_field("id").commit()
+    assert len(table_v2.specs()) == 2
+    assert table_v2.spec().spec_id == 0
+    assert table_v2.spec() == PartitionSpec(spec_id=0)
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_remove_bucket(catalog: Catalog) -> None:
+    table = _table(catalog)
+    with table.update_spec() as update:
+        update.add_field("id", BucketTransform(16), "bucketed_id")
+        update.add_field("event_ts", DayTransform(), "day_ts")
+    with table.update_spec() as remove:
+        remove.remove_field("bucketed_id")
+
+    assert len(table.specs()) == 3
+    _validate_new_partition_fields(
+        table,
+        1001,
+        2,
+        1001,
+        PartitionField(source_id=1, field_id=1000, transform=VoidTransform(), name='bucketed_id'),
+        PartitionField(source_id=2, field_id=1001, transform=DayTransform(), name='day_ts'),
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_remove_bucket_v2(catalog: Catalog) -> None:
+    table_v2 = _table_v2(catalog)
+    with table_v2.update_spec() as update:
+        update.add_field("id", BucketTransform(16), "bucketed_id")
+        update.add_field("event_ts", DayTransform(), "day_ts")
+    with table_v2.update_spec() as remove:
+        remove.remove_field("bucketed_id")
+    assert len(table_v2.specs()) == 3
+    _validate_new_partition_fields(
+        table_v2, 1001, 2, 1001, PartitionField(source_id=2, field_id=1001, transform=DayTransform(), name='day_ts')
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_remove_day(catalog: Catalog) -> None:
+    table = _table(catalog)
+    with table.update_spec() as update:
+        update.add_field("id", BucketTransform(16), "bucketed_id")
+        update.add_field("event_ts", DayTransform(), "day_ts")
+    with table.update_spec() as remove:
+        remove.remove_field("day_ts")
+
+    assert len(table.specs()) == 3
+    _validate_new_partition_fields(
+        table,
+        1001,
+        2,
+        1001,
+        PartitionField(source_id=1, field_id=1000, transform=BucketTransform(16), name='bucketed_id'),
+        PartitionField(source_id=2, field_id=1001, transform=VoidTransform(), name='day_ts'),
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
+def test_remove_day_v2(catalog: Catalog) -> None:
+    table_v2 = _table_v2(catalog)
+    with table_v2.update_spec() as update:
+        update.add_field("id", BucketTransform(16), "bucketed_id")
+        update.add_field("event_ts", DayTransform(), "day_ts")
+    with table_v2.update_spec() as remove:
+        remove.remove_field("day_ts")
+    assert len(table_v2.specs()) == 3
+    _validate_new_partition_fields(
+        table_v2, 1000, 2, 1001, PartitionField(source_id=1, field_id=1000, transform=BucketTransform(16), name='bucketed_id')
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])

Review Comment:
   I've parameterized all the tests to run against both the Hive and REST catalogs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

