gaborkaszab commented on code in PR #14440:
URL: https://github.com/apache/iceberg/pull/14440#discussion_r2571733657
##########
core/src/main/java/org/apache/iceberg/CatalogProperties.java:
##########
@@ -48,6 +48,27 @@ private CatalogProperties() {}
public static final boolean CACHE_CASE_SENSITIVE_DEFAULT = true;
+ /**
+ * Controls policy of caffeine cache
+ *
+ * <p>Behavior of specific values of cache.strategy:
+ *
+ * <ul>
+ * <li>EXPIRE_AFTER_ACCESS - cache entries are never evicted as long as they are being accessed
+ * frequently
Review Comment:
This definition seems somewhat vague; "frequently" is not very specific. Why not flip the definition and say `cache entry is removed from the cache after a specified time has passed since its last access or write operation`?
##########
core/src/main/java/org/apache/iceberg/CatalogProperties.java:
##########
@@ -48,6 +48,27 @@ private CatalogProperties() {}
public static final boolean CACHE_CASE_SENSITIVE_DEFAULT = true;
+ /**
+ * Controls policy of caffeine cache
Review Comment:
I don't think we should expose which cache implementation we use internally. Instead of `caffeine cache` we could say `table cache`.
##########
core/src/main/java/org/apache/iceberg/CatalogProperties.java:
##########
@@ -48,6 +48,27 @@ private CatalogProperties() {}
public static final boolean CACHE_CASE_SENSITIVE_DEFAULT = true;
+ /**
+ * Controls policy of caffeine cache
+ *
+ * <p>Behavior of specific values of cache.strategy:
+ *
+ * <ul>
+ * <li>EXPIRE_AFTER_ACCESS - cache entries are never evicted as long as they are being accessed
+ * frequently
+ * <li>EXPIRE_AFTER_WRITE - cache entries are evicted frequently after cache write
+ * </ul>
+ */
+ public static final String CACHE_EXPIRATION_POLICY = "cache.expiration-policy";
+
+ public enum CacheExpirationPolicy {
+ EXPIRE_AFTER_ACCESS,
+ EXPIRE_AFTER_WRITE
+ }
+
+ public static final CacheExpirationPolicy CACHE_EXPIRATION_POLICY_DEFAULT =
+ CacheExpirationPolicy.EXPIRE_AFTER_ACCESS;
+
Review Comment:
The comment for `CACHE_EXPIRATION_INTERVAL_MS` seems off now: it mentions expiration after access, but with this patch that is just one of the use cases.
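For example, that Javadoc could be generalized along these lines (wording suggestion only; the existing notes about `0` and negative values would stay):
```java
/**
 * Controls how long catalog entries are locally cached, in milliseconds.
 *
 * <p>Depending on {@link #CACHE_EXPIRATION_POLICY}, the interval is measured from an entry's
 * last access or from the time the entry was written.
 */
public static final String CACHE_EXPIRATION_INTERVAL_MS = "cache.expiration-interval-ms";
```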
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -79,6 +80,27 @@ private SparkSQLProperties() {}
public static final String EXECUTOR_CACHE_ENABLED =
"spark.sql.iceberg.executor-cache.enabled";
public static final boolean EXECUTOR_CACHE_ENABLED_DEFAULT = true;
+ public static final String EXECUTOR_CACHE_EXPIRATION_POLICY =
+ "spark.sql.iceberg.executor-cache.expiration-policy";
+
+ public enum ExecutorCacheExpirationPolicy {
+ EXPIRE_AFTER_ACCESS,
+ EXPIRE_AFTER_WRITE;
+
+ public static ExecutorCacheExpirationPolicy fromName(String name) {
+ Preconditions.checkArgument(
+ name != null, "ExecutorCacheExpirationPolicy reader type is null");
+ try {
+ return ExecutorCacheExpirationPolicy.valueOf(name.toUpperCase());
+ } catch (IllegalArgumentException e) {
+ throw new IllegalArgumentException("Unknown parquet reader type: " + name);
Review Comment:
Parquet reader type?
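Looks like a copy-paste leftover from the Parquet reader type parsing; the message should reference this property instead. A rough sketch of what I mean (names are only illustrative):
```java
public static ExecutorCacheExpirationPolicy fromName(String name) {
  Preconditions.checkArgument(name != null, "Expiration policy name is null");
  try {
    return ExecutorCacheExpirationPolicy.valueOf(name.toUpperCase(Locale.ROOT));
  } catch (IllegalArgumentException e) {
    // refer to the actual setting in the error message, not the Parquet reader type
    throw new IllegalArgumentException("Unknown executor cache expiration policy: " + name, e);
  }
}
```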
##########
docs/docs/spark-configuration.md:
##########
@@ -191,6 +191,7 @@ val spark = SparkSession.builder()
| spark.sql.iceberg.locality.enabled | false | Report locality information for Spark task placement on executors |
| spark.sql.iceberg.executor-cache.enabled | true | Enables cache for executor-side (currently used to cache Delete Files) |
| spark.sql.iceberg.executor-cache.timeout | 10 | Timeout in minutes for executor cache entries |
+| spark.sql.iceberg.executor-cache.expiration-policy | EXPIRE_AFTER_ACCESS | Policy for expiring executor cache entries: `EXPIRE_AFTER_ACCESS` (default, expires after last access) or `EXPIRE_AFTER_WRITE` (expires after write, regardless of access). | 10 | Timeout in minutes for executor cache entries |
Review Comment:
I'm not a huge expert in the Spark part of the project, but to me it seems that the Spark executor cache is a completely different thing. This new property is meant to configure CachingCatalog instead, so this doc entry seems off to me.
Update: Ahh, I think I get it. You're not just adding this policy to CachingCatalog, but to some other caches within the project as well. I had the impression so far that this PR concentrates only on CachingCatalog. I believe we should keep the scope as narrow as possible. My recommendation is to tackle only one task at a time and start e.g. with CachingCatalog; a separate PR could then cover the Spark executor cache. The motivation for introducing such a policy could differ based on the different purposes of these caches.
##########
core/src/main/java/org/apache/iceberg/util/PropertyUtil.java:
##########
@@ -127,6 +129,21 @@ public static String propertyAsString(
return defaultValue;
}
+ public static <E extends Enum<E>> E propertyAsEnum(
Review Comment:
I don't think you need this. Check how SnapshotMode is initialized in
RESTSessionCatalog:
```java
this.snapshotMode =
    SnapshotMode.valueOf(
        PropertyUtil.propertyAsString(
                mergedProps,
                RESTCatalogProperties.SNAPSHOT_LOADING_MODE,
                RESTCatalogProperties.SNAPSHOT_LOADING_MODE_DEFAULT)
            .toUpperCase(Locale.US));
```
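Applied to the new property, the same pattern would look roughly like this (just a sketch, assuming the constants added in this PR and a `properties` map in scope):
```java
CacheExpirationPolicy expirationPolicy =
    CacheExpirationPolicy.valueOf(
        PropertyUtil.propertyAsString(
                properties,
                CatalogProperties.CACHE_EXPIRATION_POLICY,
                CatalogProperties.CACHE_EXPIRATION_POLICY_DEFAULT.name())
            .toUpperCase(Locale.US));
```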
##########
docs/docs/configuration.md:
##########
@@ -128,15 +128,16 @@ The value of these properties are not persisted as a part of the table metadata.
Iceberg catalogs support using catalog properties to configure catalog behaviors. Here is a list of commonly used catalog properties:
-| Property | Default | Description |
-| --------------------------------- | ------------------ | ------------------------------------------------------ |
-| catalog-impl | null | a custom `Catalog` implementation to use by an engine |
-| io-impl | null | a custom `FileIO` implementation to use in a catalog |
-| warehouse | null | the root path of the data warehouse |
-| uri | null | a URI string, such as Hive metastore URI |
-| clients | 2 | client pool size |
-| cache-enabled | true | Whether to cache catalog entries |
-| cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration |
+| Property | Default | Description |
Review Comment:
Could you help me what the reason is to change the existing rows in this
table?
##########
.palantir/revapi.yml:
##########
@@ -1456,6 +1456,12 @@ acceptedBreaks:
old: "method void
org.apache.iceberg.PartitionStats::deletedEntry(org.apache.iceberg.Snapshot)"
new: "method void
org.apache.iceberg.PartitionStats::deletedEntry(org.apache.iceberg.Snapshot)"
justification: "Changing deprecated code"
+ - code: "java.method.numberOfParametersChanged"
Review Comment:
Just for the record: the reason this is protected is that TestableCachingCatalog can use it, passing in a mocked Ticker for testing. In my opinion it should have been annotated with `@VisibleForTesting` and made package-private, but this is how it is now.
Anyway, we shouldn't break the API under any circumstances, so as long as this PR contains changes like this, there are still some adjustments to do to keep the existing API intact.
Changing the signature of a protected method still breaks the API, because there could be applications that inherit from this class and use the protected methods; they would break with this PR.
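If the affected member is the protected `CachingCatalog` constructor that takes a `Ticker` (my assumption based on the revapi entry), one option is to keep the existing signature and delegate to the new one, roughly like this sketch:
```java
// Sketch: keep the pre-existing protected signature so subclasses such as
// TestableCachingCatalog (or external ones) keep working, and delegate to the
// new constructor with the default expiration policy.
protected CachingCatalog(
    Catalog catalog, boolean caseSensitive, long expirationIntervalMillis, Ticker ticker) {
  this(
      catalog,
      caseSensitive,
      expirationIntervalMillis,
      ticker,
      CatalogProperties.CACHE_EXPIRATION_POLICY_DEFAULT);
}
```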
##########
core/src/main/java/org/apache/iceberg/CatalogProperties.java:
##########
@@ -48,6 +48,27 @@ private CatalogProperties() {}
public static final boolean CACHE_CASE_SENSITIVE_DEFAULT = true;
+ /**
+ * Controls policy of caffeine cache
+ *
+ * <p>Behavior of specific values of cache.strategy:
+ *
+ * <ul>
+ * <li>EXPIRE_AFTER_ACCESS - cache entries are never evicted as long as they are being accessed
+ * frequently
+ * <li>EXPIRE_AFTER_WRITE - cache entries are evicted frequently after cache write
Review Comment:
`cache entries are evicted frequently` - I'm not sure what this means, TBH. They are evicted after a configured interval: if the interval is configured to be short, then frequently; if not, then not.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]