dennishuo commented on code in PR #1339:
URL: https://github.com/apache/polaris/pull/1339#discussion_r2072293174


##########
polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/InMemoryEntityCache.java:
##########
@@ -267,104 +189,70 @@ public void setCacheMode(EntityCacheMode cacheMode) {
       @Nonnull PolarisBaseEntity entityToValidate,
       int entityMinVersion,
       int entityGrantRecordsMinVersion) {
-    long entityCatalogId = entityToValidate.getCatalogId();
     long entityId = entityToValidate.getId();
-    PolarisEntityType entityType = entityToValidate.getType();
 
-    // first lookup the cache to find the existing cache entry
+    // First lookup the cache to find the existing cache entry
     ResolvedPolarisEntity existingCacheEntry = this.getEntityById(entityId);
 
-    // the caller's fetched entity may have come from a stale lookup byName; 
we should consider
-    // the existingCacheEntry to be the older of the two for purposes of 
invalidation to make
-    // sure when we replaceCacheEntry we're also removing the old name if it's 
no longer valid
-    EntityCacheByNameKey nameKey = new EntityCacheByNameKey(entityToValidate);
-    ResolvedPolarisEntity existingCacheEntryByName = 
this.getEntityByName(nameKey);
-    if (existingCacheEntryByName != null
-        && existingCacheEntry != null
-        && isNewer(existingCacheEntry, existingCacheEntryByName)) {
-      existingCacheEntry = existingCacheEntryByName;
+    // See if we need to load or refresh that entity
+    if (existingCacheEntry != null
+        && existingCacheEntry.getEntity().getEntityVersion() >= 
entityMinVersion
+        && existingCacheEntry.getEntity().getGrantRecordsVersion()
+            >= entityGrantRecordsMinVersion) {
+      // Cache hit and cache entry is up to date, nothing to do.
+      return existingCacheEntry;
     }
 
-    // the new one to be returned
-    final ResolvedPolarisEntity newCacheEntry;
-
-    // see if we need to load or refresh that entity
-    if (existingCacheEntry == null
-        || existingCacheEntry.getEntity().getEntityVersion() < entityMinVersion
-        || existingCacheEntry.getEntity().getGrantRecordsVersion() < 
entityGrantRecordsMinVersion) {
-
-      // the refreshed entity
-      final ResolvedEntityResult refreshedCacheEntry;
-
-      // was not found in the cache?
-      final PolarisBaseEntity entity;
-      final List<PolarisGrantRecord> grantRecords;
-      final int grantRecordsVersion;
-      if (existingCacheEntry == null) {
-        // try to load it
-        refreshedCacheEntry =
-            this.polarisMetaStoreManager.loadResolvedEntityById(
-                callContext, entityCatalogId, entityId, entityType);
-        if (refreshedCacheEntry.isSuccess()) {
-          entity = refreshedCacheEntry.getEntity();
-          grantRecords = refreshedCacheEntry.getEntityGrantRecords();
-          grantRecordsVersion = refreshedCacheEntry.getGrantRecordsVersion();
-        } else {
-          return null;
-        }
-      } else {
-        // refresh it
-        refreshedCacheEntry =
-            this.polarisMetaStoreManager.refreshResolvedEntity(
-                callContext,
-                existingCacheEntry.getEntity().getEntityVersion(),
-                existingCacheEntry.getEntity().getGrantRecordsVersion(),
-                entityType,
-                entityCatalogId,
-                entityId);
-        if (refreshedCacheEntry.isSuccess()) {
-          entity =
-              (refreshedCacheEntry.getEntity() != null)
-                  ? refreshedCacheEntry.getEntity()
-                  : existingCacheEntry.getEntity();
-          if (refreshedCacheEntry.getEntityGrantRecords() != null) {
-            grantRecords = refreshedCacheEntry.getEntityGrantRecords();
-            grantRecordsVersion = refreshedCacheEntry.getGrantRecordsVersion();
-          } else {
-            grantRecords = existingCacheEntry.getAllGrantRecords();
-            grantRecordsVersion = 
existingCacheEntry.getEntity().getGrantRecordsVersion();
-          }
-        } else {
-          // entity has been purged, remove it
-          this.removeCacheEntry(existingCacheEntry);
-          return null;
-        }
+    // Either cache miss, dropped entity or stale entity. In either case, 
invalidate and reload it.
+    // Do the refresh in a critical section based on the entity ID to avoid 
race conditions.
+    lock.writeLock().lock();
+    try {
+      // Lookup the cache again in case another thread has already invalidated 
it.
+      existingCacheEntry = this.getEntityById(entityId);
+
+      // See if the entity has been refreshed concurrently
+      if (existingCacheEntry != null
+          && existingCacheEntry.getEntity().getEntityVersion() >= 
entityMinVersion
+          && existingCacheEntry.getEntity().getGrantRecordsVersion()
+              >= entityGrantRecordsMinVersion) {
+        // Cache hit and cache entry is up to date now, exit
+        return existingCacheEntry;
+      }
+
+      // Either confirmed cache miss, or the entity is stale. Invalidate it 
just in case.
+      cache.invalidate(entityId);
+
+      // Get the entity from the cache or reload it now that it has been 
invalidated
+      EntityCacheLookupResult cacheLookupResult =
+          this.maybeLoadEntityById(
+              callContext, entityToValidate.getCatalogId(), entityId, 
entityToValidate.getType());
+      if (cacheLookupResult == null) {
+        // Entity has been purged, and we have already cleaned the cache and 
the name-to-id mapping,
+        // nothing else to do.
+        return null;
       }
 
-      // assert that entity, grant records and version are all set
-      callContext.getDiagServices().checkNotNull(entity, 
"unexpected_null_entity");
-      callContext.getDiagServices().checkNotNull(grantRecords, 
"unexpected_null_grant_records");
+      ResolvedPolarisEntity entity = cacheLookupResult.getCacheEntry();

Review Comment:
   It looks like we got rid of the codepath that optimizes "only loading what 
changed" via `refreshResolvedEntity` here. I agree that codepath is kind of 
gnarly and could be argued to be over-optimization, but it's indeed a 
behavioral change that we might want to be careful about.
   
   In particular, the case that it optimizes for is:
   
   1. An entity's main contents are frequently updated
   2. That entity's grants only rarely change
   3. That entity has lots of grants associated with it
   4. Grants might be more expensive to load because they are a range scan 
rather than a point-lookup
   
   With `refreshResolvedEntity`, the cache will eliminate all load on the 
"grants lookup" index in the database, even if the entity itself is fast 
changing. For a single entity this isn't a huge deal probably, but if there's 
100,000 fast-changing entities in a system, the blowup of grant re-lookup could 
be noticeable.
   
   If we do make this entire cache implementation just one option out of 
several, I guess it's fine to wait until the performance issue actually comes 
up to re-add the complexity (or allow the service owner to select the other 
cache implementation).



##########
polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/InMemoryEntityCache.java:
##########
@@ -20,194 +20,99 @@
 
 import com.github.benmanes.caffeine.cache.Cache;
 import com.github.benmanes.caffeine.cache.Caffeine;
-import com.github.benmanes.caffeine.cache.RemovalListener;
+import com.github.benmanes.caffeine.cache.RemovalCause;
+import com.google.common.collect.HashBiMap;
+import com.google.common.util.concurrent.MoreExecutors;
 import jakarta.annotation.Nonnull;
 import jakarta.annotation.Nullable;
-import java.util.AbstractMap;
-import java.util.List;
-import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.TimeUnit;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+import net.jcip.annotations.GuardedBy;
 import org.apache.polaris.core.PolarisCallContext;
 import org.apache.polaris.core.config.BehaviorChangeConfiguration;
 import org.apache.polaris.core.config.FeatureConfiguration;
 import org.apache.polaris.core.config.PolarisConfiguration;
 import org.apache.polaris.core.entity.PolarisBaseEntity;
 import org.apache.polaris.core.entity.PolarisEntityType;
-import org.apache.polaris.core.entity.PolarisGrantRecord;
 import org.apache.polaris.core.persistence.PolarisMetaStoreManager;
 import org.apache.polaris.core.persistence.ResolvedPolarisEntity;
 import org.apache.polaris.core.persistence.dao.entity.ResolvedEntityResult;
 
-/** An in-memory entity cache with a limit of 100k entities and a 1h TTL. */
+/** The entity cache, can be private or shared */
 public class InMemoryEntityCache implements EntityCache {
-
   // cache mode
   private EntityCacheMode cacheMode;
 
   // the meta store manager
   private final PolarisMetaStoreManager polarisMetaStoreManager;
 
-  // Caffeine cache to keep entries by id
-  private final Cache<Long, ResolvedPolarisEntity> byId;
+  // Caffeine cache to keep entries
+  @GuardedBy("lock")
+  private final Cache<Long, ResolvedPolarisEntity> cache;
+
+  // Map of entity names to entity ids
+  @GuardedBy("lock")
+  private final HashBiMap<Long, EntityCacheByNameKey> nameToIdMap;
 
-  // index by name
-  private final AbstractMap<EntityCacheByNameKey, ResolvedPolarisEntity> 
byName;
+  // Locks to ensure that an entity can only be refreshed by one thread at a 
time
+  private final ReentrantReadWriteLock lock;
 
   /**
    * Constructor. Cache can be private or shared
    *
    * @param polarisMetaStoreManager the meta store manager implementation
    */
   public InMemoryEntityCache(@Nonnull PolarisMetaStoreManager 
polarisMetaStoreManager) {
+    // remember the meta store manager
+    this.polarisMetaStoreManager = polarisMetaStoreManager;
 
-    // by name cache
-    this.byName = new ConcurrentHashMap<>();
-
-    // When an entry is removed, we simply remove it from the byName map
-    RemovalListener<Long, ResolvedPolarisEntity> removalListener =
-        (key, value, cause) -> {
-          if (value != null) {
-            // compute name key
-            EntityCacheByNameKey nameKey = new 
EntityCacheByNameKey(value.getEntity());
+    // enabled by default
+    this.cacheMode = EntityCacheMode.ENABLE;
 
-            // if it is still active, remove it from the name key
-            this.byName.remove(nameKey, value);
-          }
-        };
+    this.nameToIdMap = HashBiMap.create();
+    this.lock = new ReentrantReadWriteLock();
 
     long weigherTarget =
         
PolarisConfiguration.loadConfig(FeatureConfiguration.ENTITY_CACHE_WEIGHER_TARGET);
-    Caffeine<Long, ResolvedPolarisEntity> byIdBuilder =
+    Caffeine<Long, ResolvedPolarisEntity> cacheBuilder =
         Caffeine.newBuilder()
             .maximumWeight(weigherTarget)
             .weigher(EntityWeigher.asWeigher())
             .expireAfterAccess(1, TimeUnit.HOURS) // Expire entries after 1 
hour of no access
-            .removalListener(removalListener); // Set the removal listener
+            // Add a removal listener so that the name-to-id mapping is also 
purged after every
+            // cache
+            // replacement, invalidation or expiration .  Configure a direct 
executor so that the
+            // removal
+            // listener is invoked by the same thread that is updating the 
cache.

Review Comment:
   Is there a window of time between when the native removal happens (without 
holding any writeLock) and when the listener runs where the two mappings get 
out of sync?



##########
polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/InMemoryEntityCache.java:
##########
@@ -267,104 +189,70 @@ public void setCacheMode(EntityCacheMode cacheMode) {
       @Nonnull PolarisBaseEntity entityToValidate,
       int entityMinVersion,
       int entityGrantRecordsMinVersion) {
-    long entityCatalogId = entityToValidate.getCatalogId();
     long entityId = entityToValidate.getId();
-    PolarisEntityType entityType = entityToValidate.getType();
 
-    // first lookup the cache to find the existing cache entry
+    // First lookup the cache to find the existing cache entry
     ResolvedPolarisEntity existingCacheEntry = this.getEntityById(entityId);
 
-    // the caller's fetched entity may have come from a stale lookup byName; 
we should consider
-    // the existingCacheEntry to be the older of the two for purposes of 
invalidation to make
-    // sure when we replaceCacheEntry we're also removing the old name if it's 
no longer valid
-    EntityCacheByNameKey nameKey = new EntityCacheByNameKey(entityToValidate);
-    ResolvedPolarisEntity existingCacheEntryByName = 
this.getEntityByName(nameKey);
-    if (existingCacheEntryByName != null
-        && existingCacheEntry != null
-        && isNewer(existingCacheEntry, existingCacheEntryByName)) {
-      existingCacheEntry = existingCacheEntryByName;
+    // See if we need to load or refresh that entity
+    if (existingCacheEntry != null
+        && existingCacheEntry.getEntity().getEntityVersion() >= 
entityMinVersion
+        && existingCacheEntry.getEntity().getGrantRecordsVersion()
+            >= entityGrantRecordsMinVersion) {
+      // Cache hit and cache entry is up to date, nothing to do.
+      return existingCacheEntry;
     }
 
-    // the new one to be returned
-    final ResolvedPolarisEntity newCacheEntry;
-
-    // see if we need to load or refresh that entity
-    if (existingCacheEntry == null
-        || existingCacheEntry.getEntity().getEntityVersion() < entityMinVersion
-        || existingCacheEntry.getEntity().getGrantRecordsVersion() < 
entityGrantRecordsMinVersion) {
-
-      // the refreshed entity
-      final ResolvedEntityResult refreshedCacheEntry;
-
-      // was not found in the cache?
-      final PolarisBaseEntity entity;
-      final List<PolarisGrantRecord> grantRecords;
-      final int grantRecordsVersion;
-      if (existingCacheEntry == null) {
-        // try to load it
-        refreshedCacheEntry =
-            this.polarisMetaStoreManager.loadResolvedEntityById(
-                callContext, entityCatalogId, entityId, entityType);
-        if (refreshedCacheEntry.isSuccess()) {
-          entity = refreshedCacheEntry.getEntity();
-          grantRecords = refreshedCacheEntry.getEntityGrantRecords();
-          grantRecordsVersion = refreshedCacheEntry.getGrantRecordsVersion();
-        } else {
-          return null;
-        }
-      } else {
-        // refresh it
-        refreshedCacheEntry =
-            this.polarisMetaStoreManager.refreshResolvedEntity(
-                callContext,
-                existingCacheEntry.getEntity().getEntityVersion(),
-                existingCacheEntry.getEntity().getGrantRecordsVersion(),
-                entityType,
-                entityCatalogId,
-                entityId);
-        if (refreshedCacheEntry.isSuccess()) {
-          entity =
-              (refreshedCacheEntry.getEntity() != null)
-                  ? refreshedCacheEntry.getEntity()
-                  : existingCacheEntry.getEntity();
-          if (refreshedCacheEntry.getEntityGrantRecords() != null) {
-            grantRecords = refreshedCacheEntry.getEntityGrantRecords();
-            grantRecordsVersion = refreshedCacheEntry.getGrantRecordsVersion();
-          } else {
-            grantRecords = existingCacheEntry.getAllGrantRecords();
-            grantRecordsVersion = 
existingCacheEntry.getEntity().getGrantRecordsVersion();
-          }
-        } else {
-          // entity has been purged, remove it
-          this.removeCacheEntry(existingCacheEntry);
-          return null;
-        }
+    // Either cache miss, dropped entity or stale entity. In either case, 
invalidate and reload it.
+    // Do the refresh in a critical section based on the entity ID to avoid 
race conditions.
+    lock.writeLock().lock();
+    try {
+      // Lookup the cache again in case another thread has already invalidated 
it.
+      existingCacheEntry = this.getEntityById(entityId);
+
+      // See if the entity has been refreshed concurrently
+      if (existingCacheEntry != null
+          && existingCacheEntry.getEntity().getEntityVersion() >= 
entityMinVersion
+          && existingCacheEntry.getEntity().getGrantRecordsVersion()
+              >= entityGrantRecordsMinVersion) {
+        // Cache hit and cache entry is up to date now, exit
+        return existingCacheEntry;
+      }
+
+      // Either confirmed cache miss, or the entity is stale. Invalidate it 
just in case.
+      cache.invalidate(entityId);
+
+      // Get the entity from the cache or reload it now that it has been 
invalidated
+      EntityCacheLookupResult cacheLookupResult =
+          this.maybeLoadEntityById(
+              callContext, entityToValidate.getCatalogId(), entityId, 
entityToValidate.getType());
+      if (cacheLookupResult == null) {
+        // Entity has been purged, and we have already cleaned the cache and 
the name-to-id mapping,

Review Comment:
   Does this `already cleaned the cache and the name-to-id mapping` refer to 
line 223 where we called `cache.invalidate`? Where does the by-name map get 
cleaned here? Should line 223 call `this.remove` instead?
   
   Also, perhaps reiterate in this comment the fact that this was done within 
the write-locked block which is why we don't have to think more deeply about 
concurrent invalidate vs cacheNewEntry attempts, e.g. `...and we have already 
cleaned the cache and the name-to-id mapping while holding the writeLock,...`
   
   



##########
polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/InMemoryEntityCache.java:
##########
@@ -378,50 +266,83 @@ && isNewer(existingCacheEntry, existingCacheEntryByName)) 
{
       long entityCatalogId,
       long entityId,
       PolarisEntityType entityType) {
+    ResolvedPolarisEntity cachedEntity = getEntityById(entityId);
+    if (cachedEntity != null) {
+      // Cache hit, nothing more to do
+      return new EntityCacheLookupResult(cachedEntity, true);
+    }
 
-    // if it exists, we are set
-    ResolvedPolarisEntity entry = this.getEntityById(entityId);
-    final boolean cacheHit;
+    // Cache miss, we may have to load the entity from the metastore
+    return maybeLoadEntityById(callContext, entityCatalogId, entityId, 
entityType);
+  }
 
-    // we need to load it if it does not exist
-    if (entry == null) {
-      // this is a miss
-      cacheHit = false;
+  private EntityCacheLookupResult maybeLoadEntityById(
+      PolarisCallContext callContext,
+      long entityCatalogId,
+      long entityId,
+      PolarisEntityType entityType) {
+    lock.writeLock().lock();
+    try {
+      ResolvedPolarisEntity cachedEntity = getEntityById(entityId);
+      if (cachedEntity != null) {
+        // Cache was populated concurrently, nothing more to do, exit the 
critical section as
+        // quickly as possible
+        return new EntityCacheLookupResult(cachedEntity, true);
+      }
 
-      // load it
       ResolvedEntityResult result =
           polarisMetaStoreManager.loadResolvedEntityById(
               callContext, entityCatalogId, entityId, entityType);
-
-      // not found, exit
       if (!result.isSuccess()) {
+        // Entity not found, and it was not in the cache, nothing else to do

Review Comment:
   Although this is already a latent issue, could you also update the comment 
here with a TODO to also check that the `result.getReturnStatus()` is indeed 
`ENTITY_NOT_FOUND`? We might also want a github issue to audit all the 
callsites of `isSuccess()` to find situations where a caller assumes the only 
possible non-success is a particular type of failure (e.g. assuming not-found 
or assuming conflicting-update, etc., when it could've been something else)



##########
polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/InMemoryEntityCache.java:
##########
@@ -267,104 +189,70 @@ public void setCacheMode(EntityCacheMode cacheMode) {
       @Nonnull PolarisBaseEntity entityToValidate,
       int entityMinVersion,
       int entityGrantRecordsMinVersion) {
-    long entityCatalogId = entityToValidate.getCatalogId();
     long entityId = entityToValidate.getId();
-    PolarisEntityType entityType = entityToValidate.getType();
 
-    // first lookup the cache to find the existing cache entry
+    // First lookup the cache to find the existing cache entry
     ResolvedPolarisEntity existingCacheEntry = this.getEntityById(entityId);
 
-    // the caller's fetched entity may have come from a stale lookup byName; 
we should consider
-    // the existingCacheEntry to be the older of the two for purposes of 
invalidation to make
-    // sure when we replaceCacheEntry we're also removing the old name if it's 
no longer valid
-    EntityCacheByNameKey nameKey = new EntityCacheByNameKey(entityToValidate);
-    ResolvedPolarisEntity existingCacheEntryByName = 
this.getEntityByName(nameKey);
-    if (existingCacheEntryByName != null
-        && existingCacheEntry != null
-        && isNewer(existingCacheEntry, existingCacheEntryByName)) {
-      existingCacheEntry = existingCacheEntryByName;
+    // See if we need to load or refresh that entity
+    if (existingCacheEntry != null
+        && existingCacheEntry.getEntity().getEntityVersion() >= 
entityMinVersion
+        && existingCacheEntry.getEntity().getGrantRecordsVersion()
+            >= entityGrantRecordsMinVersion) {
+      // Cache hit and cache entry is up to date, nothing to do.
+      return existingCacheEntry;
     }
 
-    // the new one to be returned
-    final ResolvedPolarisEntity newCacheEntry;
-
-    // see if we need to load or refresh that entity
-    if (existingCacheEntry == null
-        || existingCacheEntry.getEntity().getEntityVersion() < entityMinVersion
-        || existingCacheEntry.getEntity().getGrantRecordsVersion() < 
entityGrantRecordsMinVersion) {
-
-      // the refreshed entity
-      final ResolvedEntityResult refreshedCacheEntry;
-
-      // was not found in the cache?
-      final PolarisBaseEntity entity;
-      final List<PolarisGrantRecord> grantRecords;
-      final int grantRecordsVersion;
-      if (existingCacheEntry == null) {
-        // try to load it
-        refreshedCacheEntry =
-            this.polarisMetaStoreManager.loadResolvedEntityById(
-                callContext, entityCatalogId, entityId, entityType);
-        if (refreshedCacheEntry.isSuccess()) {
-          entity = refreshedCacheEntry.getEntity();
-          grantRecords = refreshedCacheEntry.getEntityGrantRecords();
-          grantRecordsVersion = refreshedCacheEntry.getGrantRecordsVersion();
-        } else {
-          return null;
-        }
-      } else {
-        // refresh it
-        refreshedCacheEntry =
-            this.polarisMetaStoreManager.refreshResolvedEntity(
-                callContext,
-                existingCacheEntry.getEntity().getEntityVersion(),
-                existingCacheEntry.getEntity().getGrantRecordsVersion(),
-                entityType,
-                entityCatalogId,
-                entityId);
-        if (refreshedCacheEntry.isSuccess()) {
-          entity =
-              (refreshedCacheEntry.getEntity() != null)
-                  ? refreshedCacheEntry.getEntity()
-                  : existingCacheEntry.getEntity();
-          if (refreshedCacheEntry.getEntityGrantRecords() != null) {
-            grantRecords = refreshedCacheEntry.getEntityGrantRecords();
-            grantRecordsVersion = refreshedCacheEntry.getGrantRecordsVersion();
-          } else {
-            grantRecords = existingCacheEntry.getAllGrantRecords();
-            grantRecordsVersion = 
existingCacheEntry.getEntity().getGrantRecordsVersion();
-          }
-        } else {
-          // entity has been purged, remove it
-          this.removeCacheEntry(existingCacheEntry);
-          return null;
-        }
+    // Either cache miss, dropped entity or stale entity. In either case, 
invalidate and reload it.
+    // Do the refresh in a critical section based on the entity ID to avoid 
race conditions.
+    lock.writeLock().lock();
+    try {
+      // Lookup the cache again in case another thread has already invalidated 
it.
+      existingCacheEntry = this.getEntityById(entityId);
+
+      // See if the entity has been refreshed concurrently
+      if (existingCacheEntry != null
+          && existingCacheEntry.getEntity().getEntityVersion() >= 
entityMinVersion
+          && existingCacheEntry.getEntity().getGrantRecordsVersion()
+              >= entityGrantRecordsMinVersion) {
+        // Cache hit and cache entry is up to date now, exit
+        return existingCacheEntry;
+      }
+
+      // Either confirmed cache miss, or the entity is stale. Invalidate it 
just in case.
+      cache.invalidate(entityId);
+
+      // Get the entity from the cache or reload it now that it has been 
invalidated
+      EntityCacheLookupResult cacheLookupResult =
+          this.maybeLoadEntityById(

Review Comment:
   Though I see the appeal of eliminating "redundant database calls", the 
fundamental tradeoff is indeed whether to allow redundant database calls but 
lock a smaller section vs locking the bigger section for more contention.
   
   In a way this impl is using the writeLock for two distinct purposes:
   
   1. Protect the in-memory atomicity of byName and byId, e.g. a "data 
structure" lock
   2. Prevent concurrent writers from doing duplicate fetches
   
   The semantic of (2) only really requires "locking" the exact entityId (and 
perhaps entityName) being contended, essentially letting later read-through 
attempts wait on a Future for something already fetching inflight. The problem 
is that we probably don't want to deal with the complexity of fine-grained 
per-entity-id locks/semaphores, so now the writeLock causes 
cross-catalog/cross-entity contention.
   
   The behavior here basically changes two ways now:
   
   1. If the distributed database has some poisoned ranges/catalogIds that are 
super-slow or hang forever, the writeLock now allows the poisoned ranges to 
pollute the ability to serve requests (since for now cache-updates aren't 
"best-effort" but actually block serving the request entirely) on other 
catalogs/entities
   2. Even if the database backend isn't systematically poisoned, but if 
someone uses a stateless proxy layer in front of the distributed database like 
some managed Polaris services do today, any hiccups in the stateless layer 
effectively allow cross-service pollution of latency spikes in situations where 
"duplicate" attempts might've allowed 9 out of 10 requests to succeed from a 
healthy proxy layer even if the first-to-arrive attempt happens to get stuck on 
an unlucky host. This case is problematic even if we do introduce fine-grained 
locking.
   
   If we split out this entire `InMemoryEntityCache` implementation as a 
separate option, then users who have simple database backend layers that don't 
exhibit such heterogeneity can choose this one, while those with more 
distributed layers might choose the original impl.
   
   We can also add TODOs to maybe provide a configurable option to this cache 
for whether to optimize for "no duplicate database fetches" or to optimize for 
liveness in the face of heterogeneous backend unhealthiness. Then the latter 
would do the call to the MetaStoreManager outside of any locks and reconcile 
after re-acquiring the lock. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to