kevinjqliu commented on issue #1162: URL: https://github.com/apache/iceberg-python/issues/1162#issuecomment-2364113301
I took a step back and realized the fundamental issue was the newly introduced cache. Without the cache, everything works fine. With the cache, things break. Going a layer deeper, this probably means the bug only occurs on cache hits, since cache misses just recompute. So the failure scenario is when the cache hits but the return value is wrong.

Fundamentally, there are a couple of issues with the function definition:

```
@lru_cache
def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
    """Return the manifests from the manifest list."""
    file = io.new_input(manifest_list)
    return list(read_manifest_list(file))
```

First, the cache key is both `io` and `manifest_list`, whereas we just want the key to be `manifest_list`.
Second, the result is a list, which can be mutated, leading to the wrong result on a later cache hit.

Here's an example to showcase the different cache keys:

```
cache = {}

def _manifests(io: FileIO, manifest_list: str, snapshot: Snapshot) -> List[ManifestFile]:
    """Return the manifests from the manifest list."""
    # key = (manifest_list, )  # works
    # key = (manifest_list, io)  # fails
    key = (manifest_list, snapshot)  # works
    if key in cache:
        return cache[key]
    cache[key] = list(read_manifest_list(io.new_input(manifest_list)))
    return cache[key]
```

Without digging into where it is breaking or why it only breaks on M1 Macs, there are 2 potential solutions:

1. Move the manifest cache to the `Snapshot` instance
2. Use the `cachetools` library to specify `manifest_list` as the only cache key (see [Stack Overflow](https://stackoverflow.com/questions/30730983/make-lru-cache-ignore-some-of-the-function-arguments))
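To make the two issues above concrete, here is a minimal, standard-library-only sketch. It does not try to reproduce the M1-specific failure, and it is not the fix that was applied to the project. `FakeIO`, `manifests_lru`, and `manifests_by_path` are hypothetical names invented for this illustration; `FakeIO` merely stands in for `FileIO` and returns strings in place of `ManifestFile` objects.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


class FakeIO:
    """Hypothetical stand-in for FileIO; counts how many reads it serves."""

    def __init__(self) -> None:
        self.reads = 0

    def read_manifest_list(self, manifest_list: str) -> List[str]:
        self.reads += 1
        return [f"{manifest_list}/manifest-{i}.avro" for i in range(2)]


@lru_cache
def manifests_lru(io: FakeIO, manifest_list: str) -> List[str]:
    # Cache key is (io, manifest_list); the cached value is a mutable list.
    return io.read_manifest_list(manifest_list)


_cache: Dict[str, Tuple[str, ...]] = {}


def manifests_by_path(io: FakeIO, manifest_list: str) -> Tuple[str, ...]:
    # Cache key is manifest_list only; the cached value is an immutable tuple.
    if manifest_list not in _cache:
        _cache[manifest_list] = tuple(io.read_manifest_list(manifest_list))
    return _cache[manifest_list]


io_a, io_b = FakeIO(), FakeIO()

# Issue 1: a different io object is a different key, so this is a second miss.
manifests_lru(io_a, "s3://bucket/snap-1.avro")
manifests_lru(io_b, "s3://bucket/snap-1.avro")
lru_misses = manifests_lru.cache_info().misses

# Issue 2: mutating the returned list changes what the next cache hit returns.
manifests_lru(io_a, "s3://bucket/snap-1.avro").clear()
corrupted = manifests_lru(io_a, "s3://bucket/snap-1.avro")

# Keyed on manifest_list alone, the second io object is never read from.
first = manifests_by_path(io_a, "s3://bucket/snap-2.avro")
second = manifests_by_path(io_b, "s3://bucket/snap-2.avro")
```

Returning a tuple is just one way to keep callers from mutating the cached value; returning a fresh copy of the list on every call would be another option, at the cost of an extra allocation per hit.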
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org