kevinjqliu commented on issue #1162:
URL: https://github.com/apache/iceberg-python/issues/1162#issuecomment-2364113301

   I took a step back and realized the fundamental issue was the newly 
introduced cache.
   
   Without the cache, everything works fine.
   With the cache, things break. 
   
   Going a layer deeper, this probably means the bug is only for cache hits, as 
cache misses will just recompute. 
   So the failure scenario is when the cache hits, but the return value is 
wrong. 
   
   Fundamentally, there are a couple of issues with the function definition:
   ```
   @lru_cache
   def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
       """Return the manifests from the manifest list."""
       file = io.new_input(manifest_list)
       return list(read_manifest_list(file))
   ```
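   First, `lru_cache` builds its key from every argument, so the cache key is 
both `io` and `manifest_list`, whereas we only want `manifest_list` to be the key.
   Second, the result is a list, which is mutable. Every cache hit returns the 
same list object, so a caller that mutates it silently changes what later 
callers get back, leading to the wrong result.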
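
To make both issues concrete, here is a minimal, self-contained sketch. `FakeIO` and the returned path strings are hypothetical stand-ins for `FileIO` and `ManifestFile` (assumptions for illustration, not the pyiceberg classes), and the "read" is simulated rather than touching storage.

```python
from functools import lru_cache
from typing import List


class FakeIO:
    """Hypothetical stand-in for FileIO; instances hash by identity."""


reads: List[str] = []  # records every simulated read of a manifest list


@lru_cache
def _manifests(io: FakeIO, manifest_list: str) -> List[str]:
    """Simulate reading the manifests; the cache key is (io, manifest_list)."""
    reads.append(manifest_list)
    return [f"{manifest_list}/m0.avro", f"{manifest_list}/m1.avro"]


io_a, io_b = FakeIO(), FakeIO()

# Issue 1: io is part of the key, so a different io instance is a cache miss
# even though the manifest list is identical.
_manifests(io_a, "s3://bucket/snap-1.avro")
_manifests(io_b, "s3://bucket/snap-1.avro")
print(len(reads))  # 2

# Issue 2: a cache hit hands back the very same list object, so a mutation
# by one caller is visible to every later caller.
first = _manifests(io_a, "s3://bucket/snap-1.avro")
first.clear()
second = _manifests(io_a, "s3://bucket/snap-1.avro")
print(second)  # []
```

This only illustrates the two properties of the decorated function; it does not claim to reproduce the actual failure seen in the issue.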
   
   Here’s an example that showcases the different cache keys:
   ```
   cache = {}
   
   def _manifests(io: FileIO, manifest_list: str, snapshot: Snapshot) -> List[ManifestFile]:
       """Return the manifests from the manifest list."""
       # key = (manifest_list, )  # works
       # key = (manifest_list, io)  # fails
       key = (manifest_list, snapshot)  # works
       if key in cache:
           return cache[key]
       cache[key] = list(read_manifest_list(io.new_input(manifest_list)))
       return cache[key]
   ```
   
   Without digging into where it is breaking or why only for M1 Macs, there are 
2 potential solutions:
   1. Move the manifest cache to the Snapshot instance
   2. Use the `cachetools` library to specify `manifest_list` as the only cache 
key (see [Stack 
Overflow](https://stackoverflow.com/questions/30730983/make-lru-cache-ignore-some-of-the-function-arguments))
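
As a rough illustration of option 1, the sketch below keeps the cache on the snapshot object itself, so the key is implicitly the snapshot's own `manifest_list` and `io` never enters into it. Everything here is an assumption for illustration: the `Snapshot` dataclass is a simplified stand-in, not the pyiceberg class, and the `read` callable stands in for `read_manifest_list(io.new_input(...))`. Storing a tuple and returning a fresh list is one possible guard against the mutation problem, not necessarily what the project would choose.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Snapshot:
    """Hypothetical, simplified stand-in for a snapshot with its own cache."""

    snapshot_id: int
    manifest_list: str
    # Per-instance cache, excluded from __init__, repr, and equality.
    _cached_manifests: Optional[Tuple[str, ...]] = field(
        default=None, init=False, repr=False, compare=False
    )

    def manifests(self, read: Callable[[str], List[str]]) -> List[str]:
        """Read the manifest list once, then serve copies from the cache."""
        if self._cached_manifests is None:
            # Store an immutable tuple so callers cannot corrupt the cache.
            self._cached_manifests = tuple(read(self.manifest_list))
        return list(self._cached_manifests)


calls: List[str] = []  # records every simulated read


def fake_read(manifest_list: str) -> List[str]:
    """Hypothetical reader that simulates loading a manifest list."""
    calls.append(manifest_list)
    return [manifest_list + "/m0.avro"]


snap = Snapshot(1, "s3://bucket/snap-1.avro")
a = snap.manifests(fake_read)
a.append("junk")  # mutating the returned copy does not touch the cache
b = snap.manifests(fake_read)
print(len(calls), b)  # 1 ['s3://bucket/snap-1.avro/m0.avro']
```

The trade-off is that the cache lives and dies with each snapshot instance, so two instances pointing at the same manifest list would each read it once.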


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

