RussellSpitzer commented on PR #11781: URL: https://github.com/apache/iceberg/pull/11781#issuecomment-2569914295
From a discussion I had with @sopel39 today: I think we can go forward with this solution, but I think it will basically re-introduce the memory usage issue that we saw previously (for some use cases). From our discussion, I believe we have been working at this from the wrong direction.

## Recap

- The original implementation basically assumed that we could read as many iterators as parallelism allowed. This led to essentially unbounded memory usage, which became an issue for systems using a lightweight coordinator, since they now required a huge memory footprint.
- Next, to solve the memory footprint issue, we added what is essentially a buffered read-ahead. A queue is filled with elements from the files we are reading in parallel, and we check the queue depth every time we add an element in order to bound its size. The max queue size is checked on every pull, so we can never go more than `parallelism` items over the max queue size. Unfortunately, this leads to the current deadlock issue: we can yield an iterator in the middle of a file and be left only with iterators for files that cannot yet be opened, because all file handles are owned by yielded iterators.
- The current proposed solution is to switch from checking the queue size per element to checking it only before opening a new file for the first time. This means that any file that is opened is read completely into the read-ahead queue. That fixes the deadlock, since we never yield in the middle of a file, but it possibly reintroduces the memory issue: we could open up to `parallelism` files and load them all into the queue before having a chance to read from it, so we may be back where we started.

## Where do we go from here

The current implementation is essentially trying to solve: *How do we read a bunch of unbounded streams in parallel without consuming too much memory?* But this is more general than what we actually need. Our real problem is: *How do we read from a bunch of files without reading too many records at a time?* The key difference is that we know exactly how long each file is *before* we open it. Instead of simply opening every file and checking how much we've actually read in parallel, we can open only as many files as are required to fill our read-ahead buffer and yield on the others.

I think what we should do is something like this (haven't thought this part through too much):

```
AtomicLong usedBuffer

if (file not opened) {
  synchronized {
    if (usedBuffer + file.length < MAX_BUFFER) {
      usedBuffer += file.length
    } else {
      yield  // do not open this file yet
    }
  }
}

read file into buffer

onNext {
  usedBuffer--  // release budget as elements are consumed
}
```
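To make the per-file budget concrete, here is a minimal, self-contained sketch of what the bookkeeping above could look like in Java. None of this is existing Iceberg code: the class and method names (`ReadAheadBudget`, `tryAcquire`, `release`) are hypothetical, it uses a plain `synchronized` guard in place of the `AtomicLong` from the sketch, and the rule of always admitting a file when nothing else is buffered is an illustrative addition to avoid starving files larger than the budget.

```java
/**
 * Hypothetical sketch of a read-ahead budget that is reserved for a file's
 * whole length before the file is opened, and released as buffered elements
 * are drained by the consumer. Not Iceberg code; units (bytes vs. records)
 * are left to the caller.
 */
class ReadAheadBudget {
  private final long maxBuffer;
  private long used = 0;

  ReadAheadBudget(long maxBuffer) {
    this.maxBuffer = maxBuffer;
  }

  /**
   * Reserve room for an entire file before opening it. Returns false when
   * the file would push us over the budget, in which case the caller yields
   * the iterator without ever opening a file handle. A file is always
   * admitted when nothing else is buffered, so a file larger than the budget
   * cannot starve (an illustrative choice, not from the sketch above).
   */
  synchronized boolean tryAcquire(long fileLength) {
    if (used > 0 && used + fileLength > maxBuffer) {
      return false;
    }
    used += fileLength;
    return true;
  }

  /** Release budget as buffered elements are consumed downstream. */
  synchronized void release(long amount) {
    used -= amount;
  }
}
```

The caller-side check would then happen once per file rather than once per element, e.g. `if (budget.tryAcquire(task.length())) { openAndBufferWholeFile(task); } else { yield(); }` (both helpers hypothetical). A yielded iterator never holds an open handle, so the deadlock cannot occur, and buffered data stays bounded by roughly `MAX_BUFFER` plus one file.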