Hey, a few folks reached out indicating that I didn't properly share the last v4 metadata tree meeting recording. So sorry about that! Here's the link <https://drive.google.com/file/d/1LhDL0Iy8YR4RN_W3D8APOUtkSBYk61fD/view?usp=drive_link>; do let me know if there are still issues.
On Tue, Mar 3, 2026 at 9:17 AM Steven Wu <[email protected]> wrote: > My takeaway from the conversation is also that we don't need row-level > column updates. Manifest DV can be used for row-level updates instead. > Basically, a file (manifest or data) can be updated via (1) delete vector + > updated rows in a new file (2) column file overlay. Depending on the > percentage of modified rows, engines can choose which way to go. > > On Tue, Mar 3, 2026 at 6:24 AM Gábor Kaszab <[email protected]> > wrote: > >> Thanks for the summary, Micah! I tried to watch the recording linked to >> the calendar event, but apparently I don't have permission to do so. Not >> sure about others. >> >> So if I'm not mistaken, one way to reduce the write cost of an UPDATE for >> colocated DVs is to use the column updates. As I see it, there was some >> agreement that row-level partial column updates aren't desired, and we aim >> for at least file-level column updates. This is very useful information for >> the other conversation >> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny> going >> on for the column update proposal. We can bring this up on the column >> update sync tomorrow, but I'm wondering if the consensus on avoiding >> row-level column updates is something we can incorporate into the column >> update proposal too or if it's something still up for debate. >> >> Best Regards, >> Gabor >> >> On Wed, Feb 25, 2026 at 22:30, Micah Kornfield <[email protected]> >> wrote: >> >>> Just wanted to summarize my main takeaways of Monday's sync. >>> >>> The approach will always collocate DVs with the data files (i.e. every >>> data file row in a manifest has an optional DV reference). This implies >>> that there is not a separate "Deletion manifest". Rather, in V4 all >>> manifests are "combined" where data files and DVs are colocated. >>> >>> Write amplification is avoided in two ways: >>> 1. 
For small updates we will need to carry through metadata statistics >>> (and other relevant data file fields) in memory (rescanning these is likely >>> too expensive). Once updates are available they will be written out to a >>> new manifest (either root or leaf) and use metadata DVs to remove the old >>> rows. >>> 2. For larger updates we will only carry through the DV update parts in >>> memory and use column-level updates to replace existing DVs (this would >>> require rescanning the DV columns for any updated manifest to merge with >>> the updated DVs in memory, and then writing out the column update). The >>> consensus on the call was that we didn't want to support partial column >>> updates (a.k.a. merge-on-read column updates). >>> >>> The idea is that engines would decide which path to follow based on the >>> number of affected files. >>> >>> To help understand the implications of the new proposal, I put together >>> a quick spreadsheet [1] to analyze trade-offs between separate deletion >>> manifests and the new approach under scenarios 1 and 2. This represents the >>> worst-case scenario where file updates are uniformly distributed across a >>> single update operation. It does not account for repeated writes (e.g. >>> on-going compaction). My main takeaway is that keeping at most 1 >>> affiliated DV separate might still help (akin to a merge-on-read column >>> update), but maybe not enough relative to other parts of the system (e.g. >>> the churn on data files) to justify the complexity. >>> >>> Hope this is helpful. >>> >>> Micah >>> >>> [1] >>> https://docs.google.com/spreadsheets/d/1klZQxV7ST2C-p9LTMmai_5rtFiyupj6jSLRPRkdI-u8/edit?gid=0#gid=0 >>> >>> >>> >>> On Thu, Feb 19, 2026 at 3:52 PM Amogh Jahagirdar <[email protected]> >>> wrote: >>> >>>> Hey folks, I've set up an additional initial discussion on DVs for >>>> Monday. This topic is fairly complex and there is also now a free calendar >>>> slot. 
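The two write paths in Micah's summary amount to a simple cost-based switch in the engine. As a rough illustration only (the function name and the 10% threshold are invented here, not part of the V4 proposal), the decision might look like:

```python
# Illustrative sketch (not Iceberg code): an engine choosing between the
# two DV update paths described in the thread. The 10% threshold is a
# made-up example; a real engine would tune this from benchmarks.

def choose_dv_update_path(num_updated_files: int,
                          num_files_in_manifest: int,
                          threshold: float = 0.10) -> str:
    """Return which write path to use for a manifest whose DVs changed.

    "metadata-dv": carry the affected entries' stats through in memory,
    write them to a new manifest, and mark the old rows deleted with a
    metadata DV. Cheap when few files are touched.

    "column-update": merge the in-memory DV changes with the existing DV
    column and write a column-level update that replaces the DV column
    wholesale. Cheaper when most files are touched.
    """
    if num_files_in_manifest == 0:
        raise ValueError("empty manifest")
    fraction = num_updated_files / num_files_in_manifest
    return "metadata-dv" if fraction < threshold else "column-update"

print(choose_dv_update_path(3, 1000))    # small update touching few files
print(choose_dv_update_path(800, 1000))  # large update touching most files
```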
I think it'd be helpful for us to first make sure we're all on the >>>> same page in terms of what the approach proposed by Anton earlier in the >>>> thread means and the high-level mechanics. I should also have more to share >>>> on the doc about how the entry structure and change detection could look >>>> in this approach. Then on Thursday we can get into more details and >>>> targeted points of discussion on this topic. >>>> >>>> Thanks, >>>> Amogh Jahagirdar >>>> >>>> On Tue, Feb 17, 2026 at 9:27 PM Amogh Jahagirdar <[email protected]> >>>> wrote: >>>> >>>>> Thanks Steven! I've set up some time next Thursday for the community >>>>> to discuss this. We're also looking at how the content entry would look >>>>> in a combined DV with potential column updates for DV changes, and >>>>> how >>>>> change detection could look in this approach. I should have more to >>>>> share on this by the time of the community discussion next week. >>>>> We should also consider potential root churn and memory consumption >>>>> stemming from expected root entry inflation due to a combined data file + >>>>> DV entry with possible column updates for certain DV workloads; though at >>>>> least for memory consumption of stats being held after planning, that >>>>> arguably is an implementation problem for certain integrations. >>>>> >>>>> Thanks, >>>>> Amogh Jahagirdar >>>>> >>>>> On Fri, Feb 13, 2026 at 10:58 AM Steven Wu <[email protected]> >>>>> wrote: >>>>> >>>>>> I wrote up some analysis with back-of-the-envelope calculations about >>>>>> the column update approach for DV colocation. It mainly concerns the 2nd >>>>>> use case: deleting a large number of rows from a small number of files. 
>>>>>> >>>>>> >>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.gvdulzy486n7 >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Feb 4, 2026 at 1:02 AM Péter Váry < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> I fully agree with Anton and Steven that we need benchmarks before >>>>>>> choosing any direction. >>>>>>> >>>>>>> I ran some preliminary column‑stitching benchmarks last summer: >>>>>>> >>>>>>> - Results are available in the doc: >>>>>>> >>>>>>> https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww >>>>>>> - Code is here: https://github.com/apache/iceberg/pull/13306 >>>>>>> >>>>>>> I’ve summarized the most relevant results at the end of this email. >>>>>>> They show roughly a 10% slowdown on the read path with column stitching >>>>>>> in >>>>>>> similar scenarios when using local SSDs. I expect that in real >>>>>>> deployments >>>>>>> the metadata read cost will mostly be driven by blob I/O (assuming no >>>>>>> caching). If blob access becomes the dominant factor in read latency, >>>>>>> multithreaded fetching should be able to absorb the overhead introduced >>>>>>> by >>>>>>> column stitching, resulting in latency similar to the single‑file layout >>>>>>> (unless IO is already the bottleneck) >>>>>>> >>>>>>> We should definitely rerun the benchmarks once we have a clearer >>>>>>> understanding of the intended usage patterns. 
>>>>>>> Thanks, >>>>>>> Peter >>>>>>> >>>>>>> >>>>>>> The relevant(ish) results are for 100 columns, with 2 families with >>>>>>> 50-50 columns and local read: >>>>>>> >>>>>>> The base is: >>>>>>> MultiThreadedParquetBenchmark.read 100 0 >>>>>>> false ss 20 3.739 ± 0.096 s/op >>>>>>> >>>>>>> The read for single threaded: >>>>>>> MultiThreadedParquetBenchmark.read 100 2 >>>>>>> false ss 20 4.036 ± 0.082 s/op >>>>>>> >>>>>>> The read for multi threaded: >>>>>>> MultiThreadedParquetBenchmark.read 100 2 >>>>>>> true ss 20 4.063 ± 0.080 s/op >>>>>>> >>>>>>> On Tue, Feb 3, 2026 at 23:27, Steven Wu <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> I agree with Anton in this >>>>>>>> <https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o/edit?disco=AAAByzDx21w> >>>>>>>> comment thread that we probably need to run benchmarks for a few common >>>>>>>> scenarios to guide this decision. We need to write down detailed plans >>>>>>>> for >>>>>>>> those scenarios and what we are measuring. Also ideally, we want to >>>>>>>> measure >>>>>>>> using the V4 metadata structure (like Parquet manifest file, column >>>>>>>> stats >>>>>>>> structs, adaptive tree). There are PoC PRs available for column stats, >>>>>>>> Parquet manifest, and root manifest. It would probably be tricky to >>>>>>>> piece >>>>>>>> them together to run the benchmark considering the PoC status. We also >>>>>>>> need >>>>>>>> the column stitching capability on the read path to test the column >>>>>>>> file >>>>>>>> approach. >>>>>>>> >>>>>>>> On Tue, Feb 3, 2026 at 1:53 PM Anoop Johnson <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I'm in favor of co-located DV metadata with column file override >>>>>>>>> and not doing affiliated/unaffiliated delete manifests. This is >>>>>>>>> conceptually similar to strictly affiliated delete manifests with >>>>>>>>> positional joins, and will halve the number of I/Os when there is no >>>>>>>>> DV >>>>>>>>> column override. 
It is simpler to implement >>>>>>>>> and will speed up reads. >>>>>>>>> >>>>>>>>> Unaffiliated DV manifests are flexible for writers. They reduce >>>>>>>>> the chance of physical conflicts when there are concurrent >>>>>>>>> large/random >>>>>>>>> deletes that change DVs on different files in the same manifest. But >>>>>>>>> the >>>>>>>>> flexibility comes at a read-time cost. If the number of unaffiliated >>>>>>>>> DVs >>>>>>>>> exceeds a threshold, it could cause driver OOMs or require a >>>>>>>>> distributed join >>>>>>>>> to pair up DVs with data files. With colocated metadata, manifest DVs >>>>>>>>> can >>>>>>>>> reduce the chance of conflicts up to a certain write size. >>>>>>>>> >>>>>>>>> I assume we will still support unaffiliated manifests for equality >>>>>>>>> deletes, but perhaps we can restrict them to just equality deletes. >>>>>>>>> >>>>>>>>> -Anoop >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Feb 2, 2026 at 4:27 PM Anton Okolnychyi < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> I added the approach with column files to the doc. >>>>>>>>>> >>>>>>>>>> To sum up, separate data and delete manifests with affinity >>>>>>>>>> would perform somewhat on par with co-located DV metadata (a.k.a. >>>>>>>>>> direct >>>>>>>>>> assignment) if we add support for column files when we need to >>>>>>>>>> replace most >>>>>>>>>> or all DVs (use case 1). That said, the support for direct >>>>>>>>>> assignment with >>>>>>>>>> in-line metadata DVs can help us avoid unaffiliated delete manifests >>>>>>>>>> when >>>>>>>>>> we need to replace a few DVs (use case 2). >>>>>>>>>> >>>>>>>>>> So the key question is whether we want to allow >>>>>>>>>> unaffiliated delete manifests with DVs... If we don't, then we would >>>>>>>>>> likely >>>>>>>>>> want to have co-located DV metadata and must support efficient column >>>>>>>>>> updates so as not to regress compared to V2 and V3 for large MERGE jobs >>>>>>>>>> that >>>>>>>>>> modify a small set of records for most files. 
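For context, the "column file override" discussed in the thread means the manifest keeps a DV column, and a later commit can replace just that column by writing a positionally aligned column file that readers stitch back in. A toy sketch of the read-side stitching (all names and structures invented for illustration; real manifests are Parquet files, not Python dicts):

```python
# Illustrative sketch of reading a manifest whose DV column was replaced
# by a column-file overlay. The overlay is positionally aligned: value i
# replaces the dv field of manifest row i. Not actual Iceberg code.

def stitch_dv_column(manifest_rows, dv_overlay):
    """Replace the dv field of each manifest row with the overlay value.

    manifest_rows: list of dicts like {"data_file": ..., "dv": ...}
    dv_overlay: one DV value (or None) per manifest row, same order
    """
    if len(dv_overlay) != len(manifest_rows):
        raise ValueError("overlay must cover every manifest row")
    # Stitch by position: no path-based join is needed at all.
    return [{**row, "dv": dv} for row, dv in zip(manifest_rows, dv_overlay)]

rows = [{"data_file": "a.parquet", "dv": "dv-a-1"},
        {"data_file": "b.parquet", "dv": None}]
print(stitch_dv_column(rows, ["dv-a-2", "dv-b-1"]))
```

The point of the design is that replacing all DVs rewrites only the (small) DV column, not the whole manifest with its stats columns.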
>>>>>>>>>> >>>>>>>>>> On Mon, Feb 2, 2026 at 13:20, Anton Okolnychyi < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Anoop, correct, if we keep data and delete manifests separate, >>>>>>>>>>> there is a better way to combine the entries and we should NOT rely >>>>>>>>>>> on the >>>>>>>>>>> referenced data file path. Reconciling by implicit position will >>>>>>>>>>> reduce the >>>>>>>>>>> size of the DV entry (no need to store the referenced data file >>>>>>>>>>> path) and >>>>>>>>>>> will improve the planning performance (no equals/hashCode on the >>>>>>>>>>> path). >>>>>>>>>>> >>>>>>>>>>> Steven, I agree. Most notes in the doc pre-date discussions we >>>>>>>>>>> had on column updates. You are right, given that we are gravitating >>>>>>>>>>> towards >>>>>>>>>>> a native way to handle column updates, it seems logical to use the >>>>>>>>>>> same >>>>>>>>>>> approach for replacing DVs, since they’re essentially column >>>>>>>>>>> updates. Let >>>>>>>>>>> me add one more approach to the doc based on what Anurag and Peter >>>>>>>>>>> have so >>>>>>>>>>> far. >>>>>>>>>>> >>>>>>>>>>> On Sun, Feb 1, 2026 at 20:59, Steven Wu <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Anton, thanks for raising this. I agree this deserves another >>>>>>>>>>>> look. I added a comment in your doc that we can potentially apply >>>>>>>>>>>> the >>>>>>>>>>>> column update proposal for data file updates to the manifest file >>>>>>>>>>>> updates as >>>>>>>>>>>> well, to colocate the data DV and data manifest files. Data DVs >>>>>>>>>>>> can be a >>>>>>>>>>>> separate column in the data manifest file and updated separately >>>>>>>>>>>> in a >>>>>>>>>>>> column file. This is the same as the coalesced positional join >>>>>>>>>>>> that Anoop >>>>>>>>>>>> mentioned. >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Feb 1, 2026 at 4:14 PM Anoop Johnson <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thank you for raising this, Anton. 
I had a similar observation >>>>>>>>>>>>> while prototyping >>>>>>>>>>>>> <https://github.com/apache/iceberg/pull/14533> the >>>>>>>>>>>>> adaptive metadata tree. The overhead of doing a path-based hash >>>>>>>>>>>>> join of a >>>>>>>>>>>>> data manifest with the affiliated delete manifest is high: my >>>>>>>>>>>>> estimate was >>>>>>>>>>>>> that the join adds about 5-10% overhead. The hash table >>>>>>>>>>>>> build/probe alone >>>>>>>>>>>>> takes about 5 ms for manifests with 25K entries. There are >>>>>>>>>>>>> engines that can >>>>>>>>>>>>> do vectorized hash joins that can lower this, but the overhead and >>>>>>>>>>>>> complexity of a SIMD-friendly hash join is non-trivial. >>>>>>>>>>>>> >>>>>>>>>>>>> An alternative to relying on the external file feature in >>>>>>>>>>>>> Parquet is to make affiliated manifests order-preserving: i.e. DVs >>>>>>>>>>>>> in an >>>>>>>>>>>>> affiliated delete manifest must appear in the same position as the >>>>>>>>>>>>> corresponding data file in the data manifest the delete manifest >>>>>>>>>>>>> is >>>>>>>>>>>>> affiliated to. If a data file does not have a DV, the DV >>>>>>>>>>>>> manifest must >>>>>>>>>>>>> store a NULL. This would allow us to do positional joins, which >>>>>>>>>>>>> are much >>>>>>>>>>>>> faster. If we wanted, we could even have multiple affiliated DV >>>>>>>>>>>>> manifests >>>>>>>>>>>>> for a data manifest and the reader would do a COALESCED >>>>>>>>>>>>> positional join >>>>>>>>>>>>> (i.e. pick the first non-null value as the DV). It puts the >>>>>>>>>>>>> sorting >>>>>>>>>>>>> responsibility on the writers, but it might be a reasonable >>>>>>>>>>>>> tradeoff. >>>>>>>>>>>>> >>>>>>>>>>>>> Also, the options don't necessarily have to be mutually >>>>>>>>>>>>> exclusive. We could still allow affiliated DVs to be "folded" >>>>>>>>>>>>> into the data >>>>>>>>>>>>> manifest (e.g. by background optimization jobs or the writer >>>>>>>>>>>>> itself). 
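The COALESCED positional join Anoop describes can be sketched in a few lines; because every affiliated DV manifest is order-preserving (entry i aligns with data file i, NULL where a file has no DV), the reader pairs DVs with data files by position with no hash table at all. Names are invented for the example:

```python
# Illustrative sketch of a coalesced positional join: given one data
# manifest and several order-preserving affiliated DV manifests, pair
# each data file with the first non-null DV at its position.

from typing import Optional, Sequence

def coalesce_dvs(data_files: Sequence[str],
                 dv_manifests: Sequence[Sequence[Optional[str]]]
                 ) -> list:
    """Pair each data file with its DV (or None), by position only."""
    for dvs in dv_manifests:
        if len(dvs) != len(data_files):
            raise ValueError("affiliated DV manifest must be order-preserving")
    paired = []
    for i, f in enumerate(data_files):
        # COALESCE: first non-null DV across the affiliated manifests wins,
        # so manifests should be ordered newest first.
        dv = next((m[i] for m in dv_manifests if m[i] is not None), None)
        paired.append((f, dv))
    return paired

files = ["a.parquet", "b.parquet", "c.parquet"]
newer = [None, "dv-b-2", None]      # most recent affiliated DV manifest
older = ["dv-a-1", "dv-b-1", None]  # an earlier one
# a keeps dv-a-1, b takes the newer dv-b-2, c has no DV
print(coalesce_dvs(files, [newer, older]))
```

This is why the equals/hashCode cost of a path-based join disappears: resolution is a single indexed scan.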
That >>>>>>>>>>>>> might be the optimal choice for read-heavy tables because it will >>>>>>>>>>>>> halve the >>>>>>>>>>>>> number of I/Os readers have to make. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Anoop >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Jan 30, 2026 at 6:03 PM Anton Okolnychyi < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I had a chance to catch up on some of the V4 discussions. >>>>>>>>>>>>>> Given that we are getting rid of the manifest list and switching >>>>>>>>>>>>>> to >>>>>>>>>>>>>> Parquet, I wanted to re-evaluate the possibility of direct DV >>>>>>>>>>>>>> assignment >>>>>>>>>>>>>> that we discarded in V3 to avoid regressions. I have put >>>>>>>>>>>>>> together my >>>>>>>>>>>>>> thoughts in a doc [1]. >>>>>>>>>>>>>> >>>>>>>>>>>>>> TL;DR: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - I think the current V4 proposal that keeps data and delete >>>>>>>>>>>>>> manifests separate but introduces affinity is a solid choice for >>>>>>>>>>>>>> cases when >>>>>>>>>>>>>> we need to replace DVs in many / most files. I outlined an >>>>>>>>>>>>>> approach with >>>>>>>>>>>>>> column-split Parquet files but it doesn't improve the >>>>>>>>>>>>>> performance and takes a >>>>>>>>>>>>>> dependency on a portion of the Parquet spec that is not really >>>>>>>>>>>>>> implemented. >>>>>>>>>>>>>> - Pushing unaffiliated DVs directly into the root to replace >>>>>>>>>>>>>> a small set of DVs is going to be fast on write but does require >>>>>>>>>>>>>> resolving >>>>>>>>>>>>>> where those DVs apply at read time. Using inline metadata DVs >>>>>>>>>>>>>> with >>>>>>>>>>>>>> column-split Parquet files is a little more promising in this >>>>>>>>>>>>>> case as it >>>>>>>>>>>>>> allows us to avoid unaffiliated DVs. That said, it again relies on >>>>>>>>>>>>>> something >>>>>>>>>>>>>> Parquet doesn't implement right now, requires changing >>>>>>>>>>>>>> maintenance >>>>>>>>>>>>>> operations, and yields minimal benefits. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> All in all, the V4 proposal seems like a strict improvement >>>>>>>>>>>>>> over V3 but I insist that we reconsider the use of the referenced >>>>>>>>>>>>>> data file >>>>>>>>>>>>>> path when resolving DVs to data files. >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] - >>>>>>>>>>>>>> https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o >>>>>>>>>>>>>> >>>>>>>>>>>>>> - Anton >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Nov 22, 2025 at 13:37, Amogh Jahagirdar < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Here is the meeting recording >>>>>>>>>>>>>>> <https://drive.google.com/file/d/1lG9sM-JTwqcIgk7JsAryXXCc1vMnstJs/view?usp=sharing> >>>>>>>>>>>>>>> and generated meeting summary >>>>>>>>>>>>>>> <https://docs.google.com/document/d/1e50p8TXL2e3CnUwKMOvm8F4s2PeVMiKWHPxhxOW1fIM/edit?usp=sharing>. >>>>>>>>>>>>>>> Thanks all for attending yesterday! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Nov 20, 2025 at 8:49 AM Amogh Jahagirdar < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I was out for some time, but set up a sync for tomorrow at >>>>>>>>>>>>>>>> 9am PST. For this discussion, I do think it would be great to >>>>>>>>>>>>>>>> focus on the >>>>>>>>>>>>>>>> manifest DV representation, factoring in analyses on bitmap >>>>>>>>>>>>>>>> representation >>>>>>>>>>>>>>>> storage footprints, and the entry structure considering how we >>>>>>>>>>>>>>>> want to >>>>>>>>>>>>>>>> approach change detection. If there are other topics that >>>>>>>>>>>>>>>> people want to >>>>>>>>>>>>>>>> highlight, please do bring those up as well! 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I also recognize that this is a bit short notice for scheduling, >>>>>>>>>>>>>>>> so please do reach out to me if this time is difficult to work >>>>>>>>>>>>>>>> with; next >>>>>>>>>>>>>>>> week is the Thanksgiving holidays here, and since people would >>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>> travelling/out I figured I'd try to schedule before then. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Oct 17, 2025 at 9:03 AM Amogh Jahagirdar < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sorry for the delay, here's the recording link >>>>>>>>>>>>>>>>> <https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view> >>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>> last week's discussion. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025 at 9:44 AM Péter Váry < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Same here. >>>>>>>>>>>>>>>>>> Please record if you can. >>>>>>>>>>>>>>>>>> Thanks, Peter >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025, 17:39 Fokko Driesprong < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hey Amogh, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for the write-up. Unfortunately, I won’t be able >>>>>>>>>>>>>>>>>>> to attend. Will it be recorded? Thanks! >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>>>>>>> Fokko >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Oct 7, 2025 at 20:36, Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I've set up time this Friday at 9am PST for another sync >>>>>>>>>>>>>>>>>>>> on single file commits. 
In terms of what would be great to >>>>>>>>>>>>>>>>>>>> focus on for the >>>>>>>>>>>>>>>>>>>> discussion: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 1. Whether it makes sense or not to eliminate the >>>>>>>>>>>>>>>>>>>> tuple, and instead represent the tuple via lower/upper >>>>>>>>>>>>>>>>>>>> boundaries. As a >>>>>>>>>>>>>>>>>>>> reminder, one of the goals is to avoid tying a partition >>>>>>>>>>>>>>>>>>>> spec to a >>>>>>>>>>>>>>>>>>>> manifest; in the root we can have a mix of files spanning >>>>>>>>>>>>>>>>>>>> different >>>>>>>>>>>>>>>>>>>> partition specs, and even in leaf manifests avoiding this >>>>>>>>>>>>>>>>>>>> coupling can >>>>>>>>>>>>>>>>>>>> enable more desirable clustering of metadata. >>>>>>>>>>>>>>>>>>>> In the vast majority of cases, we could leverage the >>>>>>>>>>>>>>>>>>>> property that a file is effectively partitioned if the >>>>>>>>>>>>>>>>>>>> lower/upper for a >>>>>>>>>>>>>>>>>>>> given field are equal. The nuance here is with the >>>>>>>>>>>>>>>>>>>> particular case of >>>>>>>>>>>>>>>>>>>> identity partitioned string/binary columns which can be >>>>>>>>>>>>>>>>>>>> truncated in stats. >>>>>>>>>>>>>>>>>>>> One approach is to require that writers must not produce >>>>>>>>>>>>>>>>>>>> truncated stats >>>>>>>>>>>>>>>>>>>> for identity partitioned columns. It's also important to >>>>>>>>>>>>>>>>>>>> keep in mind that >>>>>>>>>>>>>>>>>>>> all of this is just for the purpose of reconstructing the >>>>>>>>>>>>>>>>>>>> partition tuple, >>>>>>>>>>>>>>>>>>>> which is only required during equality delete matching. >>>>>>>>>>>>>>>>>>>> Another area we >>>>>>>>>>>>>>>>>>>> need to cover as part of this is exact bounds on stats. 
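The lower/upper-bound rule just described can be made concrete with a small sketch (a hypothetical helper, not proposed spec text): a reader recovers an identity-partition value from column stats only when the bounds are equal and untruncated.

```python
# Illustrative sketch of reconstructing an identity-partition value from
# column stats. A file is effectively partitioned on a field when its
# lower and upper bounds are equal; for string/binary this is only safe
# when the stats are not truncated.

def identity_partition_value(lower, upper, truncated=False):
    """Return the partition value if recoverable from stats, else None."""
    if truncated:
        # Truncated string/binary bounds can be equal without the column
        # being single-valued, so reconstruction is unsafe.
        return None
    if lower == upper:
        return lower  # every row in the file shares this value
    return None  # file spans multiple values; not identity-partitioned

print(identity_partition_value("2026-03-01", "2026-03-01"))  # recoverable
print(identity_partition_value("abc", "abd"))                # not single-valued
```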
>>>>>>>>>>>>>>>>>>>> There are other >>>>>>>>>>>>>>>>>>>> options here as well such as making all new equality >>>>>>>>>>>>>>>>>>>> deletes in V4 be >>>>>>>>>>>>>>>>>>>> global and instead match based on bounds, or keeping the >>>>>>>>>>>>>>>>>>>> tuple but each >>>>>>>>>>>>>>>>>>>> tuple is effectively based off a union schema of all >>>>>>>>>>>>>>>>>>>> partition specs. I am >>>>>>>>>>>>>>>>>>>> adding a separate appendix section outlining the span of >>>>>>>>>>>>>>>>>>>> options here and >>>>>>>>>>>>>>>>>>>> the different tradeoffs. >>>>>>>>>>>>>>>>>>>> Once we get this more to a conclusive state, I'll move >>>>>>>>>>>>>>>>>>>> a summarized version to the main doc. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2. @[email protected] <[email protected]> has >>>>>>>>>>>>>>>>>>>> updated the doc with a section >>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn> >>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>> how we can do change detection from the root in a variety >>>>>>>>>>>>>>>>>>>> of write >>>>>>>>>>>>>>>>>>>> scenarios. I've done a review on it, and it covers the >>>>>>>>>>>>>>>>>>>> cases I would >>>>>>>>>>>>>>>>>>>> expect. It'd be good for folks to take a look and please >>>>>>>>>>>>>>>>>>>> give feedback >>>>>>>>>>>>>>>>>>>> before we discuss. Thank you Steven for adding that >>>>>>>>>>>>>>>>>>>> section and all the >>>>>>>>>>>>>>>>>>>> diagrams. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hey folks just following up from the discussion last >>>>>>>>>>>>>>>>>>>>> Friday with a summary and some next steps: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 1.) 
For the various change detection cases, we >>>>>>>>>>>>>>>>>>>>> concluded it's best just to go through those in an >>>>>>>>>>>>>>>>>>>>> offline manner on the >>>>>>>>>>>>>>>>>>>>> doc since it's hard to verify all that correctness in a >>>>>>>>>>>>>>>>>>>>> large meeting >>>>>>>>>>>>>>>>>>>>> setting. >>>>>>>>>>>>>>>>>>>>> 2.) We mostly discussed eliminating the >>>>>>>>>>>>>>>>>>>>> partition tuple. In the original proposal, I was mostly >>>>>>>>>>>>>>>>>>>>> aiming for the >>>>>>>>>>>>>>>>>>>>> ability to reconstruct the tuple from the stats for >>>>>>>>>>>>>>>>>>>>> the purpose of >>>>>>>>>>>>>>>>>>>>> equality delete matching (a file is partitioned if the >>>>>>>>>>>>>>>>>>>>> lower and upper >>>>>>>>>>>>>>>>>>>>> bounds are equal). There's some nuance in how we need to >>>>>>>>>>>>>>>>>>>>> handle identity >>>>>>>>>>>>>>>>>>>>> partition values since for string/binary they cannot be >>>>>>>>>>>>>>>>>>>>> truncated. >>>>>>>>>>>>>>>>>>>>> Another potential option is to treat all equality deletes >>>>>>>>>>>>>>>>>>>>> as effectively >>>>>>>>>>>>>>>>>>>>> global and narrow their application based on the stats >>>>>>>>>>>>>>>>>>>>> values. This may >>>>>>>>>>>>>>>>>>>>> require defining tight bounds. I'm still collecting my >>>>>>>>>>>>>>>>>>>>> thoughts on this one. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks folks! Please also let me know if any of the >>>>>>>>>>>>>>>>>>>>> following links are inaccessible for any reason. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Meeting recording link: >>>>>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Meeting summary: >>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Update: I moved the discussion time to this Friday at >>>>>>>>>>>>>>>>>>>>>> 9 am PST since I found out that quite a few folks >>>>>>>>>>>>>>>>>>>>>> involved in the proposals >>>>>>>>>>>>>>>>>>>>>> will be out next week, and I also know some folks will >>>>>>>>>>>>>>>>>>>>>> also be out the week >>>>>>>>>>>>>>>>>>>>>> after that. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>> Amogh J >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hey folks sorry for the late follow up here, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks @Kevin Liu <[email protected]> for >>>>>>>>>>>>>>>>>>>>>>> sharing the recording link of the previous discussion! >>>>>>>>>>>>>>>>>>>>>>> I've set up another >>>>>>>>>>>>>>>>>>>>>>> sync for next Tuesday 09/16 at 9am PST. This time I've >>>>>>>>>>>>>>>>>>>>>>> set it up from my >>>>>>>>>>>>>>>>>>>>>>> corporate email so we can get recordings and >>>>>>>>>>>>>>>>>>>>>>> transcriptions (and I've made >>>>>>>>>>>>>>>>>>>>>>> sure to keep the meeting invite open so we don't have >>>>>>>>>>>>>>>>>>>>>>> to manually let >>>>>>>>>>>>>>>>>>>>>>> people in). >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> In terms of next steps of areas which I think would >>>>>>>>>>>>>>>>>>>>>>> be good to focus on for establishing consensus: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 1. 
How do we model the manifest entry structure >>>>>>>>>>>>>>>>>>>>>>> so that changes to manifest DVs can be obtained easily >>>>>>>>>>>>>>>>>>>>>>> from the root? There >>>>>>>>>>>>>>>>>>>>>>> are a few options here; the most promising approach is >>>>>>>>>>>>>>>>>>>>>>> to keep an >>>>>>>>>>>>>>>>>>>>>>> additional DV that encodes the diff, i.e. the additional >>>>>>>>>>>>>>>>>>>>>>> positions that have >>>>>>>>>>>>>>>>>>>>>>> been removed from a leaf manifest. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 2. Modeling partition transforms via expressions and >>>>>>>>>>>>>>>>>>>>>>> establishing a unified table ID space so that we can >>>>>>>>>>>>>>>>>>>>>>> simplify how partition >>>>>>>>>>>>>>>>>>>>>>> tuples may be represented via stats and also have a way >>>>>>>>>>>>>>>>>>>>>>> in the future to >>>>>>>>>>>>>>>>>>>>>>> store stats on any derived column. I have a short >>>>>>>>>>>>>>>>>>>>>>> proposal >>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0> >>>>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> this that probably still needs some tightening up on >>>>>>>>>>>>>>>>>>>>>>> the expression >>>>>>>>>>>>>>>>>>>>>>> modeling itself (and some prototyping) but the general >>>>>>>>>>>>>>>>>>>>>>> idea for >>>>>>>>>>>>>>>>>>>>>>> establishing a unified table ID space is covered. All >>>>>>>>>>>>>>>>>>>>>>> feedback welcome! >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks Amogh. Looks like the recording for last >>>>>>>>>>>>>>>>>>>>>>>> week's sync is available on Youtube. 
Here's the link, >>>>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=uWm-p--8oVQ >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>> Kevin Liu >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Just following up on this to give the community an >>>>>>>>>>>>>>>>>>>>>>>>> update on where we're at and my proposed next steps. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> I've been editing and merging the contents from >>>>>>>>>>>>>>>>>>>>>>>>> our proposal into the proposal >>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw> >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> Russell and others. For any future comments on docs, >>>>>>>>>>>>>>>>>>>>>>>>> please comment on the >>>>>>>>>>>>>>>>>>>>>>>>> linked proposal. I've also marked it on our doc in >>>>>>>>>>>>>>>>>>>>>>>>> red text so it's clear >>>>>>>>>>>>>>>>>>>>>>>>> to redirect to the other proposal as a source of >>>>>>>>>>>>>>>>>>>>>>>>> truth for comments. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> In terms of next steps, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 1. An important design decision point is around >>>>>>>>>>>>>>>>>>>>>>>>> inline manifest DVs, external manifest DVs or >>>>>>>>>>>>>>>>>>>>>>>>> enabling both. 
I'm working on >>>>>>>>>>>>>>>>>>>>>>>>> measuring different approaches for representing the >>>>>>>>>>>>>>>>>>>>>>>>> compressed DV >>>>>>>>>>>>>>>>>>>>>>>>> representation since that will inform how many >>>>>>>>>>>>>>>>>>>>>>>>> entries can reasonably fit >>>>>>>>>>>>>>>>>>>>>>>>> in a small root manifest; from that we can derive >>>>>>>>>>>>>>>>>>>>>>>>> implications on different >>>>>>>>>>>>>>>>>>>>>>>>> write patterns and determine the right approach for >>>>>>>>>>>>>>>>>>>>>>>>> storing these manifest >>>>>>>>>>>>>>>>>>>>>>>>> DVs. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 2. Another key point is around determining if/how >>>>>>>>>>>>>>>>>>>>>>>>> we can reasonably enable V4 to represent changes in >>>>>>>>>>>>>>>>>>>>>>>>> the root manifest so >>>>>>>>>>>>>>>>>>>>>>>>> that readers can effectively just infer file level >>>>>>>>>>>>>>>>>>>>>>>>> changes from the root. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 3. One of the aspects of the proposal is getting >>>>>>>>>>>>>>>>>>>>>>>>> away from partition tuple requirement in the root >>>>>>>>>>>>>>>>>>>>>>>>> which currently holds us >>>>>>>>>>>>>>>>>>>>>>>>> to have associativity between a partition spec and a >>>>>>>>>>>>>>>>>>>>>>>>> manifest. These >>>>>>>>>>>>>>>>>>>>>>>>> aspects can be modeled as essentially column stats >>>>>>>>>>>>>>>>>>>>>>>>> which gives a lot of >>>>>>>>>>>>>>>>>>>>>>>>> flexibility into the organization of the manifest. >>>>>>>>>>>>>>>>>>>>>>>>> There are important >>>>>>>>>>>>>>>>>>>>>>>>> details around field ID spaces here which tie into >>>>>>>>>>>>>>>>>>>>>>>>> how the stats are >>>>>>>>>>>>>>>>>>>>>>>>> structured. What we're proposing here is to have a >>>>>>>>>>>>>>>>>>>>>>>>> unified expression ID >>>>>>>>>>>>>>>>>>>>>>>>> space that could also benefit us for storing things >>>>>>>>>>>>>>>>>>>>>>>>> like virtual columns >>>>>>>>>>>>>>>>>>>>>>>>> down the line. 
>>>>> I go into this in the proposal, but I'm working on separating out the appropriate parts so that the original proposal can mostly just focus on the organization of the content metadata tree and not on how we want to solve this particular ID space problem.
>>>>>
>>>>> 4. I'm planning on scheduling a recurring community sync starting next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that this time will never work, I can certainly adjust. For some reason, I don't have the ability to add to the Iceberg Dev calendar, so I'll figure that out and update the thread when the event is scheduled.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <[email protected]> wrote:
>>>>>
>>>>>> I think this is a great way forward; starting out with this much parallel development shows that we have a lot of consensus already :)
>>>>>>
>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>
>>>>>>> Hey folks, just following up on this.
>>>>>>> It looks like our proposal and the proposal that @Russell Spitzer <[email protected]> shared are pretty aligned. I was just chatting with Russell about this, and we think it'd be best to combine both proposals and have a singular large effort on this. I can also set up a focused community discussion (similar to what we're doing on the other V4 proposals) on this starting sometime next week just to get things moving, if that works for people.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Amogh Jahagirdar
>>>>>>>
>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Russell,
>>>>>>>>
>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop, and I) have also been working on a proposal for an adaptive metadata tree structure as part of enabling more efficient one-file commits. From a read of the summary, it's great to see that we're thinking along the same lines about how to tackle this fundamental area!
>>>>>>>> Here is our proposal: https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey y'all!
>>>>>>>>>
>>>>>>>>> We (Yi Fang, Steven Wu, and myself) wanted to share some of the thoughts we had on how one-file commits could work in Iceberg. This is pretty much just a high-level overview of the concepts we think we need and how Iceberg would behave. We haven't gone very far into the actual implementation and changes that would need to occur in the SDK to make this happen.
>>>>>>>>> The high-level summary is:
>>>>>>>>>
>>>>>>>>> - Manifest lists are out
>>>>>>>>> - Root manifests take their place
>>>>>>>>> - A root manifest can have data manifests, delete manifests, manifest delete vectors, data delete vectors, and data files
>>>>>>>>> - Manifest delete vectors allow for modifying a manifest without deleting it entirely
>>>>>>>>> - Data files let you append without writing an intermediary manifest
>>>>>>>>> - Having child data and delete manifests lets you still scale
>>>>>>>>>
>>>>>>>>> Please take a look if you like:
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>>>>>>>
>>>>>>>>> I'm excited to see what other proposals and ideas are floating around the community,
>>>>>>>>> Russ
>>>>>>>>>
>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Very excited about the idea!
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm very interested in this initiative.
>>>>>>>>>>> Micah Kornfield and I presented <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on high-throughput ingestion for Iceberg tables at the 2024 Iceberg Summit, which leveraged Google infrastructure like Colossus for efficient appends.
>>>>>>>>>>>
>>>>>>>>>>> This new proposal is particularly exciting because it offers significant advancements in commit latency and metadata storage footprint. Furthermore, a consistent manifest structure promises to simplify the design and codebase, which is a major benefit.
>>>>>>>>>>>
>>>>>>>>>>> A related idea I've been exploring is having a loose affinity between data and delete manifests. While the current separation of data and delete manifests in Iceberg is valuable for avoiding data file rewrites (and stats updates) when deletes change, it does necessitate a join operation during reads. I'd be keen to discuss approaches that could potentially reduce this read-side cost while retaining the benefits of separate manifests.
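[Editor's note: a minimal sketch of the read-side join Anoop describes, with entirely hypothetical names rather than Iceberg SDK code. Planning a scan means pairing each data file with the delete files that may apply to it, joined here on the partition key.]

```python
# Illustrative only: why separate data and delete manifests imply a
# read-side join. All names are hypothetical, not Iceberg APIs.
from collections import defaultdict

def plan_scan(data_entries, delete_entries):
    """Pair each data file with applicable delete files, joined on partition.

    data_entries:   [(partition, data_file_path), ...]
    delete_entries: [(partition, delete_file_path, sequence_number), ...]
    """
    # Index delete files by partition (this is the join Anoop mentions).
    deletes_by_partition = defaultdict(list)
    for partition, delete_path, seq in delete_entries:
        deletes_by_partition[partition].append((delete_path, seq))

    # Each scan task carries a data file plus the delete files that may apply.
    tasks = []
    for partition, data_path in data_entries:
        tasks.append((data_path, [p for p, _ in deletes_by_partition[partition]]))
    return tasks

tasks = plan_scan(
    data_entries=[("p=1", "a.parquet"), ("p=2", "b.parquet")],
    delete_entries=[("p=1", "d1.puffin", 7)],
)
# a.parquet is paired with d1.puffin; b.parquet has no applicable deletes.
```

Colocating a DV reference directly on each data file row (as discussed earlier in the thread) would make this pairing a lookup rather than a join, which is one way the read-side cost could come down.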
>>>>>>>>>>> Best,
>>>>>>>>>>> Anoop
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I am new to the Iceberg community but would love to participate in these discussions to reduce the number of file writes, especially for small writes/commits.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>> -Jagdeep
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We have been hitting all the metadata problems you mentioned, Ryan. I'm on board to help however I can to improve this area.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~ Anurag Mantripragada
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am interested in this idea and looking forward to collaboration.
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Huang-Hsiang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am interested in contributing to this effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Namratha
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for kicking this thread off, Ryan. I'm interested in helping out here! I've been working on a proposal in this area, and it would be great to collaborate with different folks and exchange ideas here, since I think a lot of people are interested in solving this problem.
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Like Russell's recent note, I'm starting a thread to connect those of us interested in the idea of changing Iceberg's metadata in v4 so that in most cases committing a change only requires writing one additional metadata file.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing at least one manifest and a new manifest list to produce a new snapshot. The goal of this work is to allow more flexibility by allowing the manifest list layer to store data and delete files. As a result, only one file write would be needed before committing the new snapshot.
>>>>>>>>>>>>>>>>> In addition, this work will also try to explore:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Avoiding small manifests that must be read in parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>> - Extending metadata skipping to use aggregated column ranges that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>>>>>>>> - Using soft deletes to avoid rewriting existing manifests (metadata DVs)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you're interested in these problems, please reply!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> John Zhuge
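[Editor's note: a rough illustration of the one-file commit idea discussed in this thread, under the assumption that a root manifest may directly hold data manifests, delete manifests, data files, and delete vector references. All names here are hypothetical, not the proposed V4 spec.]

```python
# Illustrative sketch: with a root manifest that can inline data files,
# a small append writes exactly one new metadata file (the new root).
from dataclasses import dataclass, field

@dataclass
class RootEntry:
    # "data-manifest" | "delete-manifest" | "data-file" | "manifest-dv" | "data-dv"
    kind: str
    path: str

@dataclass
class RootManifest:
    entries: list = field(default_factory=list)

def append_commit(previous: RootManifest, new_data_files: list) -> RootManifest:
    """Carry existing entries forward and inline the new data files in the
    root, so no intermediary child manifest or manifest list is written."""
    root = RootManifest(entries=list(previous.entries))
    for path in new_data_files:
        root.entries.append(RootEntry(kind="data-file", path=path))
    return root

prev = RootManifest(entries=[RootEntry("data-manifest", "m1.avro")])
new_root = append_commit(prev, ["f1.parquet"])
# new_root keeps the old data manifest and adds the inlined data file:
# one file write before the snapshot commit.
```

Later maintenance could still fold inlined data files down into child data manifests to keep the root small, which is how the "child data and delete manifests lets you still scale" point from Russell's summary would come into play.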
