Hey, a few folks reached out indicating that I didn't properly share the last v4 metadata tree meeting recording. So sorry about that! Here's the link <https://drive.google.com/file/d/1LhDL0Iy8YR4RN_W3D8APOUtkSBYk61fD/view?usp=drive_link>; do let me know if there are still issues.
On Tue, Mar 3, 2026 at 9:17 AM Steven Wu <[email protected]> wrote: > My takeaway from the conversation is also that we don't need row-level > column updates. Manifest DV can be used for row-level updates instead. > Basically, a file (manifest or data) can be updated via (1) delete vector + > updated rows in a new file (2) column file overlay. Depending on the > percentage of modified rows, engines can choose which way to go. > > On Tue, Mar 3, 2026 at 6:24 AM Gábor Kaszab <[email protected]> > wrote: > >> Thanks for the summary, Micah! I tried to watch the recording linked to >> the calendar event, but apparently I don't have permission to do so. Not >> sure about others. >> >> So if I'm not mistaken, one way to reduce the write cost of an UPDATE for >> colocated DVs is to use the column updates. As I see it, there was some >> agreement that row-level partial column updates aren't desired, and we aim >> for at least file-level column updates. This is very useful information for >> the other conversation >> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny> going >> on for the column update proposal. We can bring this up on the column >> update sync tomorrow, but I'm wondering if the consensus on avoiding >> row-level column updates is something we can incorporate into the column >> update proposal too or if it's something still up for debate. >> >> Best Regards, >> Gabor >> >> On Wed, Feb 25, 2026 at 22:30, Micah Kornfield <[email protected]> >> wrote: >> >>> Just wanted to summarize my main takeaways of Monday's sync. >>> >>> The approach will always collocate DVs with the data files (i.e. every >>> data file row in a manifest has an optional DV reference). This implies >>> that there is not a separate "Deletion manifest". Rather, in V4 all >>> manifests are "combined" where data files and DVs are colocated. >>> >>> Write amplification is avoided in two ways: >>> 1. 
For small updates we will need to carry through metadata statistics >>> (and other relevant data file fields) in memory (rescanning these is likely >>> too expensive). Once updates are available they will be written out to a >>> new manifest (either root or leaf) and use metadata DVs to remove the old >>> rows. >>> 2. For larger updates we will only carry through the DV update parts in >>> memory and use column-level updates to replace existing DVs (this would >>> require rescanning the DV columns for any updated manifest to merge with >>> the updated DVs in memory, and then writing out the column update). The >>> consensus on the call was that we didn't want to support partial column >>> updates (a.k.a. merge-on-read column updates). >>> >>> The idea is that engines would decide which path to follow based on the >>> number of affected files. >>> >>> To help understand the implications of the new proposal, I put together >>> a quick spreadsheet [1] to analyze trade-offs between separate deletion >>> manifests and the new approach under scenarios 1 and 2. This represents the >>> worst-case scenario where file updates are uniformly distributed across a >>> single update operation. It does not account for repeated writes (e.g. >>> on-going compaction). My main takeaway is that keeping at most 1 >>> affiliated DV separate might still help (akin to a merge-on-read column >>> update), but maybe not enough relative to other parts of the system (e.g. >>> the churn on data files) to justify the complexity. >>> >>> Hope this is helpful. >>> >>> Micah >>> >>> [1] >>> https://docs.google.com/spreadsheets/d/1klZQxV7ST2C-p9LTMmai_5rtFiyupj6jSLRPRkdI-u8/edit?gid=0#gid=0 >>> >>> >>> >>> On Thu, Feb 19, 2026 at 3:52 PM Amogh Jahagirdar <[email protected]> >>> wrote: >>> >>>> Hey folks, I've set up an additional initial discussion on DVs for >>>> Monday. This topic is fairly complex and there is also now a free calendar >>>> slot. 
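The two write paths in Micah's summary amount to a simple cost-based switch in the engine. As a rough illustration only (the function name and the 10% threshold are invented here, not part of the V4 proposal), the decision might look like:

```python
# Illustrative sketch (not Iceberg code): an engine choosing between the
# two DV update paths described in the thread. The 10% threshold is a
# made-up example; a real engine would tune this from benchmarks.

def choose_dv_update_path(num_updated_files: int,
                          num_files_in_manifest: int,
                          threshold: float = 0.10) -> str:
    """Return which write path to use for a manifest whose DVs changed.

    "metadata-dv": carry the affected entries' stats through in memory,
    write them to a new manifest, and mark the old rows deleted with a
    metadata DV. Cheap when few files are touched.

    "column-update": merge the in-memory DV changes with the existing DV
    column and write a column-level update that replaces the DV column
    wholesale. Cheaper when most files are touched.
    """
    if num_files_in_manifest == 0:
        raise ValueError("empty manifest")
    fraction = num_updated_files / num_files_in_manifest
    return "metadata-dv" if fraction < threshold else "column-update"

print(choose_dv_update_path(3, 1000))    # small update touching few files
print(choose_dv_update_path(800, 1000))  # large update touching most files
```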
I think it'd be helpful for us to first make sure we're all on the >>>> same page in terms of what the approach proposed by Anton earlier in the >>>> thread means and the high-level mechanics. I should also have more to share >>>> on the doc about how the entry structure and change detection could look >>>> in this approach. Then on Thursday we can get into more details and >>>> targeted points of discussion on this topic. >>>> >>>> Thanks, >>>> Amogh Jahagirdar >>>> >>>> On Tue, Feb 17, 2026 at 9:27 PM Amogh Jahagirdar <[email protected]> >>>> wrote: >>>> >>>>> Thanks Steven! I've set up some time next Thursday for the community >>>>> to discuss this. We're also looking at how the content entry would look >>>>> in a combined DV with potential column updates for DV changes, and >>>>> how >>>>> change detection could look in this approach. I should have more to >>>>> share on this by the time of the community discussion next week. >>>>> We should also consider potential root churn and memory consumption >>>>> stemming from expected root entry inflation due to a combined data file + >>>>> DV entry with possible column updates for certain DV workloads; though at >>>>> least for memory consumption of stats being held after planning, that >>>>> arguably is an implementation problem for certain integrations. >>>>> >>>>> Thanks, >>>>> Amogh Jahagirdar >>>>> >>>>> On Fri, Feb 13, 2026 at 10:58 AM Steven Wu <[email protected]> >>>>> wrote: >>>>> >>>>>> I wrote up some analysis with back-of-the-envelope calculations about >>>>>> the column update approach for DV colocation. It mainly concerns the 2nd >>>>>> use case: deleting a large number of rows from a small number of files. 
>>>>>> >>>>>> >>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.gvdulzy486n7 >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Feb 4, 2026 at 1:02 AM Péter Váry < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> I fully agree with Anton and Steven that we need benchmarks before >>>>>>> choosing any direction. >>>>>>> >>>>>>> I ran some preliminary column‑stitching benchmarks last summer: >>>>>>> >>>>>>> - Results are available in the doc: >>>>>>> >>>>>>> https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww >>>>>>> - Code is here: https://github.com/apache/iceberg/pull/13306 >>>>>>> >>>>>>> I’ve summarized the most relevant results at the end of this email. >>>>>>> They show roughly a 10% slowdown on the read path with column stitching >>>>>>> in >>>>>>> similar scenarios when using local SSDs. I expect that in real >>>>>>> deployments >>>>>>> the metadata read cost will mostly be driven by blob I/O (assuming no >>>>>>> caching). If blob access becomes the dominant factor in read latency, >>>>>>> multithreaded fetching should be able to absorb the overhead introduced >>>>>>> by >>>>>>> column stitching, resulting in latency similar to the single‑file layout >>>>>>> (unless IO is already the bottleneck) >>>>>>> >>>>>>> We should definitely rerun the benchmarks once we have a clearer >>>>>>> understanding of the intended usage patterns. 
>>>>>>> Thanks, >>>>>>> Peter >>>>>>> >>>>>>> >>>>>>> The relevant(ish) results are for 100 columns, with 2 families with >>>>>>> 50-50 columns and local read: >>>>>>> >>>>>>> The base is: >>>>>>> MultiThreadedParquetBenchmark.read 100 0 >>>>>>> false ss 20 3.739 ± 0.096 s/op >>>>>>> >>>>>>> The read for single threaded: >>>>>>> MultiThreadedParquetBenchmark.read 100 2 >>>>>>> false ss 20 4.036 ± 0.082 s/op >>>>>>> >>>>>>> The read for multi threaded: >>>>>>> MultiThreadedParquetBenchmark.read 100 2 >>>>>>> true ss 20 4.063 ± 0.080 s/op >>>>>>> >>>>>>> On Tue, Feb 3, 2026 at 23:27, Steven Wu <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> I agree with Anton in this >>>>>>>> <https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o/edit?disco=AAAByzDx21w> >>>>>>>> comment thread that we probably need to run benchmarks for a few common >>>>>>>> scenarios to guide this decision. We need to write down detailed plans >>>>>>>> for >>>>>>>> those scenarios and what we are measuring. Also ideally, we want to >>>>>>>> measure >>>>>>>> using the V4 metadata structure (like Parquet manifest file, column >>>>>>>> stats >>>>>>>> structs, adaptive tree). There are PoC PRs available for column stats, >>>>>>>> Parquet manifest, and root manifest. It would probably be tricky to >>>>>>>> piece >>>>>>>> them together to run the benchmark considering the PoC status. We also >>>>>>>> need >>>>>>>> the column stitching capability on the read path to test the column >>>>>>>> file >>>>>>>> approach. >>>>>>>> >>>>>>>> On Tue, Feb 3, 2026 at 1:53 PM Anoop Johnson <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I'm in favor of co-located DV metadata with column file override >>>>>>>>> and not doing affiliated/unaffiliated delete manifests. This is >>>>>>>>> conceptually similar to strictly affiliated delete manifests with >>>>>>>>> positional joins, and will halve the number of I/Os when there is no >>>>>>>>> DV >>>>>>>>> column override. 
It is simpler to implement >>>>>>>>> and will speed up reads. >>>>>>>>> >>>>>>>>> Unaffiliated DV manifests are flexible for writers. They reduce >>>>>>>>> the chance of physical conflicts when there are concurrent >>>>>>>>> large/random >>>>>>>>> deletes that change DVs on different files in the same manifest. But >>>>>>>>> the >>>>>>>>> flexibility comes at a read-time cost. If the number of unaffiliated >>>>>>>>> DVs >>>>>>>>> exceeds a threshold, it could cause driver OOMs or require a >>>>>>>>> distributed join >>>>>>>>> to pair up DVs with data files. With colocated metadata, manifest DVs >>>>>>>>> can >>>>>>>>> reduce the chance of conflicts up to a certain write size. >>>>>>>>> >>>>>>>>> I assume we will still support unaffiliated manifests for equality >>>>>>>>> deletes, but perhaps we can restrict them to just equality deletes. >>>>>>>>> >>>>>>>>> -Anoop >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Feb 2, 2026 at 4:27 PM Anton Okolnychyi < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> I added the approach with column files to the doc. >>>>>>>>>> >>>>>>>>>> To sum up, separate data and delete manifests with affinity >>>>>>>>>> would perform somewhat on par with co-located DV metadata (a.k.a. >>>>>>>>>> direct >>>>>>>>>> assignment) if we add support for column files when we need to >>>>>>>>>> replace most >>>>>>>>>> or all DVs (use case 1). That said, the support for direct >>>>>>>>>> assignment with >>>>>>>>>> in-line metadata DVs can help us avoid unaffiliated delete manifests >>>>>>>>>> when >>>>>>>>>> we need to replace a few DVs (use case 2). >>>>>>>>>> >>>>>>>>>> So the key question is whether we want to allow >>>>>>>>>> unaffiliated delete manifests with DVs... If we don't, then we would >>>>>>>>>> likely >>>>>>>>>> want to have co-located DV metadata and must support efficient column >>>>>>>>>> updates so as not to regress compared to V2 and V3 for large MERGE jobs >>>>>>>>>> that >>>>>>>>>> modify a small set of records for most files. 
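For context, the "column file override" discussed in the thread means the manifest keeps a DV column, and a later commit can replace just that column by writing a positionally aligned column file that readers stitch back in. A toy sketch of the read-side stitching (all names and structures invented for illustration; real manifests are Parquet files, not Python dicts):

```python
# Illustrative sketch of reading a manifest whose DV column was replaced
# by a column-file overlay. The overlay is positionally aligned: value i
# replaces the dv field of manifest row i. Not actual Iceberg code.

def stitch_dv_column(manifest_rows, dv_overlay):
    """Replace the dv field of each manifest row with the overlay value.

    manifest_rows: list of dicts like {"data_file": ..., "dv": ...}
    dv_overlay: one DV value (or None) per manifest row, same order
    """
    if len(dv_overlay) != len(manifest_rows):
        raise ValueError("overlay must cover every manifest row")
    # Stitch by position: no path-based join is needed at all.
    return [{**row, "dv": dv} for row, dv in zip(manifest_rows, dv_overlay)]

rows = [{"data_file": "a.parquet", "dv": "dv-a-1"},
        {"data_file": "b.parquet", "dv": None}]
print(stitch_dv_column(rows, ["dv-a-2", "dv-b-1"]))
```

The point of the design is that replacing all DVs rewrites only the (small) DV column, not the whole manifest with its stats columns.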
>>>>>>>>>> >>>>>>>>>> On Mon, Feb 2, 2026 at 13:20, Anton Okolnychyi < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Anoop, correct, if we keep data and delete manifests separate, >>>>>>>>>>> there is a better way to combine the entries and we should NOT rely >>>>>>>>>>> on the >>>>>>>>>>> referenced data file path. Reconciling by implicit position will >>>>>>>>>>> reduce the >>>>>>>>>>> size of the DV entry (no need to store the referenced data file >>>>>>>>>>> path) and >>>>>>>>>>> will improve the planning performance (no equals/hashCode on the >>>>>>>>>>> path). >>>>>>>>>>> >>>>>>>>>>> Steven, I agree. Most notes in the doc pre-date discussions we >>>>>>>>>>> had on column updates. You are right, given that we are gravitating >>>>>>>>>>> towards >>>>>>>>>>> a native way to handle column updates, it seems logical to use the >>>>>>>>>>> same >>>>>>>>>>> approach for replacing DVs, since they’re essentially column >>>>>>>>>>> updates. Let >>>>>>>>>>> me add one more approach to the doc based on what Anurag and Peter >>>>>>>>>>> have so >>>>>>>>>>> far. >>>>>>>>>>> >>>>>>>>>>> On Sun, Feb 1, 2026 at 20:59, Steven Wu <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Anton, thanks for raising this. I agree this deserves another >>>>>>>>>>>> look. I added a comment in your doc that we can potentially apply >>>>>>>>>>>> the >>>>>>>>>>>> column update proposal for data file updates to the manifest file >>>>>>>>>>>> updates as >>>>>>>>>>>> well, to colocate the data DV and data manifest files. Data DVs >>>>>>>>>>>> can be a >>>>>>>>>>>> separate column in the data manifest file and updated separately >>>>>>>>>>>> in a >>>>>>>>>>>> column file. This is the same as the coalesced positional join >>>>>>>>>>>> that Anoop >>>>>>>>>>>> mentioned. >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Feb 1, 2026 at 4:14 PM Anoop Johnson <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thank you for raising this, Anton. 
I had a similar observation >>>>>>>>>>>>> while prototyping >>>>>>>>>>>>> <https://github.com/apache/iceberg/pull/14533> the >>>>>>>>>>>>> adaptive metadata tree. The overhead of doing a path-based hash >>>>>>>>>>>>> join of a >>>>>>>>>>>>> data manifest with the affiliated delete manifest is high: my >>>>>>>>>>>>> estimate was >>>>>>>>>>>>> that the join adds about 5-10% overhead. The hash table >>>>>>>>>>>>> build/probe alone >>>>>>>>>>>>> takes about 5 ms for manifests with 25K entries. There are >>>>>>>>>>>>> engines that can >>>>>>>>>>>>> do vectorized hash joins that can lower this, but the overhead and >>>>>>>>>>>>> complexity of a SIMD-friendly hash join is non-trivial. >>>>>>>>>>>>> >>>>>>>>>>>>> An alternative to relying on the external file feature in >>>>>>>>>>>>> Parquet is to make affiliated manifests order-preserving: i.e. DVs >>>>>>>>>>>>> in an >>>>>>>>>>>>> affiliated delete manifest must appear in the same position as the >>>>>>>>>>>>> corresponding data file in the data manifest the delete manifest >>>>>>>>>>>>> is >>>>>>>>>>>>> affiliated to. If a data file does not have a DV, the DV >>>>>>>>>>>>> manifest must >>>>>>>>>>>>> store a NULL. This would allow us to do positional joins, which >>>>>>>>>>>>> are much >>>>>>>>>>>>> faster. If we wanted, we could even have multiple affiliated DV >>>>>>>>>>>>> manifests >>>>>>>>>>>>> for a data manifest and the reader would do a COALESCED >>>>>>>>>>>>> positional join >>>>>>>>>>>>> (i.e. pick the first non-null value as the DV). It puts the >>>>>>>>>>>>> sorting >>>>>>>>>>>>> responsibility on the writers, but it might be a reasonable >>>>>>>>>>>>> tradeoff. >>>>>>>>>>>>> >>>>>>>>>>>>> Also, the options don't necessarily have to be mutually >>>>>>>>>>>>> exclusive. We could still allow affiliated DVs to be "folded" >>>>>>>>>>>>> into the data >>>>>>>>>>>>> manifest (e.g. by background optimization jobs or the writer >>>>>>>>>>>>> itself). 
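The COALESCED positional join Anoop describes can be sketched in a few lines; because every affiliated DV manifest is order-preserving (entry i aligns with data file i, NULL where a file has no DV), the reader pairs DVs with data files by position with no hash table at all. Names are invented for the example:

```python
# Illustrative sketch of a coalesced positional join: given one data
# manifest and several order-preserving affiliated DV manifests, pair
# each data file with the first non-null DV at its position.

from typing import Optional, Sequence

def coalesce_dvs(data_files: Sequence[str],
                 dv_manifests: Sequence[Sequence[Optional[str]]]
                 ) -> list:
    """Pair each data file with its DV (or None), by position only."""
    for dvs in dv_manifests:
        if len(dvs) != len(data_files):
            raise ValueError("affiliated DV manifest must be order-preserving")
    paired = []
    for i, f in enumerate(data_files):
        # COALESCE: first non-null DV across the affiliated manifests wins,
        # so manifests should be ordered newest first.
        dv = next((m[i] for m in dv_manifests if m[i] is not None), None)
        paired.append((f, dv))
    return paired

files = ["a.parquet", "b.parquet", "c.parquet"]
newer = [None, "dv-b-2", None]      # most recent affiliated DV manifest
older = ["dv-a-1", "dv-b-1", None]  # an earlier one
# a keeps dv-a-1, b takes the newer dv-b-2, c has no DV
print(coalesce_dvs(files, [newer, older]))
```

This is why the equals/hashCode cost of a path-based join disappears: resolution is a single indexed scan.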
That >>>>>>>>>>>>> might be the optimal choice for read-heavy tables because it will >>>>>>>>>>>>> halve the >>>>>>>>>>>>> number of I/Os readers have to make. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Anoop >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Jan 30, 2026 at 6:03 PM Anton Okolnychyi < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I had a chance to catch up on some of the V4 discussions. >>>>>>>>>>>>>> Given that we are getting rid of the manifest list and switching >>>>>>>>>>>>>> to >>>>>>>>>>>>>> Parquet, I wanted to re-evaluate the possibility of direct DV >>>>>>>>>>>>>> assignment >>>>>>>>>>>>>> that we discarded in V3 to avoid regressions. I have put >>>>>>>>>>>>>> together my >>>>>>>>>>>>>> thoughts in a doc [1]. >>>>>>>>>>>>>> >>>>>>>>>>>>>> TL;DR: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - I think the current V4 proposal that keeps data and delete >>>>>>>>>>>>>> manifests separate but introduces affinity is a solid choice for >>>>>>>>>>>>>> cases when >>>>>>>>>>>>>> we need to replace DVs in many / most files. I outlined an >>>>>>>>>>>>>> approach with >>>>>>>>>>>>>> column-split Parquet files but it doesn't improve the >>>>>>>>>>>>>> performance and takes a >>>>>>>>>>>>>> dependency on a portion of the Parquet spec that is not really >>>>>>>>>>>>>> implemented. >>>>>>>>>>>>>> - Pushing unaffiliated DVs directly into the root to replace >>>>>>>>>>>>>> a small set of DVs is going to be fast on write but does require >>>>>>>>>>>>>> resolving >>>>>>>>>>>>>> where those DVs apply at read time. Using inline metadata DVs >>>>>>>>>>>>>> with >>>>>>>>>>>>>> column-split Parquet files is a little more promising in this >>>>>>>>>>>>>> case as it >>>>>>>>>>>>>> allows us to avoid unaffiliated DVs. That said, it again relies on >>>>>>>>>>>>>> something >>>>>>>>>>>>>> Parquet doesn't implement right now, requires changing >>>>>>>>>>>>>> maintenance >>>>>>>>>>>>>> operations, and yields minimal benefits. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> All in all, the V4 proposal seems like a strict improvement >>>>>>>>>>>>>> over V3 but I insist that we reconsider the use of the referenced >>>>>>>>>>>>>> data file >>>>>>>>>>>>>> path when resolving DVs to data files. >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] - >>>>>>>>>>>>>> https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o >>>>>>>>>>>>>> >>>>>>>>>>>>>> - Anton >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Nov 22, 2025 at 13:37, Amogh Jahagirdar < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Here is the meeting recording >>>>>>>>>>>>>>> <https://drive.google.com/file/d/1lG9sM-JTwqcIgk7JsAryXXCc1vMnstJs/view?usp=sharing> >>>>>>>>>>>>>>> and generated meeting summary >>>>>>>>>>>>>>> <https://docs.google.com/document/d/1e50p8TXL2e3CnUwKMOvm8F4s2PeVMiKWHPxhxOW1fIM/edit?usp=sharing>. >>>>>>>>>>>>>>> Thanks all for attending yesterday! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Nov 20, 2025 at 8:49 AM Amogh Jahagirdar < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I was out for some time, but set up a sync for tomorrow at >>>>>>>>>>>>>>>> 9am PST. For this discussion, I do think it would be great to >>>>>>>>>>>>>>>> focus on the >>>>>>>>>>>>>>>> manifest DV representation, factoring in analyses on bitmap >>>>>>>>>>>>>>>> representation >>>>>>>>>>>>>>>> storage footprints, and the entry structure considering how we >>>>>>>>>>>>>>>> want to >>>>>>>>>>>>>>>> approach change detection. If there are other topics that >>>>>>>>>>>>>>>> people want to >>>>>>>>>>>>>>>> highlight, please do bring those up as well! 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I also recognize that this is a bit short notice for scheduling, >>>>>>>>>>>>>>>> so please do reach out to me if this time is difficult to work >>>>>>>>>>>>>>>> with; next >>>>>>>>>>>>>>>> week is the Thanksgiving holidays here, and since people would >>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>> travelling/out I figured I'd try to schedule before then. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Oct 17, 2025 at 9:03 AM Amogh Jahagirdar < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sorry for the delay, here's the recording link >>>>>>>>>>>>>>>>> <https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view> >>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>> last week's discussion. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025 at 9:44 AM Péter Váry < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Same here. >>>>>>>>>>>>>>>>>> Please record if you can. >>>>>>>>>>>>>>>>>> Thanks, Peter >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025, 17:39 Fokko Driesprong < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hey Amogh, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for the write-up. Unfortunately, I won’t be able >>>>>>>>>>>>>>>>>>> to attend. Will it be recorded? Thanks! >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>>>>>>> Fokko >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Oct 7, 2025 at 20:36, Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I've set up time this Friday at 9am PST for another sync >>>>>>>>>>>>>>>>>>>> on single file commits. 
In terms of what would be great to >>>>>>>>>>>>>>>>>>>> focus on for the >>>>>>>>>>>>>>>>>>>> discussion: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 1. Whether it makes sense or not to eliminate the >>>>>>>>>>>>>>>>>>>> tuple, and instead represent the tuple via lower/upper >>>>>>>>>>>>>>>>>>>> boundaries. As a >>>>>>>>>>>>>>>>>>>> reminder, one of the goals is to avoid tying a partition >>>>>>>>>>>>>>>>>>>> spec to a >>>>>>>>>>>>>>>>>>>> manifest; in the root we can have a mix of files spanning >>>>>>>>>>>>>>>>>>>> different >>>>>>>>>>>>>>>>>>>> partition specs, and even in leaf manifests avoiding this >>>>>>>>>>>>>>>>>>>> coupling can >>>>>>>>>>>>>>>>>>>> enable more desirable clustering of metadata. >>>>>>>>>>>>>>>>>>>> In the vast majority of cases, we could leverage the >>>>>>>>>>>>>>>>>>>> property that a file is effectively partitioned if the >>>>>>>>>>>>>>>>>>>> lower/upper for a >>>>>>>>>>>>>>>>>>>> given field are equal. The nuance here is with the >>>>>>>>>>>>>>>>>>>> particular case of >>>>>>>>>>>>>>>>>>>> identity partitioned string/binary columns which can be >>>>>>>>>>>>>>>>>>>> truncated in stats. >>>>>>>>>>>>>>>>>>>> One approach is to require that writers must not produce >>>>>>>>>>>>>>>>>>>> truncated stats >>>>>>>>>>>>>>>>>>>> for identity partitioned columns. It's also important to >>>>>>>>>>>>>>>>>>>> keep in mind that >>>>>>>>>>>>>>>>>>>> all of this is just for the purpose of reconstructing the >>>>>>>>>>>>>>>>>>>> partition tuple, >>>>>>>>>>>>>>>>>>>> which is only required during equality delete matching. >>>>>>>>>>>>>>>>>>>> Another area we >>>>>>>>>>>>>>>>>>>> need to cover as part of this is exact bounds on stats. 
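The lower/upper-bound rule just described can be made concrete with a small sketch (a hypothetical helper, not proposed spec text): a reader recovers an identity-partition value from column stats only when the bounds are equal and untruncated.

```python
# Illustrative sketch of reconstructing an identity-partition value from
# column stats. A file is effectively partitioned on a field when its
# lower and upper bounds are equal; for string/binary this is only safe
# when the stats are not truncated.

def identity_partition_value(lower, upper, truncated=False):
    """Return the partition value if recoverable from stats, else None."""
    if truncated:
        # Truncated string/binary bounds can be equal without the column
        # being single-valued, so reconstruction is unsafe.
        return None
    if lower == upper:
        return lower  # every row in the file shares this value
    return None  # file spans multiple values; not identity-partitioned

print(identity_partition_value("2026-03-01", "2026-03-01"))  # recoverable
print(identity_partition_value("abc", "abd"))                # not single-valued
```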
>>>>>>>>>>>>>>>>>>>> There are other >>>>>>>>>>>>>>>>>>>> options here as well such as making all new equality >>>>>>>>>>>>>>>>>>>> deletes in V4 be >>>>>>>>>>>>>>>>>>>> global and instead match based on bounds, or keeping the >>>>>>>>>>>>>>>>>>>> tuple but each >>>>>>>>>>>>>>>>>>>> tuple is effectively based off a union schema of all >>>>>>>>>>>>>>>>>>>> partition specs. I am >>>>>>>>>>>>>>>>>>>> adding a separate appendix section outlining the span of >>>>>>>>>>>>>>>>>>>> options here and >>>>>>>>>>>>>>>>>>>> the different tradeoffs. >>>>>>>>>>>>>>>>>>>> Once we get this more to a conclusive state, I'll move >>>>>>>>>>>>>>>>>>>> a summarized version to the main doc. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2. @[email protected] <[email protected]> has >>>>>>>>>>>>>>>>>>>> updated the doc with a section >>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn> >>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>> how we can do change detection from the root in a variety >>>>>>>>>>>>>>>>>>>> of write >>>>>>>>>>>>>>>>>>>> scenarios. I've done a review on it, and it covers the >>>>>>>>>>>>>>>>>>>> cases I would >>>>>>>>>>>>>>>>>>>> expect. It'd be good for folks to take a look and please >>>>>>>>>>>>>>>>>>>> give feedback >>>>>>>>>>>>>>>>>>>> before we discuss. Thank you Steven for adding that >>>>>>>>>>>>>>>>>>>> section and all the >>>>>>>>>>>>>>>>>>>> diagrams. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hey folks just following up from the discussion last >>>>>>>>>>>>>>>>>>>>> Friday with a summary and some next steps: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 1.) 
For the various change detection cases, we >>>>>>>>>>>>>>>>>>>>> concluded it's best just to go through those in an >>>>>>>>>>>>>>>>>>>>> offline manner on the >>>>>>>>>>>>>>>>>>>>> doc since it's hard to verify all that correctness in a >>>>>>>>>>>>>>>>>>>>> large meeting >>>>>>>>>>>>>>>>>>>>> setting. >>>>>>>>>>>>>>>>>>>>> 2.) We mostly discussed eliminating the >>>>>>>>>>>>>>>>>>>>> partition tuple. In the original proposal, I was mostly >>>>>>>>>>>>>>>>>>>>> aiming for the >>>>>>>>>>>>>>>>>>>>> ability to reconstruct the tuple from the stats for >>>>>>>>>>>>>>>>>>>>> the purpose of >>>>>>>>>>>>>>>>>>>>> equality delete matching (a file is partitioned if the >>>>>>>>>>>>>>>>>>>>> lower and upper >>>>>>>>>>>>>>>>>>>>> bounds are equal). There's some nuance in how we need to >>>>>>>>>>>>>>>>>>>>> handle identity >>>>>>>>>>>>>>>>>>>>> partition values since for string/binary they cannot be >>>>>>>>>>>>>>>>>>>>> truncated. >>>>>>>>>>>>>>>>>>>>> Another potential option is to treat all equality deletes >>>>>>>>>>>>>>>>>>>>> as effectively >>>>>>>>>>>>>>>>>>>>> global and narrow their application based on the stats >>>>>>>>>>>>>>>>>>>>> values. This may >>>>>>>>>>>>>>>>>>>>> require defining tight bounds. I'm still collecting my >>>>>>>>>>>>>>>>>>>>> thoughts on this one. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks folks! Please also let me know if any of the >>>>>>>>>>>>>>>>>>>>> following links are inaccessible for any reason. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Meeting recording link: >>>>>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Meeting summary: >>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Update: I moved the discussion time to this Friday at >>>>>>>>>>>>>>>>>>>>>> 9 am PST since I found out that quite a few folks >>>>>>>>>>>>>>>>>>>>>> involved in the proposals >>>>>>>>>>>>>>>>>>>>>> will be out next week, and I also know some folks will >>>>>>>>>>>>>>>>>>>>>> also be out the week >>>>>>>>>>>>>>>>>>>>>> after that. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>> Amogh J >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hey folks sorry for the late follow up here, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks @Kevin Liu <[email protected]> for >>>>>>>>>>>>>>>>>>>>>>> sharing the recording link of the previous discussion! >>>>>>>>>>>>>>>>>>>>>>> I've set up another >>>>>>>>>>>>>>>>>>>>>>> sync for next Tuesday 09/16 at 9am PST. This time I've >>>>>>>>>>>>>>>>>>>>>>> set it up from my >>>>>>>>>>>>>>>>>>>>>>> corporate email so we can get recordings and >>>>>>>>>>>>>>>>>>>>>>> transcriptions (and I've made >>>>>>>>>>>>>>>>>>>>>>> sure to keep the meeting invite open so we don't have >>>>>>>>>>>>>>>>>>>>>>> to manually let >>>>>>>>>>>>>>>>>>>>>>> people in). >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> In terms of next steps of areas which I think would >>>>>>>>>>>>>>>>>>>>>>> be good to focus on for establishing consensus: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 1. 
How do we model the manifest entry structure >>>>>>>>>>>>>>>>>>>>>>> so that changes to manifest DVs can be obtained easily >>>>>>>>>>>>>>>>>>>>>>> from the root? There >>>>>>>>>>>>>>>>>>>>>>> are a few options here; the most promising approach is >>>>>>>>>>>>>>>>>>>>>>> to keep an >>>>>>>>>>>>>>>>>>>>>>> additional DV that encodes the diff, i.e. the additional >>>>>>>>>>>>>>>>>>>>>>> positions that have >>>>>>>>>>>>>>>>>>>>>>> been removed from a leaf manifest. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 2. Modeling partition transforms via expressions and >>>>>>>>>>>>>>>>>>>>>>> establishing a unified table ID space so that we can >>>>>>>>>>>>>>>>>>>>>>> simplify how partition >>>>>>>>>>>>>>>>>>>>>>> tuples may be represented via stats and also have a way >>>>>>>>>>>>>>>>>>>>>>> in the future to >>>>>>>>>>>>>>>>>>>>>>> store stats on any derived column. I have a short >>>>>>>>>>>>>>>>>>>>>>> proposal >>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0> >>>>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> this that probably still needs some tightening up on >>>>>>>>>>>>>>>>>>>>>>> the expression >>>>>>>>>>>>>>>>>>>>>>> modeling itself (and some prototyping) but the general >>>>>>>>>>>>>>>>>>>>>>> idea for >>>>>>>>>>>>>>>>>>>>>>> establishing a unified table ID space is covered. All >>>>>>>>>>>>>>>>>>>>>>> feedback welcome! >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks Amogh. Looks like the recording for last >>>>>>>>>>>>>>>>>>>>>>>> week's sync is available on Youtube. 
Here's the link, >>>>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=uWm-p--8oVQ >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>> Kevin Liu >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Just following up on this to give the community an >>>>>>>>>>>>>>>>>>>>>>>>> update on where we're at and my proposed next steps. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> I've been editing and merging the contents from >>>>>>>>>>>>>>>>>>>>>>>>> our proposal into the proposal >>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw> >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> Russell and others. For any future comments on docs, >>>>>>>>>>>>>>>>>>>>>>>>> please comment on the >>>>>>>>>>>>>>>>>>>>>>>>> linked proposal. I've also marked it on our doc in >>>>>>>>>>>>>>>>>>>>>>>>> red text so it's clear >>>>>>>>>>>>>>>>>>>>>>>>> to redirect to the other proposal as a source of >>>>>>>>>>>>>>>>>>>>>>>>> truth for comments. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> In terms of next steps, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 1. An important design decision point is around >>>>>>>>>>>>>>>>>>>>>>>>> inline manifest DVs, external manifest DVs or >>>>>>>>>>>>>>>>>>>>>>>>> enabling both. 
I'm working on >>>>>>>>>>>>>>>>>>>>>>>>> measuring different approaches for representing the >>>>>>>>>>>>>>>>>>>>>>>>> compressed DV >>>>>>>>>>>>>>>>>>>>>>>>> representation since that will inform how many >>>>>>>>>>>>>>>>>>>>>>>>> entries can reasonably fit >>>>>>>>>>>>>>>>>>>>>>>>> in a small root manifest; from that we can derive >>>>>>>>>>>>>>>>>>>>>>>>> implications on different >>>>>>>>>>>>>>>>>>>>>>>>> write patterns and determine the right approach for >>>>>>>>>>>>>>>>>>>>>>>>> storing these manifest >>>>>>>>>>>>>>>>>>>>>>>>> DVs. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 2. Another key point is around determining if/how >>>>>>>>>>>>>>>>>>>>>>>>> we can reasonably enable V4 to represent changes in >>>>>>>>>>>>>>>>>>>>>>>>> the root manifest so >>>>>>>>>>>>>>>>>>>>>>>>> that readers can effectively just infer file level >>>>>>>>>>>>>>>>>>>>>>>>> changes from the root. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 3. One of the aspects of the proposal is getting >>>>>>>>>>>>>>>>>>>>>>>>> away from partition tuple requirement in the root >>>>>>>>>>>>>>>>>>>>>>>>> which currently holds us >>>>>>>>>>>>>>>>>>>>>>>>> to have associativity between a partition spec and a >>>>>>>>>>>>>>>>>>>>>>>>> manifest. These >>>>>>>>>>>>>>>>>>>>>>>>> aspects can be modeled as essentially column stats >>>>>>>>>>>>>>>>>>>>>>>>> which gives a lot of >>>>>>>>>>>>>>>>>>>>>>>>> flexibility into the organization of the manifest. >>>>>>>>>>>>>>>>>>>>>>>>> There are important >>>>>>>>>>>>>>>>>>>>>>>>> details around field ID spaces here which tie into >>>>>>>>>>>>>>>>>>>>>>>>> how the stats are >>>>>>>>>>>>>>>>>>>>>>>>> structured. What we're proposing here is to have a >>>>>>>>>>>>>>>>>>>>>>>>> unified expression ID >>>>>>>>>>>>>>>>>>>>>>>>> space that could also benefit us for storing things >>>>>>>>>>>>>>>>>>>>>>>>> like virtual columns >>>>>>>>>>>>>>>>>>>>>>>>> down the line. 
>>>>> I go into this in the proposal, but I'm working on separating out the appropriate parts so that the original proposal can mostly just focus on the organization of the content metadata tree and not on how we want to solve this particular ID space problem.
>>>>>
>>>>> 4. I'm planning on scheduling a recurring community sync starting next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that this time will never work, I can certainly adjust. For some reason, I don't have the ability to add to the Iceberg Dev calendar, so I'll figure that out and update the thread when the event is scheduled.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <[email protected]> wrote:
>>>>>
>>>>>> I think this is a great way forward; starting out with this much parallel development shows that we have a lot of consensus already :)
>>>>>>
>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>
>>>>>>> Hey folks, just following up on this.
>>>>>>> It looks like our proposal and the proposal that @Russell Spitzer <[email protected]> shared are pretty aligned. I was just chatting with Russell about this, and we think it'd be best to combine both proposals and have a singular large effort on this. I can also set up a focused community discussion (similar to what we're doing on the other V4 proposals) on this starting sometime next week just to get things moving, if that works for people.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Amogh Jahagirdar
>>>>>>>
>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Russell,
>>>>>>>>
>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop, and I) have also been working on a proposal for an adaptive metadata tree structure as part of enabling more efficient one-file commits. From a read of the summary, it's great to see that we're thinking along the same lines about how to tackle this fundamental area!
>>>>>>>> Here is our proposal: https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey y'all!
>>>>>>>>>
>>>>>>>>> We (Yi Fang, Steven Wu, and myself) wanted to share some of the thoughts we had on how one-file commits could work in Iceberg. This is pretty much just a high-level overview of the concepts we think we need and how Iceberg would behave. We haven't gone very far into the actual implementation and changes that would need to occur in the SDK to make this happen.
>>>>>>>>> The high-level summary is:
>>>>>>>>>
>>>>>>>>> - Manifest lists are out
>>>>>>>>> - Root manifests take their place
>>>>>>>>> - A root manifest can have data manifests, delete manifests, manifest delete vectors, data delete vectors, and data files
>>>>>>>>> - Manifest delete vectors allow for modifying a manifest without deleting it entirely
>>>>>>>>> - Data files let you append without writing an intermediary manifest
>>>>>>>>> - Having child data and delete manifests lets you still scale
>>>>>>>>>
>>>>>>>>> Please take a look if you like:
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>>>>>>>
>>>>>>>>> I'm excited to see what other proposals and ideas are floating around the community,
>>>>>>>>> Russ
>>>>>>>>>
>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Very excited about the idea!
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm very interested in this initiative.
>>>>>>>>>>> Micah Kornfield and I presented <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on high-throughput ingestion for Iceberg tables at the 2024 Iceberg Summit, which leveraged Google infrastructure like Colossus for efficient appends.
>>>>>>>>>>>
>>>>>>>>>>> This new proposal is particularly exciting because it offers significant advancements in commit latency and metadata storage footprint. Furthermore, a consistent manifest structure promises to simplify the design and codebase, which is a major benefit.
>>>>>>>>>>>
>>>>>>>>>>> A related idea I've been exploring is having a loose affinity between data and delete manifests. While the current separation of data and delete manifests in Iceberg is valuable for avoiding data file rewrites (and stats updates) when deletes change, it does necessitate a join operation during reads. I'd be keen to discuss approaches that could potentially reduce this read-side cost while retaining the benefits of separate manifests.
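[Editor's note: a minimal sketch of the read-side join Anoop describes, with entirely hypothetical names rather than Iceberg SDK code. Planning a scan means pairing each data file with the delete files that may apply to it, joined here on the partition key.]

```python
# Illustrative only: why separate data and delete manifests imply a
# read-side join. All names are hypothetical, not Iceberg APIs.
from collections import defaultdict

def plan_scan(data_entries, delete_entries):
    """Pair each data file with applicable delete files, joined on partition.

    data_entries:   [(partition, data_file_path), ...]
    delete_entries: [(partition, delete_file_path, sequence_number), ...]
    """
    # Index delete files by partition (this is the join Anoop mentions).
    deletes_by_partition = defaultdict(list)
    for partition, delete_path, seq in delete_entries:
        deletes_by_partition[partition].append((delete_path, seq))

    # Each scan task carries a data file plus the delete files that may apply.
    tasks = []
    for partition, data_path in data_entries:
        tasks.append((data_path, [p for p, _ in deletes_by_partition[partition]]))
    return tasks

tasks = plan_scan(
    data_entries=[("p=1", "a.parquet"), ("p=2", "b.parquet")],
    delete_entries=[("p=1", "d1.puffin", 7)],
)
# a.parquet is paired with d1.puffin; b.parquet has no applicable deletes.
```

Colocating a DV reference directly on each data file row (as discussed earlier in the thread) would make this pairing a lookup rather than a join, which is one way the read-side cost could come down.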
>>>>>>>>>>> Best,
>>>>>>>>>>> Anoop
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I am new to the Iceberg community but would love to participate in these discussions to reduce the number of file writes, especially for small writes/commits.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>> -Jagdeep
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We have been hitting all the metadata problems you mentioned, Ryan. I'm on board to help however I can to improve this area.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~ Anurag Mantripragada
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am interested in this idea and looking forward to collaboration.
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Huang-Hsiang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am interested in contributing to this effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Namratha
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for kicking this thread off, Ryan. I'm interested in helping out here! I've been working on a proposal in this area, and it would be great to collaborate with different folks and exchange ideas here, since I think a lot of people are interested in solving this problem.
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Like Russell's recent note, I'm starting a thread to connect those of us interested in the idea of changing Iceberg's metadata in v4 so that in most cases committing a change only requires writing one additional metadata file.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing at least one manifest and a new manifest list to produce a new snapshot. The goal of this work is to allow more flexibility by allowing the manifest list layer to store data and delete files. As a result, only one file write would be needed before committing the new snapshot.
>>>>>>>>>>>>>>>>> In addition, this work will also try to explore:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Avoiding small manifests that must be read in parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>> - Extending metadata skipping to use aggregated column ranges that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>>>>>>>> - Using soft deletes to avoid rewriting existing manifests (metadata DVs)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you're interested in these problems, please reply!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> John Zhuge
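[Editor's note: a rough illustration of the one-file commit idea discussed in this thread, under the assumption that a root manifest may directly hold data manifests, delete manifests, data files, and delete vector references. All names here are hypothetical, not the proposed V4 spec.]

```python
# Illustrative sketch: with a root manifest that can inline data files,
# a small append writes exactly one new metadata file (the new root).
from dataclasses import dataclass, field

@dataclass
class RootEntry:
    # "data-manifest" | "delete-manifest" | "data-file" | "manifest-dv" | "data-dv"
    kind: str
    path: str

@dataclass
class RootManifest:
    entries: list = field(default_factory=list)

def append_commit(previous: RootManifest, new_data_files: list) -> RootManifest:
    """Carry existing entries forward and inline the new data files in the
    root, so no intermediary child manifest or manifest list is written."""
    root = RootManifest(entries=list(previous.entries))
    for path in new_data_files:
        root.entries.append(RootEntry(kind="data-file", path=path))
    return root

prev = RootManifest(entries=[RootEntry("data-manifest", "m1.avro")])
new_root = append_commit(prev, ["f1.parquet"])
# new_root keeps the old data manifest and adds the inlined data file:
# one file write before the snapshot commit.
```

Later maintenance could still fold inlined data files down into child data manifests to keep the root small, which is how the "child data and delete manifests lets you still scale" point from Russell's summary would come into play.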
