Done https://youtu.be/IVPHvZcJ07Q
Amogh, I also added your gmail as an owner for the YouTube channel.

On Mon, Mar 30, 2026 at 8:32 AM Steven Wu <[email protected]> wrote:
> Amogh, can you upload the video to the YouTube channel?
> https://www.youtube.com/playlist?list=PLkifVhhWtccxt1TE7w_HbNGhY5gpDTaX7
>
> On Mon, Mar 30, 2026 at 8:28 AM Amogh Jahagirdar <[email protected]> wrote:
>
>> Hey a few folks reached out indicating that I didn't properly share the last v4 metadata tree meeting recording. So sorry about that! Here's the link <https://drive.google.com/file/d/1LhDL0Iy8YR4RN_W3D8APOUtkSBYk61fD/view?usp=drive_link>, do let me know if there are still issues.
>>
>> On Tue, Mar 3, 2026 at 9:17 AM Steven Wu <[email protected]> wrote:
>>
>>> My takeaway from the conversation is also that we don't need row-level column updates. Manifest DVs can be used for row-level updates instead. Basically, a file (manifest or data) can be updated via (1) a delete vector plus updated rows in a new file, or (2) a column file overlay. Depending on the percentage of modified rows, engines can choose which way to go.
>>>
>>> On Tue, Mar 3, 2026 at 6:24 AM Gábor Kaszab <[email protected]> wrote:
>>>
>>>> Thanks for the summary, Micah! I tried to watch the recording linked to the calendar event, but apparently I don't have permission to do so. Not sure about others.
>>>>
>>>> So if I'm not mistaken, one way to reduce the write cost of an UPDATE for colocated DVs is to use column updates. As I see it, there was some agreement that row-level partial column updates aren't desired, and we aim for at least file-level column updates. This is very useful information for the other conversation <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny> going on for the column update proposal.
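Steven's two update paths can be reduced to a simple threshold rule. A minimal sketch in Python; the function name and the 5% cutoff are illustrative assumptions, since the thread deliberately leaves the actual cutoff to each engine:

```python
def choose_update_strategy(modified_rows: int, total_rows: int,
                           overlay_threshold: float = 0.05) -> str:
    """Pick how to update a file (manifest or data), per the two paths
    discussed in the thread. The 5% threshold is an assumed placeholder."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    fraction = modified_rows / total_rows
    if fraction < overlay_threshold:
        # Path (1): mark old rows deleted via a DV and append the
        # updated rows to a new file.
        return "delete-vector"
    # Path (2): rewrite the affected column(s) as a column file overlay.
    return "column-overlay"

print(choose_update_strategy(10, 10_000))     # few rows changed
print(choose_update_strategy(9_000, 10_000))  # most rows changed
```

The point of keeping both paths is exactly this per-operation choice: small, scattered changes stay cheap via DVs, while bulk changes avoid DV bloat via an overlay.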
>>>> We can bring this up on the column update sync tomorrow, but I'm wondering if the consensus on avoiding row-level column updates is something we can incorporate into the column update proposal too, or if it's still up for debate.
>>>>
>>>> Best Regards,
>>>> Gabor
>>>>
>>>> On Wed, Feb 25, 2026 at 22:30 Micah Kornfield <[email protected]> wrote:
>>>>
>>>>> Just wanted to summarize my main takeaways from Monday's sync.
>>>>>
>>>>> The approach will always colocate DVs with the data files (i.e. every data file row in a manifest has an optional DV reference). This implies that there is not a separate "deletion manifest". Rather, in V4 all manifests are "combined", with data files and DVs colocated.
>>>>>
>>>>> Write amplification is avoided in two ways:
>>>>> 1. For small updates we will need to carry through metadata statistics (and other relevant data file fields) in memory (rescanning these is likely too expensive). Once updates are available they will be written out to a new manifest (either root or leaf), using metadata DVs to remove the old rows.
>>>>> 2. For larger updates we will only carry through the DV update parts in memory and use column-level updates to replace existing DVs (this would require rescanning the DV columns for any updated manifest to merge with the updated DVs in memory, and then writing out the column update). The consensus on the call was that we don't want to support partial column updates (a.k.a. merge-on-read column updates).
>>>>>
>>>>> The idea is that engines would decide which path to follow based on the number of affected files.
>>>>>
>>>>> To help understand the implications of the new proposal, I put together a quick spreadsheet [1] to analyze trade-offs between separate deletion manifests and the new approach under scenarios 1 and 2.
>>>>> This represents the worst-case scenario, where file updates are uniformly distributed across a single update operation. It does not account for repeated writes (e.g. ongoing compaction). My main takeaway is that keeping at most one affiliated DV separate might still help (akin to a merge-on-read column update), but maybe not enough, relative to other parts of the system (e.g. the churn on data files), to justify the complexity.
>>>>>
>>>>> Hope this is helpful.
>>>>>
>>>>> Micah
>>>>>
>>>>> [1] https://docs.google.com/spreadsheets/d/1klZQxV7ST2C-p9LTMmai_5rtFiyupj6jSLRPRkdI-u8/edit?gid=0#gid=0
>>>>>
>>>>> On Thu, Feb 19, 2026 at 3:52 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>
>>>>>> Hey folks, I've set up an additional initial discussion on DVs for Monday. This topic is fairly complex and there is also now a free calendar slot. I think it'd be helpful for us to first make sure we're all on the same page in terms of what the approach proposed by Anton earlier in the thread means, and its high-level mechanics. I should also have more to share on the doc about what the entry structure and change detection could look like in this approach. Then on Thursday we can get into more details and targeted points of discussion on this topic.
>>>>>>
>>>>>> Thanks,
>>>>>> Amogh Jahagirdar
>>>>>>
>>>>>> On Tue, Feb 17, 2026 at 9:27 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Steven! I've set up some time next Thursday for the community to discuss this. We're also looking at what the content entry would look like in a combined DV with potential column updates for DV changes, and what change detection could look like in this approach. I should have more to share on this by the time of the community discussion next week.
>>>>>>> We should also consider potential root churn and memory consumption stemming from expected root entry inflation due to a combined data file + DV entry with possible column updates for certain DV workloads; though at least for the memory consumption of stats held after planning, that arguably is an implementation problem for certain integrations.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Amogh Jahagirdar
>>>>>>>
>>>>>>> On Fri, Feb 13, 2026 at 10:58 AM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> I wrote up some analysis with back-of-the-envelope calculations about the column update approach for DV colocation. It mainly concerns the 2nd use case: deleting a large number of rows from a small number of files.
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.gvdulzy486n7
>>>>>>>>
>>>>>>>> On Wed, Feb 4, 2026 at 1:02 AM Péter Váry <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I fully agree with Anton and Steven that we need benchmarks before choosing any direction.
>>>>>>>>>
>>>>>>>>> I ran some preliminary column-stitching benchmarks last summer:
>>>>>>>>>
>>>>>>>>> - Results are available in the doc: https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww
>>>>>>>>> - Code is here: https://github.com/apache/iceberg/pull/13306
>>>>>>>>>
>>>>>>>>> I've summarized the most relevant results at the end of this email. They show roughly a 10% slowdown on the read path with column stitching in similar scenarios when using local SSDs. I expect that in real deployments the metadata read cost will mostly be driven by blob I/O (assuming no caching).
>>>>>>>>> If blob access becomes the dominant factor in read latency, multithreaded fetching should be able to absorb the overhead introduced by column stitching, resulting in latency similar to the single-file layout (unless IO is already the bottleneck).
>>>>>>>>>
>>>>>>>>> We should definitely rerun the benchmarks once we have a clearer understanding of the intended usage patterns.
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> The relevant(ish) results are for 100 columns, with 2 families of 50-50 columns, and local reads:
>>>>>>>>>
>>>>>>>>> The base is:
>>>>>>>>> MultiThreadedParquetBenchmark.read  100  0  false  ss  20  3.739 ± 0.096  s/op
>>>>>>>>>
>>>>>>>>> The read for single threaded:
>>>>>>>>> MultiThreadedParquetBenchmark.read  100  2  false  ss  20  4.036 ± 0.082  s/op
>>>>>>>>>
>>>>>>>>> The read for multi threaded:
>>>>>>>>> MultiThreadedParquetBenchmark.read  100  2  true   ss  20  4.063 ± 0.080  s/op
>>>>>>>>>
>>>>>>>>> On Tue, Feb 3, 2026 at 23:27 Steven Wu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I agree with Anton in this <https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o/edit?disco=AAAByzDx21w> comment thread that we probably need to run benchmarks for a few common scenarios to guide this decision. We need to write down detailed plans for those scenarios and what we are measuring. Also, ideally we want to measure using the V4 metadata structure (like the Parquet manifest file, column stats structs, and the adaptive tree). There are PoC PRs available for column stats, the Parquet manifest, and the root manifest. It would probably be tricky to piece them together to run the benchmark, considering the PoC status.
>>>>>>>>>> We also need the column stitching capability on the read path to test the column file approach.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 3, 2026 at 1:53 PM Anoop Johnson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm in favor of co-located DV metadata with column file override, and not doing affiliated/unaffiliated delete manifests. This is conceptually similar to strictly affiliated delete manifests with positional joins, and will halve the number of I/Os when there is no DV column override. It is simpler to implement and will speed up reads.
>>>>>>>>>>>
>>>>>>>>>>> Unaffiliated DV manifests are flexible for writers. They reduce the chance of physical conflicts when there are concurrent large/random deletes that change DVs on different files in the same manifest. But the flexibility comes at a read-time cost. If the number of unaffiliated DVs exceeds a threshold, it could cause driver OOMs or require a distributed join to pair up DVs with data files. With colocated metadata, manifest DVs can reduce the chance of conflicts up to a certain write size.
>>>>>>>>>>>
>>>>>>>>>>> I assume we will still need unaffiliated manifests for equality deletes, but perhaps we can restrict them to equality deletes only.
>>>>>>>>>>>
>>>>>>>>>>> -Anoop
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 2, 2026 at 4:27 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I added the approach with column files to the doc.
>>>>>>>>>>>>
>>>>>>>>>>>> To sum up, separate data and delete manifests with affinity would perform somewhat on par with co-located DV metadata (a.k.a.
>>>>>>>>>>>> direct assignment) if we add support for column files when we need to replace most or all DVs (use case 1). That said, support for direct assignment with in-line metadata DVs can help us avoid unaffiliated delete manifests when we need to replace a few DVs (use case 2).
>>>>>>>>>>>>
>>>>>>>>>>>> So the key question is whether we want to allow unaffiliated delete manifests with DVs... If we don't, then we would likely want to have co-located DV metadata, and we must support efficient column updates so as not to regress compared to V2 and V3 for large MERGE jobs that modify a small set of records in most files.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 2, 2026 at 13:20 Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Anoop, correct: if we keep data and delete manifests separate, there is a better way to combine the entries, and we should NOT rely on the referenced data file path. Reconciling by implicit position will reduce the size of the DV entry (no need to store the referenced data file path) and will improve planning performance (no equals/hashCode on the path).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Steven, I agree. Most notes in the doc pre-date the discussions we had on column updates. You are right: given that we are gravitating towards a native way to handle column updates, it seems logical to use the same approach for replacing DVs, since they're essentially column updates. Let me add one more approach to the doc based on what Anurag and Peter have so far.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Feb 1, 2026 at 20:59 Steven Wu <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anton, thanks for raising this. I agree this deserves another look. I added a comment in your doc that we can potentially apply the column update proposal for data file updates to manifest file updates as well, to colocate the data DVs and data manifest files. Data DVs can be a separate column in the data manifest file, updated separately in a column file. This is the same as the coalesced positional join that Anoop mentioned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Feb 1, 2026 at 4:14 PM Anoop Johnson <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for raising this, Anton. I had a similar observation while prototyping <https://github.com/apache/iceberg/pull/14533> the adaptive metadata tree. The overhead of doing a path-based hash join of a data manifest with the affiliated delete manifest is high: my estimate was that the join adds about 5-10% overhead. The hash table build/probe alone takes about 5 ms for manifests with 25K entries. There are engines that can do vectorized hash joins that can lower this, but the overhead and complexity of a SIMD-friendly hash join is non-trivial.
>>>>>>>>>>>>>>> An alternative to relying on the external file feature in Parquet is to make affiliated manifests order-preserving: i.e. DVs in an affiliated delete manifest must appear in the same position as the corresponding data file in the data manifest the delete manifest is affiliated to. If a data file does not have a DV, the DV manifest must store a NULL. This would allow us to do positional joins, which are much faster. If we wanted, we could even have multiple affiliated DV manifests for a data manifest, and the reader would do a COALESCED positional join (i.e. pick the first non-null value as the DV). It puts the sorting responsibility on the writers, but it might be a reasonable tradeoff.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, the options don't necessarily have to be mutually exclusive. We could still allow affiliated DVs to be "folded" into the data manifest (e.g. by background optimization jobs or the writer itself). That might be the optimal choice for read-heavy tables because it will halve the number of I/Os readers have to make.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Anoop
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jan 30, 2026 at 6:03 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had a chance to catch up on some of the V4 discussions.
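The difference between the path-based hash join Anoop measured and the order-preserving alternative can be sketched as follows. This is an illustrative model, not Iceberg code; the dictionary shapes and field names are assumptions:

```python
def hash_join_dvs(data_files, dv_entries):
    """Path-based hash join: pair each data file with its DV by file path.
    This is the build/probe step whose cost was estimated at ~5 ms for
    25K entries in the thread."""
    by_path = {dv["referenced_path"]: dv for dv in dv_entries}
    return [(f, by_path.get(f["path"])) for f in data_files]

def coalesced_positional_join(data_files, dv_manifests):
    """Order-preserving (positional) join: each affiliated DV manifest
    mirrors the data manifest's ordering, storing None where a data file
    has no DV. With several affiliated DV manifests, pick the first
    non-null DV per position (COALESCE semantics)."""
    paired = []
    for i, f in enumerate(data_files):
        dv = next((m[i] for m in dv_manifests if m[i] is not None), None)
        paired.append((f, dv))
    return paired

data = [{"path": "a.parquet"}, {"path": "b.parquet"}, {"path": "c.parquet"}]
dvs = [{"referenced_path": "b.parquet", "dv": "dv-b"}]
print(hash_join_dvs(data, dvs))
# Two DV manifests; the first (newer) one wins for position 0.
print(coalesced_positional_join(data, [["dv-a2", None, None],
                                       ["dv-a1", "dv-b", None]]))
```

The positional variant skips the hash-table build and the per-entry equals/hashCode on paths entirely, which is the performance argument made above; the cost is that writers must emit DV manifests in the data manifest's order.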
>>>>>>>>>>>>>>>> Given that we are getting rid of the manifest list and switching to Parquet, I wanted to re-evaluate the possibility of direct DV assignment that we discarded in V3 to avoid regressions. I have put together my thoughts in a doc [1].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> TL;DR:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - I think the current V4 proposal that keeps data and delete manifests separate but introduces affinity is a solid choice for cases when we need to replace DVs in many or most files. I outlined an approach with column-split Parquet files, but it doesn't improve the performance and takes a dependency on a portion of the Parquet spec that is not really implemented.
>>>>>>>>>>>>>>>> - Pushing unaffiliated DVs directly into the root to replace a small set of DVs is going to be fast on write, but does require resolving where those DVs apply at read time. Using inline metadata DVs with column-split Parquet files is a little more promising in this case, as it allows us to avoid unaffiliated DVs. That said, it again relies on something Parquet doesn't implement right now, requires changing maintenance operations, and yields minimal benefits.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> All in all, the V4 proposal seems like a strict improvement over V3, but I insist that we reconsider the use of the referenced data file path when resolving DVs to data files.
>>>>>>>>>>>>>>>> [1] - https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Nov 22, 2025 at 13:37 Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here is the meeting recording <https://drive.google.com/file/d/1lG9sM-JTwqcIgk7JsAryXXCc1vMnstJs/view?usp=sharing> and generated meeting summary <https://docs.google.com/document/d/1e50p8TXL2e3CnUwKMOvm8F4s2PeVMiKWHPxhxOW1fIM/edit?usp=sharing>. Thanks all for attending yesterday!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Nov 20, 2025 at 8:49 AM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I was out for some time, but set up a sync for tomorrow at 9am PST. For this discussion, I do think it would be great to focus on the manifest DV representation, factoring in analyses of bitmap representation storage footprints, and on the entry structure, considering how we want to approach change detection. If there are other topics that people want to highlight, please do bring those up as well!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I also recognize that this is a bit short-notice scheduling, so please do reach out to me if this time is difficult to work with; next week is the Thanksgiving holiday here, and since people would be travelling/out I figured I'd try to schedule before then.
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 17, 2025 at 9:03 AM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sorry for the delay, here's the recording link <https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view> from last week's discussion.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025 at 9:44 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Same here.
>>>>>>>>>>>>>>>>>>>> Please record if you can.
>>>>>>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025, 17:39 Fokko Driesprong <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hey Amogh,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for the write-up. Unfortunately, I won't be able to attend. Will it be recorded? Thanks!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Oct 7, 2025 at 20:36 Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I've set up time this Friday at 9am PST for another sync on single file commits. In terms of what would be great to focus on for the discussion:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>>>>>> Whether or not it makes sense to eliminate the tuple, and instead represent the tuple via lower/upper boundaries. As a reminder, one of the goals is to avoid tying a partition spec to a manifest; in the root we can have a mix of files spanning different partition specs, and even in leaf manifests avoiding this coupling can enable more desirable clustering of metadata.
>>>>>>>>>>>>>>>>>>>>>> In the vast majority of cases, we could leverage the property that a file is effectively partitioned on a given field if its lower and upper bounds are equal. The nuance here is the particular case of identity-partitioned string/binary columns, whose stats can be truncated. One approach is to require that writers must not produce truncated stats for identity-partitioned columns. It's also important to keep in mind that all of this is just for the purpose of reconstructing the partition tuple, which is only required during equality delete matching. Another area we need to cover as part of this is exact bounds on stats. There are other options here as well, such as making all new equality deletes in V4 global and matching based on bounds instead, or keeping the tuple but basing each tuple off a union schema of all partition specs. I am adding a separate appendix section outlining the span of options here and the different tradeoffs.
>>>>>>>>>>>>>>>>>>>>>> Once we get this to a more conclusive state, I'll move a summarized version to the main doc.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2. @[email protected] <[email protected]> has updated the doc with a section <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn> on how we can do change detection from the root in a variety of write scenarios. I've done a review of it, and it covers the cases I would expect. It'd be good for folks to take a look and give feedback before we discuss. Thank you Steven for adding that section and all the diagrams.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hey folks, just following up from the discussion last Friday with a summary and some next steps:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1.) For the various change detection cases, we concluded it's best just to go through those in an offline manner on the doc, since it's hard to verify all that correctness in a large meeting setting.
>>>>>>>>>>>>>>>>>>>>>>> 2.)
>>>>>>>>>>>>>>>>>>>>>>> We mostly discussed eliminating the partition tuple. In the original proposal, I was mostly aiming for the ability to reconstruct the tuple from the stats for the purpose of equality delete matching (a file is partitioned if the lower and upper bounds are equal); there's some nuance in how we need to handle identity partition values, since for string/binary they cannot be truncated. Another potential option is to treat all equality deletes as effectively global and narrow their application based on the stats values. This may require defining tight bounds. I'm still collecting my thoughts on this one.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks folks! Please also let me know if any of the following links are inaccessible for any reason.
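The reconstruction rule discussed above (a file behaves as partitioned on a field when that field's lower and upper bounds are equal) can be sketched like this; the stats layout and field names are illustrative assumptions:

```python
def reconstruct_partition_tuple(stats: dict) -> dict:
    """Rebuild an effective partition tuple from per-field column stats.

    stats maps field name -> (lower_bound, upper_bound). A field
    contributes a partition value only when lower == upper, i.e. the
    file holds a single value for it. Truncated string/binary bounds
    would break this equality, which is why the thread discusses
    requiring exact stats for identity-partitioned columns.
    """
    return {field: lo for field, (lo, hi) in stats.items() if lo == hi}

# 'event_date' holds a single value, so the file behaves as
# identity-partitioned on it; 'user_id' spans a range and contributes
# nothing to the tuple.
stats = {"event_date": ("2026-03-01", "2026-03-01"), "user_id": (17, 90341)}
print(reconstruct_partition_tuple(stats))
```

Note the failure mode the thread calls out: if a writer truncated `"2026-03-01..."` bounds differently for lower and upper, the equality test would miss a genuinely partitioned file.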
>>>>>>>>>>>>>>>>>>>>>>> Meeting recording link: https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Meeting summary: https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Update: I moved the discussion time to this Friday at 9 am PST since I found out that quite a few folks involved in the proposals will be out next week, and I know some folks will also be out the week after that.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Amogh J
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hey folks, sorry for the late follow-up here,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks @Kevin Liu <[email protected]> for sharing the recording link of the previous discussion! I've set up another sync for next Tuesday 09/16 at 9am PST. This time I've set it up from my corporate email so we can get recordings and transcriptions (and I've made sure to keep the meeting invite open so we don't have to manually let people in).
>>>>>>>>>>>>>>>>>>>>>>>>> In terms of next steps, the areas which I think would be good to focus on for establishing consensus:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1. How do we model the manifest entry structure so that changes to manifest DVs can be obtained easily from the root? There are a few options here; the most promising approach is to keep an additional DV which encodes the diff, i.e. the positions that have been newly removed from a leaf manifest.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Modeling partition transforms via expressions and establishing a unified table ID space, so that we can simplify how partition tuples may be represented via stats and also have a way in the future to store stats on any derived column. I have a short proposal <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0> for this that probably still needs some tightening up on the expression modeling itself (and some prototyping), but the general idea for establishing a unified table ID space is covered. All feedback welcome!
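The diff-encoding DV in item 1 amounts to a set difference over deleted positions. A minimal sketch using Python sets as stand-ins for the actual bitmap representation (a real implementation would presumably use roaring bitmaps); all names here are illustrative:

```python
def dv_diff(old_dv: set, new_dv: set) -> set:
    """Positions newly marked deleted since the previous snapshot.

    A reader at the root can surface changes to a leaf manifest by
    reading only this diff DV, instead of re-deriving the change from
    both full DVs.
    """
    return new_dv - old_dv

def apply_diff(old_dv: set, diff: set) -> set:
    """Reconstruct the current DV from the previous DV plus the diff."""
    return old_dv | diff

old = {3, 17, 42}
new = {3, 17, 42, 99, 100}
diff = dv_diff(old, new)
print(sorted(diff))
assert apply_diff(old, diff) == new  # the diff round-trips
```

This sketch assumes positions are only ever added to a DV (rows deleted, never un-deleted), which matches how DVs accumulate between rewrites; a rewrite would reset the baseline.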
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Amogh. Looks like the recording for last week's sync is available on YouTube. Here's the link: https://www.youtube.com/watch?v=uWm-p--8oVQ
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Just following up on this to give the community an update on where we're at and my proposed next steps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I've been editing and merging the contents from our proposal into the proposal <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw> from Russell and others. For any future comments on docs, please comment on the linked proposal. I've also marked it on our doc in red text so it's clear to redirect to the other proposal as the source of truth for comments.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> In terms of next steps,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1.
An important design decision point is around inline manifest DVs, external manifest DVs, or enabling both. I'm working on measuring different approaches for the compressed DV representation since that will inform how many entries can reasonably fit in a small root manifest; from that we can derive implications for different write patterns and determine the right approach for storing these manifest DVs.

2. Another key point is determining if/how we can reasonably enable V4 to represent changes in the root manifest so that readers can effectively infer file-level changes from the root alone.

3. One of the aspects of the proposal is getting away from the partition tuple requirement in the root, which currently requires associativity between a partition spec and a manifest. These aspects can be modeled as essentially column stats, which gives a lot of flexibility in the organization of the manifest. There are important details around field ID spaces here which tie into how the stats are structured.
What we're proposing here is a unified expression ID space that could also benefit us for storing things like virtual columns down the line. I go into this in the proposal, but I'm working on separating out the appropriate parts so that the original proposal can mostly focus on the organization of the content metadata tree and not on how we want to solve this particular ID space problem.

4. I'm planning on scheduling a recurring community sync starting next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that this time will never work, I can certainly adjust. For some reason, I don't have the ability to add to the Iceberg Dev calendar, so I'll figure that out and update the thread when the event is scheduled.
Thanks,
Amogh Jahagirdar

On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <[email protected]> wrote:

I think this is a great way forward; starting out with this much parallel development shows that we have a lot of consensus already :)

On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]> wrote:

Hey folks, just following up on this. It looks like our proposal and the proposal that @Russell Spitzer <[email protected]> shared are pretty aligned. I was just chatting with Russell about this, and we think it'd be best to combine both proposals and have a singular large effort on this. I can also set up a focused community discussion (similar to what we're doing on the other V4 proposals) starting sometime next week just to get things moving, if that works for people.
Thanks,
Amogh Jahagirdar

On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]> wrote:

Hey Russell,

Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop and I) have also been working on a proposal for an adaptive metadata tree structure as part of enabling more efficient one-file commits. From a read of the summary, it's great to see that we're thinking along the same lines about how to tackle this fundamental area!

Here is our proposal: https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0

Thanks,
Amogh Jahagirdar

On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <[email protected]> wrote:

Hey y'all!

We (Yi Fang, Steven Wu, and myself) wanted to share some of the thoughts we had on how one-file commits could work in Iceberg.
This is pretty much just a high-level overview of the concepts we think we need and how Iceberg would behave. We haven't gone very far into the actual implementation and the changes that would need to occur in the SDK to make this happen.

The high-level summary is:

- Manifest lists are out
- Root manifests take their place
- A root manifest can have data manifests, delete manifests, manifest delete vectors, data delete vectors, and data files
- Manifest delete vectors allow for modifying a manifest without deleting it entirely
- Data files let you append without writing an intermediary manifest
- Having child data and delete manifests lets you still scale

Please take a look if you like:

https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0

I'm excited to see what other proposals and ideas are floating around the community,
Russ

On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <
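The summary above could be sketched as a single root structure. This is a hypothetical shape for illustration only — the field names (`RootManifest`, `manifest_dvs`, and so on) are invented here, and the linked proposal doc, not this sketch, is the source of truth:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class RootManifest:
    # Replaces the manifest list: one file that can reference every content kind.
    data_manifests: List[str] = field(default_factory=list)      # child manifests, for scale
    delete_manifests: List[str] = field(default_factory=list)
    data_files: List[str] = field(default_factory=list)          # inlined appends, no child manifest
    manifest_dvs: Dict[str, Set[int]] = field(default_factory=dict)  # soft-deleted positions per manifest
    data_dvs: Dict[str, Set[int]] = field(default_factory=dict)      # row-level deletes per data file

# A small append touches only the root: reference the new data file directly.
root = RootManifest(data_manifests=["m1.avro"])
root.data_files.append("data-00001.parquet")
# Removing an entry from m1 needs no manifest rewrite, just a manifest DV entry.
root.manifest_dvs.setdefault("m1.avro", set()).add(42)

assert root.data_files == ["data-00001.parquet"]
assert 42 in root.manifest_dvs["m1.avro"]
```

In this shape, both of the scenarios in the bullets — appending without an intermediary manifest, and modifying a manifest via a delete vector — complete with a single new root file.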
[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Very excited about the idea! >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Johnson <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm very interested in this initiative. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Micah Kornfield and I presented >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on high-throughput ingestion for Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tables at the 2024 Iceberg Summit, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which leveraged Google infrastructure like >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Colossus for efficient appends. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This new proposal is particularly exciting >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because it offers significant advancements in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> commit latency and metadata >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> storage footprint. Furthermore, a consistent >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manifest structure promises to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> simplify the design and codebase, which is a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> major benefit. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A related idea I've been exploring is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> having a loose affinity between data and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> delete manifests. 
While the current separation of data and delete manifests in Iceberg is valuable for avoiding data file rewrites (and stats updates) when deletes change, it does necessitate a join operation during reads. I'd be keen to discuss approaches that could potentially reduce this read-side cost while retaining the benefits of separate manifests.

Best,
Anoop

On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <[email protected]> wrote:

Hi everyone,

I am new to the Iceberg community but would love to participate in these discussions to reduce the number of file writes, especially for small writes/commits.

Thank you!
-Jagdeep

On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada <[email protected]> wrote:

We have been hitting all the metadata problems you mentioned, Ryan.
I'm on board to help however I can to improve this area.

~ Anurag Mantripragada

On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng <[email protected]> wrote:

I am interested in this idea and looking forward to collaboration.

Thanks,
Huang-Hsiang

On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]> wrote:

Hello,

I am interested in contributing to this effort.

Thanks,
Namratha

On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <[email protected]> wrote:

Thanks for kicking this thread off, Ryan. I'm interested in helping out here!
I've been working on a proposal in this area and it would be great to collaborate with different folks and exchange ideas here, since I think a lot of people are interested in solving this problem.

Thanks,
Amogh Jahagirdar

On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]> wrote:

Hi everyone,

Like Russell's recent note, I'm starting a thread to connect those of us that are interested in the idea of changing Iceberg's metadata in v4 so that in most cases committing a change only requires writing one additional metadata file.

*Idea: One-file commits*

The current Iceberg metadata structure requires writing at least one manifest and a new manifest list to produce a new snapshot.
The goal of this work is to allow more flexibility by allowing the manifest list layer to store data and delete files. As a result, only one file write would be needed before committing the new snapshot. In addition, this work will also try to explore:

- Avoiding small manifests that must be read in parallel and later compacted (metadata maintenance changes)
- Extending metadata skipping to use aggregated column ranges that are compatible with geospatial data (manifest metadata)
- Using soft deletes to avoid rewriting existing manifests (metadata DVs)

If you're interested in these problems, please reply!

Ryan
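Ryan's goal can be illustrated with a toy count of metadata writes per commit. This is purely illustrative arithmetic for the common small-append case, not a measurement, and the function names are invented for the sketch:

```python
def metadata_files_written_today(new_manifests=1):
    # Current structure: each snapshot writes its new manifest(s)
    # plus a fresh manifest list that references all manifests.
    return new_manifests + 1  # manifests + manifest list

def metadata_files_written_one_file_commit(new_child_manifests=0):
    # Sketch of the v4 idea: the manifest-list layer (root) can hold data
    # and delete files directly, so a small commit writes only the new root.
    return new_child_manifests + 1  # optional child manifests + root

# A small append today: one manifest + one manifest list = 2 metadata files.
assert metadata_files_written_today() == 2
# The same append under the one-file-commit idea: a single new root file.
assert metadata_files_written_one_file_commit() == 1
```

The savings compound for high-frequency small commits, which is exactly the streaming-ingest pattern discussed earlier in the thread.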
