I'm not sure that an update to the diagram is needed. Adding stats files and delete files makes the diagram harder to follow. And its purpose is to clearly show the high-level structure of the format to help people understand the basic idea, not to show all of the details. I think more complexity undermines that purpose so I'd probably not change it.

On Wed, Nov 8, 2023 at 9:34 AM Jason Hughes <[email protected]> wrote:

> since v2 has been out for a while and most tools that support iceberg
> support v2 (not to mention some only support v2), I think having a single
> diagram and using dotted lines for the delete manifests and delete files
> will cause more confusion than benefit. also because of the support and
> adoption of v2, personally I'm in favor of replacing the arch diagram with
> this one that's for v2. that said, if folks are in favor of it, I can also
> edit the v1 table diagram to include stats files too and have them coexist
> on the spec page, noting which is v1 and which is v2
>
> what does everyone think?
>
> Jason Hughes
> Dremio | Director of Technical Advocacy
>
> On Mon, Nov 6, 2023 at 12:47 AM Ajantha Bhat <[email protected]> wrote:
>
>>> However, there are a lot of boxes and new terms. What do you think of
>>> keeping both files, and indicating that the old applies to V1 tables,
>>> and the new one to V2 tables.
>>
>> Statistics are common for both V1 and V2. So, we can't say old applies to
>> V1 and new applies to V2.
>> For delete, we are using existing boxes.
>> So, I think we can keep only one image with dotted delete manifest and
>> delete files mentioning it is specific to V2 merge-on-read condition.
>>
>> Suggestions are welcome.
>>
>> On Mon, Nov 6, 2023 at 1:54 PM Eduard Tudenhoefner <[email protected]> wrote:
>>
>>> Thanks for updating the diagram and +1 to Fokko's suggestion.
>>>
>>> On Fri, Nov 3, 2023 at 3:43 PM Fokko Driesprong <[email protected]> wrote:
>>>
>>>> Hey Jason, thanks for updating the chart.
>>>>
>>>> I like it a lot. However, there are a lot of boxes and new terms. What
>>>> do you think of keeping both files, and indicating that the old applies
>>>> to V1 tables, and the new one to V2 tables.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Fri, Nov 3, 2023 at 14:37, Aaron Niskode-Dossett <[email protected]> wrote:
>>>>
>>>>> An update would be greatly appreciated, thank you!
>>>>>
>>>>> On Thu, Nov 2, 2023 at 12:42 PM Jason Hughes <[email protected]> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> The current architecture diagram
>>>>>> <https://iceberg.apache.org/img/iceberg-metadata.png> for an iceberg
>>>>>> table hasn't been updated in over 3 years, and there's are some
>>>>>> aspects to the architecture of an iceberg table that have changed,
>>>>>> most notably delete files and puffin files. since this diagram gets a
>>>>>> lot of use in enablement content around the community and isn't
>>>>>> totally accurate anymore, @Ajantha Bhat U <[email protected]> and
>>>>>> I discussed updating it to be more accurate
>>>>>>
>>>>>> here's an updated version of the diagram
>>>>>> <https://docs.google.com/drawings/d/1m_iiJIJjiymadFIsCYnuUS6BvFo0MYDPCx0kKhZgIx4/edit>
>>>>>> we put together
>>>>>>
>>>>>> a few points for discussion that we're interested in others' thoughts
>>>>>> on:
>>>>>>
>>>>>>    1. the diagram is obviously somewhat more visually complicated
>>>>>>    than the current one, but IMO the benefit of being more accurate
>>>>>>    for people learning iceberg outweighs the additional complexity
>>>>>>    2. since the partition stats spec PR
>>>>>>    <https://github.com/apache/iceberg/pull/7105> just got merged, we
>>>>>>    thought it'd be good to include that too while we're updating it,
>>>>>>    and combine puffin files with partition stats files into one
>>>>>>    category of files in the diagram labeled "statistics files". we
>>>>>>    combined them in the diagram, rather than splitting them up,
>>>>>>    because 1. it provides a simpler diagram, 2. gets the primary
>>>>>>    point across, and 3. they both serve the purpose of providing
>>>>>>    statistics for tools to leverage (albeit for different use cases)
>>>>>>    3. we put statistics files in place in the diagram for both s0
>>>>>>    and s1, though we could only have statistics files for s1, which
>>>>>>    would 1. make the diagram simpler, and 2. show a simple example of
>>>>>>    the use case of not needing stats files initially, but then as
>>>>>>    data grows and/or query patterns change, now stats files are
>>>>>>    needed
>>>>>>
>>>>>> if folks are on board with updating the diagram, and after we come to
>>>>>> a conclusion on the above discussion points and any others that come
>>>>>> up, I can export it to a png and create a PR to update the arch
>>>>>> diagram image on the site
>>>>>>
>>>>>> thanks!
>>>>>>
>>>>>> Jason Hughes
>>>>>> Dremio | Director of Technical Advocacy
>>>>>
>>>>> --
>>>>> Aaron Niskode-Dossett, Data Engineering -- Etsy

--
Ryan Blue
Tabular
