Iceberg Community Meetings are open to everyone. To receive an invitation to
the next meeting, please join the [email protected]
<https://groups.google.com/g/iceberg-sync> list. Special thanks to Ryan Blue
for contributing most of these notes.

Attendees: Anjali Norwood, Badrul Chowdhury, Ben Mears, Dan Weeks, Gustavo
Torres Torres, Jack Ye, Karuppayya Rajendran, Kyle Bendickson, Parth
Brahmbhatt, Russell Spitzer, Ryan Blue, Sreeram Garlapati, Szehon Ho, Wing
Yew Poon, Xinbin Huang, Yan Yan, Carl Steinbach
- Highlights
  - JDBC catalog was committed (Thanks, Ismail!)
  - DynamoDB catalog was committed (Thanks, Jack!)
  - Added predicate pushdown for the partitions metadata table (Thanks, Szehon!)
- Releases
  - 0.12.0
    - New Actions API update
      - Almost done with compaction.
      - Need to make the old API deprecated (to confirm).
    - Spark 3.1 support
      - Recently rebased on master: https://github.com/apache/iceberg/pull/2512
      - No longer adds new modules, should be ready to commit.
- Feature-based or time-based release cycle?
  - Carl: A time-based release cycle would be more predictable and would not
    slip because of some feature that isn't quite ready. This could be
    monthly or quarterly.
  - Ryan: We already try not to hold back releases to get features in,
    because it is better to release more often than to let releases slip.
    But we could be better about this. It's important to release
    continuously so that changes get back out to contributors.
  - The consensus was to discuss this on the dev list. It is a promising idea.
- Iceberg 1.0?
  - Carl: Semver is a lie, and there is a public perception around 1.0
    releases. Should we go ahead and target a 1.0 soon?
  - Ryan: What do you mean that semver is a lie?
  - Carl: If semver were followed carefully, most projects would be on a
    major version in the 100s. Many things change, and the version doesn't
    always reflect it.
  - Ryan: That's fair, but I think people still make downstream decisions
    based on how those version numbers change.
  - Jack: There is an expectation that breaking changes are signaled by
    increasing the major version, or more accurately, that not increasing
    the major version indicates no major APIs are broken.
  - Ryan: Also, bumping up to 1.0 is when people start expecting more rigid
    enforcement of semver, even if it isn't always done. If we want to
    update to 1.0 and/or drop semver, we should figure out our guarantees
    and document them clearly. And we should also prepare for more API
    stability. Maybe add binary compatibility checks to the build.
  - The consensus was to discuss this more on the dev list and target a 1.0
    for later this year with clear guidelines about API compatibility.
- New slack community: apache-iceberg.slack.com
  <https://communityinviter.com/apps/apache-iceberg/apache-iceberg-website>
  - It's easy to sign up for ASF Slack here: https://s.apache.org/slack-invite
  - No need for an independent Iceberg workspace.
- Any updates on the secondary index design?
  - Miao and Guy weren't at the meeting, so no update.
  - Jack is going to look into this and help out.
- Github triage permissions for project contributors
  - Carl opened an INFRA ticket for anyone with 2 or more contributions.
  - We will see if Infra can add everyone.
  - Ref: INFRA-22026, INFRA-22031
- Updating partitioning via Optimize/RewriteDataFiles
  - Russell: We ran into an issue where compaction with multiple partition
    specs will create many small files: planning groups files by the
    current spec, but writing can split data for the new spec. Since this
    is a rare event (unmerged data in an old spec), the solution is to
    merge files for the old spec separately.
  - Ryan: Sounds reasonable.
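The fix Russell describes can be sketched as grouping data files by the
partition spec they were written with, so each spec's files are compacted
separately. This is an illustrative sketch only; the types and names below
are hypothetical, not Iceberg's actual planning API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a data file tagged with the partition spec
// it was written under.
record DataFile(int specId, String path) {}

class CompactionPlanner {
    // Group files by partition spec ID so files written under an old
    // spec are merged with each other, rather than being rewritten into
    // many small files under the current spec.
    static Map<Integer, List<DataFile>> groupBySpec(List<DataFile> files) {
        return files.stream().collect(Collectors.groupingBy(DataFile::specId));
    }
}
```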
- Low-latency streaming
  - Sreeram: We are trying to see how frequently we can commit to an
    Iceberg table, looking to get to commits every 1-2 seconds. One main
    issue we've found is that several metadata files are written for every
    commit: at least one manifest, the manifest list, and the metadata JSON
    file. Plus, the metadata JSON file tracks many snapshots and gets quite
    large (3MB+) after a day of frequent commits. Is there a way to improve
    how the JSON file tracks snapshots?
  - Ryan: There is space to improve this. I've thought about replacing the
    JSON file with a database so that changes are more targeted and don't
    require rewriting all of the information. This is supported by the
    TableOperations API, which swaps TableMetadata objects. The JSON file
    isn't really required by the implementation, although it has become
    popular because it places all of the table metadata in the file system,
    so the source of truth is entirely in the table's files.
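The swap Ryan mentions can be illustrated with a simplified, self-contained
sketch. These are not the real Iceberg interfaces; the names and shapes
below are stand-ins that show only the optimistic metadata swap that a JSON
file, or a database row, would back.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for Iceberg's TableMetadata.
record TableMetadata(int version, String location) {}

class InMemoryTableOperations {
    private final AtomicReference<TableMetadata> current =
        new AtomicReference<>(new TableMetadata(0, "v0.json"));

    TableMetadata current() { return current.get(); }

    // A commit succeeds only if the caller's base is still the current
    // metadata: the same compare-and-swap contract whether the pointer
    // lives in a JSON file or a database.
    boolean commit(TableMetadata base, TableMetadata updated) {
        return current.compareAndSet(base, updated);
    }
}
```

A writer that loses the race gets a failed commit and must refresh and
retry against the new current metadata.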
  - Sreeram: What about writing diffs of the JSON files? We could, for
    example, write a new snapshot as the only content in a new JSON file.
  - Ryan: You could come up with a way to do that, but what you'd want to
    avoid is needing to read lots of files to reconstruct the table's
    current state. If you're trying to put together the history or
    snapshots metadata tables, you don't want to read the current file, its
    parent, that file's parent, and so on. (That's an easy design flaw to
    fall into.) What you should do instead is choose a base version and
    write all differences against that. We'd need to define the format for
    JSON diffs.
  - Ryan: And I think it may be more useful to replace the JSON file with a
    database, because writing diffs could introduce more commit conflicts:
    when the JSON file is periodically rewritten entirely to produce a new
    base version, that operation may fail due to faster commits from other
    writers. That would be bad for a table.
  - Ryan: What is the use case for this? 1-2 seconds is very frequent and
    causes other issues, like small data files that need to be compacted,
    plus compaction commit retries because of frequent, ongoing commits.
  - Sreeram: The idea is to see if we can replace Kafka with an Iceberg
    table in workflows.
  - Ryan: I don't think that's something you'd want to do. Iceberg just
    isn't designed for that kind of use case, and that is what Kafka does
    really well.
  - Kyle: Yeah, you'd definitely want to use Kafka for that. Iceberg is
    good for long-term storage and isn't a good replacement.
- Purge Behaviors
  - Russell: Spark's new API passes a purge flag through DROP TABLE. Do we
    want to respect that flag?
  - Ryan: Yes? Why wouldn't we?
  - Russell: Not everyone wants to purge data.
  - Ryan: Agreed. Netflix wouldn't do this because they often have to
    restore tables, but that's something that Netflix can turn off in their
    catalog. For the built-in catalogs, we should probably support the
    expected behavior.
- Deduplication as part of rewrite?
  - Kyle [I think]: What is the story around deduplication? Duplicate
    records are a common problem.
  - Ryan: Iceberg didn't have one before, but now that we have a way to
    identify records, thanks to Jack adding the row identifier fields, we
    could build something in this space. Maybe a background service that
    detects duplicates and rewrites? But we would want to be careful here,
    because it could easily attempt to read an entire table if the
    partition spec is not aligned with the identifier fields.
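The core of such a service would be deduplication keyed on the row
identifier fields. A minimal sketch, with hypothetical record shapes (not
Iceberg APIs), assuming rows arrive ordered by write time so the latest
copy per identifier wins:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row with an identifier field and a write timestamp.
record Row(String id, long writtenAt, String payload) {}

class Deduplicator {
    // Keep only the most recent row per identifier. Overwriting the map
    // entry keeps the later row because input is ordered by write time.
    static List<Row> dedupe(List<Row> rows) {
        Map<String, Row> latest = new LinkedHashMap<>();
        for (Row r : rows) {
            latest.put(r.id(), r);
        }
        return List.copyOf(latest.values());
    }
}
```

The caution from the meeting applies here: doing this efficiently requires
scoping the scan, since duplicates can live in any partition when the
partition spec is not aligned with the identifier fields.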