@tqchen Thanks for the reply! A couple of points:
- You're right that I2/I3 don't capture all dependencies. There is a notable gap between a set of Python dependencies and the docker container: the remaining libraries which may be present in the container.
- The method by which those libraries are installed is platform-dependent, but so far our approach for the CI has mostly been to depend on a package manager to satisfy them.
- I think that even if a lockfile can't be used primarily to reproduce the test environment, there are plenty of other uses. Here are a couple:
  - Determining which version of a C library is suitable when trying to install TVM on a system other than the docker containers. You might prefer to at least know the version of the Python package which uses it. I had trouble finding a non-dev dependency like this, but I think this pattern is very common, and [Pillow](https://pillow.readthedocs.io/en/latest/installation.html#external-libraries) is an example of such a package.
  - Oftentimes a user isn't trying to exactly reproduce the regression--they just want to know which minor version of a package to install, e.g. tensorflow 2.1 vs 2.2. This can make a big difference, especially when the main problem is API incompatibility. Even if a user can't reproduce the regression exactly, it's often the case that if they're using TVM in an application, they won't be able to use the TVM CI containers as a starting point.

### Updating Docker Containers

I think it's worth thinking through the case of updating the docker containers a little more, because I'm not sure I follow your reasoning here. The current process is as follows, per-container:

1. Update docker/install and the Dockerfile as needed to modify the container.
2. Run `docker/build.sh` to build the new container image.
3. Fix problems and rerun steps 1 and 2 as necessary.
4. Test the containers (discussed later).

The current build process is quite straightforward--it's just building a docker container. I2 and I3 propose to add these changes:

1. Prior to building the containers, create a lockfile.
2. Rebuild all containers when the lockfile changes, rather than just updating one.

Change 2 shouldn't significantly burden developers unless the build process is error-prone. We should fix that if it is, and perhaps a nightly container build isn't a bad Jenkins job to configure so we can address these failures as they happen.

Change 1 is the one that needs to be thought through. Ordinarily, building a lockfile means running a single command on the target platform, and the process often takes about a minute, so this would not be a significant burden. The tricky part is: we have multiple platforms in the CI (three, in fact: manylinux1_i686, manylinux1_x86_64, and manylinux_arm), and it's not possible to build the lockfile for a platform other than the one you're running on.

In thinking this through again, I realized that this is actually the same problem as the one we have with Windows: it's not possible to use a single lockfile for all platforms. If you are switching platforms, you have to accept that you may need to switch dependency versions.

So let's return to the ultimate goals of the I2/I3 approach:

1. Ensure all features of TVM can actually be used simultaneously on a given platform.
2. Standardize dependencies so that the unittests don't fail for different reasons on different containers.
3. Provide enough input to allow the `tlcpack` build process to place narrower constraints on the `install_requires` included there, based on the versions used in CI.

In light of the above, all of these goals can be achieved only within a single pip platform. So I'd propose a new way to install dependencies which addresses this, I4.

I4. A multi-part approach:

1. For each pip platform that has a ci- container built around it, produce a lockfile by either:
   1. running `pip install -r` on all TVM requirements and then `pip freeze`, or
   2. running `poetry lock`.

   The lockfile becomes part of the artifacts created when the containers are built. When multiple containers are built for the same pip platform (e.g. ci-lint, ci-cpu, ci-gpu, ci-qemu, ci-wasm), pick one container which will build the lockfile. The rest will merely consume this lockfile when installing packages, either as a pip constraints file or a poetry lockfile.

2. I would argue that between platforms, the version standardization we care about is effectively pinning TVM's direct Python dependencies to the minor version level. For instance, if we add Windows tests, we prefer to test against e.g. tensorflow 1.6 on both Linux and Windows. To address this, when building containers, we would first build a pilot platform, e.g. `manylinux1_x86_64`, scan the resulting lockfile for direct dependencies of TVM, and produce a loose constraints file for the remaining platforms. For example, if the lockfile contained an entry `tensorflow==1.6.8`, we might add `tensorflow>=1.6,<1.7` to the cross-platform constraints file (a rough sketch of this step is shown after this list). This may not always be possible (conda may not update fast enough for us, etc.), so we will need a per-platform override file.

This is complex, but it's likely that this is happening in the release process now anyway, and the reasons why different packages are being installed aren't documented anywhere.
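To make the constraint-loosening step in part 2 concrete, here is a minimal sketch. The file names, the direct-dependency list, and the exact pinning scheme are assumptions for illustration, not existing TVM build scripts:

```python
# Minimal sketch: derive a loose cross-platform constraints file from the pilot
# platform's lockfile (pip-freeze format). File names and the direct-dependency
# list below are hypothetical, not part of any existing TVM build script.
import re

# Direct Python dependencies of TVM whose minor versions we want to standardize
# across platforms (illustrative subset only).
DIRECT_DEPS = {"tensorflow", "onnx", "torch"}

def loosen(pin):
    """Turn e.g. 'tensorflow==1.6.8' into 'tensorflow>=1.6,<1.7'."""
    m = re.match(r"^([A-Za-z0-9_.-]+)==(\d+)\.(\d+)", pin.strip())
    if m is None or m.group(1).lower() not in DIRECT_DEPS:
        return None
    name, major, minor = m.group(1), int(m.group(2)), int(m.group(3))
    return "{}>={}.{},<{}.{}".format(name, major, minor, major, minor + 1)

with open("manylinux1_x86_64.lock") as lockfile, \
        open("cross-platform-constraints.txt", "w") as constraints:
    for line in lockfile:
        loose = loosen(line)
        if loose is not None:
            constraints.write(loose + "\n")
```

The per-platform override file mentioned above could then be merged over this output before the remaining platforms consume it.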
### Checking in new containers

To check in the built containers, these things need to happen:

1. Upload candidate containers to Docker Hub under your username.
2. Prepare a TVM PR which updates the Jenkinsfile to reference these test containers.
3. Have a TVM committer push the PR to [ci-docker-staging](https://github.com/apache/tvm/tree/ci-docker-staging).
4. Await test results and iterate as needed.
5. Push candidate containers to the `tlcpack` Docker Hub account.
6. Amend the TVM PR to point at the `tlcpack` images and submit it.

Here we're proposing that in step 2, we also include the per-platform lockfiles produced from container building.

### `tlcpack` Dependencies

Finally: I4 proposes we produce a cross-platform constraints file. This isn't discussed as part of this RFC (it's the next one), but it would be nice to produce a `tlcpack` wheel which includes some loose version constraints (e.g. so that by depending on tensorflow, a user running `pip install tlcpack` would install tensorflow 1.x and not 2.x). The version constraints in the cross-platform constraints file are exactly those we should use in such a `tlcpack` wheel. Previously, the follow-on RFC had proposed to analyze the CI lockfile and produce the `tlcpack` dependencies; I4 just proposes to move that analysis earlier, and effectively means that the CI would be testing against the same constraints present in the `tlcpack` wheel. This seems like a better change to me.
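To illustrate how those constraints might flow into the wheel, here is a minimal sketch; the constraints file name, the package metadata, and the choice to put everything into `install_requires` are assumptions, not the actual `tlcpack` packaging:

```python
# Rough illustration only: reuse the CI-derived cross-platform constraints as
# loose version bounds in a tlcpack-style setup.py. The file name, package
# metadata, and dependency handling here are placeholders, not the actual
# tlcpack build.
from setuptools import find_packages, setup

def read_constraints(path="cross-platform-constraints.txt"):
    """Read constraints such as 'tensorflow>=1.6,<1.7', skipping blanks and comments."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith("#")]

setup(
    name="tlcpack",
    version="0.0.0",  # placeholder version
    packages=find_packages(),
    # The wheel advertises the same minor-version bounds the CI tested against,
    # so `pip install tlcpack` resolves e.g. tensorflow 1.6.x rather than 2.x.
    install_requires=read_constraints(),
)
```

Keeping the bounds at the minor-version level matches what the CI actually exercised without pinning users to exact patch releases.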