@tqchen Thanks for the reply! A couple of points:
- You're right that I2/I3 don't capture all dependencies. There is a notable
gap between the set of Python dependencies and the docker container: the
remaining (non-Python) libraries which may be present in the container.
- The method by which those libraries are installed is platform-dependent, but
so far our approach in the CI has mostly been to depend on a package manager to
satisfy them.
- I think that even if a lockfile can't be used primarily to reproduce the test
environment, there are plenty of other uses--here are a couple:
- Determining which version of a C library is suitable when installing TVM on
a system other than the docker containers. Even if you can't match the library
exactly, you'd at least want to know the version of the Python package which
uses it. I had trouble finding a non-dev dependency like this, but I think the
pattern is very common;
[Pillow](https://pillow.readthedocs.io/en/latest/installation.html#external-libraries)
is an example of such a package.
- Oftentimes a user isn't trying to exactly reproduce the regression--they
just want to know which minor version of a package to install, e.g. tensorflow
2.1 vs 2.2. This can make a big difference, especially when the main problem is
API incompatibility. Even if a user can't reproduce the regression 100%, if
they're using TVM in an application, they often won't be able to use the TVM CI
containers as a starting point anyway.
### Updating Docker Containers
I think it's worth thinking through the case of updating the docker containers
a little more because I'm not sure I follow your reasoning here.
The current process is as follows, per-container:
1. update docker/install and Dockerfile as needed to modify the container
2. run `docker/build.sh` to build the new container image
3. Fix problems and rerun steps 1 & 2 as necessary.
4. test the containers -- discussed later
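To make this concrete, a single iteration of steps 1-2 might look something
like this; the container name (`ci_gpu`) and the edited install script are only
examples:

```bash
# Illustrative only: the edited script and container name are placeholders.
$EDITOR docker/install/ubuntu_install_tensorflow.sh   # step 1: modify the install scripts
./docker/build.sh ci_gpu                              # step 2: rebuild the container image
```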
The current build process is quite straightforward--it's just building a docker
container. I2 and I3 propose to add these changes:
1. prior to building the containers, create a lockfile
2. rebuild all containers when the lockfile changes, rather than just updating
one
Change 2 shouldn't significantly burden developers unless the build process is
error-prone. We should fix that if it is, and perhaps a nightly container build
isn't a bad Jenkins job to configure so we can address these failures as they
happen.
Change 1 is the one that needs to be thought through. Ordinarily, building a
lockfile means running a single command on the target platform, and the process
often takes about a minute--not a significant burden. The tricky part is that
we have multiple platforms in the CI (three, in fact: manylinux1_i686,
manylinux1_x86_64, and manylinux_arm), and it's not possible to build a
lockfile for a platform other than the one you're running on.
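For reference, the "single command" is usually one of the following, depending
on how the requirements end up being declared (paths and output filenames are
illustrative):

```bash
# Option A: poetry, if dependencies are declared in pyproject.toml
poetry lock

# Option B: plain pip -- resolve and install the requirements, then freeze the result
pip install -r python/requirements.txt
pip freeze > ci-constraints-manylinux1_x86_64.txt
```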
Thinking this through again, I realized that this is actually the same problem
we have with Windows: it's not possible to use a single lockfile for all
platforms. If you switch platforms, you have to accept that you may need to
switch dependency versions.
So let's return to the ultimate goals of the I2/I3 approach:
1. Ensure all features of TVM can actually be used simultaneously on a given
platform.
2. Standardize dependencies so that the unittests don't fail for different
reasons on different containers.
3. Provide enough input to allow the `tlcpack` build process to place narrower
constraints on the `install_requires` included there, based on the versions
used in CI.
In light of the above, all of these goals can be achieved only within a single
pip platform. So I'd propose a new way to install dependencies that addresses
this, I4:
I4. A multi-part approach:
1. For each pip platform that has a ci- container built around it: produce a
lockfile either by:
1. `pip install -r` all TVM requirements and `pip freeze`
2. `poetry lock`
The lockfile becomes part of the artifacts created when the containers are
built. When multiple containers are built for the same pip platform (e.g.
ci-lint, ci-cpu, ci-gpu, ci-qemu, ci-wasm), pick one container to build the
lockfile. The rest merely consume this lockfile when installing packages,
either as a pip constraints file or as a poetry lockfile (see the sketch after
this list).
2. I would argue that across platforms, the version standardization we care
about is effectively pinning TVM's direct Python dependencies to the minor
version level. For instance, if we add Windows tests, we'd prefer to test
against e.g. tensorflow 1.6 on both Linux and Windows. To address this, when
building containers, we would first build a pilot platform, e.g.
`manylinux1_x86_64`, scan the resulting lockfile for TVM's direct dependencies,
and produce a loose constraints file for the remaining platforms (a sketch of
this transformation follows this list). For example, if the lockfile contained
an entry `tensorflow==1.6.8`, we might add `tensorflow>=1.6,<1.7` to the
cross-platform constraints file.
-> This may not always be possible (conda may not update fast enough for us,
etc.), so we will need a per-platform override file. This is complex, but
something like it is likely happening in the release process already, and the
reasons why different package versions get installed aren't documented
anywhere.
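Below is a minimal sketch of what part 2 could look like, assuming the direct
dependencies are listed as bare package names in a requirements file and the
lockfile pins entries as `pkg==X.Y.Z`; all filenames here are placeholders:

```bash
# Hypothetical sketch: relax the pilot platform's exact pins into minor-version
# ranges, keeping only TVM's direct dependencies.
: > cross-platform-constraints.txt
while read -r pkg; do
  # skip blank lines and comments in the requirements file
  [ -z "$pkg" ] && continue
  case "$pkg" in "#"*) continue ;; esac
  # look up this package's exact pin in the pilot platform's lockfile
  pin=$(grep -i "^${pkg}==" ci-constraints-manylinux1_x86_64.txt) || continue
  ver=${pin#*==}           # e.g. 1.6.8
  major=${ver%%.*}         # 1
  rest=${ver#*.}           # 6.8
  minor=${rest%%.*}        # 6
  echo "${pkg}>=${major}.${minor},<${major}.$((minor + 1))" >> cross-platform-constraints.txt
done < python/requirements.txt

# Other platforms (e.g. Windows) would then install against the loose ranges,
# while containers sharing the pilot's pip platform consume the exact lockfile:
#   pip install -c cross-platform-constraints.txt -r python/requirements.txt
#   pip install -c ci-constraints-manylinux1_x86_64.txt -r python/requirements.txt
```

The exact mechanics matter less than the invariant: direct dependencies stay
within the same minor version everywhere, while exact transitive pins are only
enforced within a single pip platform.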
### Checking in new containers
To check in the built containers, these things need to happen:
1. upload candidate containers to Docker Hub under your username (example commands below)
2. prepare a TVM PR which updates the Jenkinsfile to reference these test
containers
3. have a TVM committer push the PR to
[ci-docker-staging](https://github.com/apache/tvm/tree/ci-docker-staging).
4. await test results and iterate as needed.
5. push candidate containers to the `tlcpack` Docker Hub account
6. amend the TVM PR to point at `tlcpack` images and submit it.
Here we're proposing that in step 2, we also include the per-platform lockfiles
produced by the container builds.
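For reference, steps 1 and 5 mostly amount to `docker tag`/`docker push`; the
image names and tags below are placeholders:

```bash
# Step 1: push candidate images under your personal Docker Hub account
docker tag tvm.ci_gpu:latest yourusername/ci-gpu:v0.XX-candidate
docker push yourusername/ci-gpu:v0.XX-candidate

# Step 5: once ci-docker-staging passes, retag and push under the tlcpack account
docker tag yourusername/ci-gpu:v0.XX-candidate tlcpack/ci-gpu:v0.XX
docker push tlcpack/ci-gpu:v0.XX
```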
### `tlcpack` Dependencies
Finally: I4 proposes we produce a cross-platform constraints file. This isn't
discussed as part of this RFC (it's the next one), but it would be nice to
produce a `tlcpack` wheel which includes some loose version constraints (e.g.
so that by depending on tensorflow, a user running `pip install tlcpack` would
install tensorflow 1.x and not 2.x). The version constraints in the
cross-platform constraints file are exactly those we should use in such a
`tlcpack` wheel.
Previously, the follow-on RFC had proposed to analyze the CI lockfile and
produce the `tlcpack` dependencies. I4 just proposes to move that analysis
earlier, which effectively means that the CI would be testing against the same
constraints present in the `tlcpack` wheel. This seems like an improvement to
me.