Hi all, I have put together a proposal for implementing remote caching support in the SCons codebase. Please send any comments that you have. A pdf version is attached. The unformatted text contents are below:
-------------------------------
This page is intended to provide a summary of our proposal for remote caching
in SCons in the upstream code.
1 Introduction
1.1 Background
The SCons build system has support for a local cache in its CacheDir class, but
no support for remote caching. At least one company that does all of its
development in the Amazon cloud has overlaid the CacheDir class on top of a
shared directory to provide distributed caching, but that is not a solution for
distributed development teams.
1.2 Hashing System
SCons currently only uses MD5 hashes. This document is mostly agnostic about
the hash system, discussing hash choices only where they affect the security or
design of this feature. We intentionally use the term "content signature" (csig)
as a synonym for hash.
1.3 Proposed Solution
We propose the following design:
1. Bazel remote cache server will be used as the action metadata and binary
content store
a. This provides a battle-tested solution that is known to scale.
2. Remote cache fetch (off by default) can be enabled as asynchronous
communication
3. Remote cache push (off by default) can be enabled as asynchronous
communication
This is accomplished using the following high-level architecture in SCons:
1. A RemoteCache class in SCons's core layer owns communication with the Bazel
remote cache server
2. SCons opts into a new scheduler from Job.py iff cache fetch is enabled
3. Together, these solutions ensure both CPUs and cache fetching are optimally
scheduled/utilized
Important note: when remote caching is enabled, local directory-based caching
via the CacheDir class will be disabled. It may be possible to support both
together in the future, but that is not currently planned.
2 Technical Design
2.1 Bazel Remote Cache Server
2.1.1 Integration Design
The Bazel remote cache server is an existing server component that is designed
to support GET, HEAD, and PUT requests for the following types of information:
1. Action metadata, stored under the /ac/ section
2. Binary data, stored under the /cas/ section
One important design principle for the Bazel remote cache server is that the
signature of the action metadata stored under the /ac/ section does not need to
match the actual hash of the data, but the binary data under /cas/ does need to
match. Specifically, the Bazel remote cache will reject PUT requests under
/cas/ if the hash of the data that it receives does not match the hash provided
in the URL, but it performs no such check for PUT requests under /ac/.
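The acceptance rule above can be illustrated with a toy model. This is only a sketch of the behavior described in this section, not the server's actual code; the function name server_accepts_put is hypothetical, and SHA-256 is used because that is what the server supports today.

```python
import hashlib


def server_accepts_put(section, key, body):
    """Toy model of the PUT rule described above (hypothetical helper)."""
    if section == "cas":
        # /cas/ is content-addressed: the key in the URL must be the
        # hash of the bytes being uploaded.
        return hashlib.sha256(body).hexdigest() == key
    if section == "ac":
        # /ac/ keys are opaque to the server: the body is not hashed
        # against the key, so any key is accepted.
        return True
    raise ValueError("unknown section: %r" % section)


blob = b"example object file contents"
good_key = hashlib.sha256(blob).hexdigest()
bad_key = "0" * 64

print(server_accepts_put("cas", good_key, blob))
print(server_accepts_put("cas", bad_key, blob))
print(server_accepts_put("ac", bad_key, blob))
```

This is why the /ac/ key can be derived from the task's build signatures while each /cas/ key has to be the csig of the file itself.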
2.1.2 Action Metadata
We will store one set of action metadata under the /ac/ section per SCons Task
object. This will allow us to do remote cache lookup on a per-Task basis. The
action metadata will be formatted in JSON and will only contain a list of the
targets for the task and any supporting data that we need, which depends on
the platform. All platforms will include the csig and size. Posix platforms
will also include the mode. So for example, we could retrieve the following
data for a task built on Windows with two targets:
    {
        "build/subfolder/a.dll":
        {
            "csig": "<hash_of_file_contents>",
            "size": <file_size_in_bytes>
        },
        "build/subfolder/a.lib":
        {
            "csig": "<hash_of_file_contents>",
            "size": <file_size_in_bytes>
        }
    }
or on Linux we could retrieve the following data for a task built with one
target:
    {
        "build/subfolder/a.so":
        {
            "csig": "<hash_of_file_contents>",
            "size": <file_size_in_bytes>,
            "mode": <st_mode_value>
        }
    }
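As a rough sketch of how such a record could be assembled, the snippet below builds the per-target entries from os.stat. The helper name build_action_metadata is hypothetical, and the csig values are assumed to have been computed already by SCons; only the field names (csig, size, mode) come from this proposal.

```python
import json
import os
import tempfile


def build_action_metadata(target_csigs):
    """Map each target path to its csig, size, and (on posix) mode.

    target_csigs: dict of target path -> already computed csig string.
    Hypothetical helper, not part of SCons.
    """
    metadata = {}
    for path, csig in target_csigs.items():
        st = os.stat(path)
        entry = {"csig": csig, "size": st.st_size}
        if os.name == "posix":
            # Only posix platforms record the file mode.
            entry["mode"] = st.st_mode
        metadata[path] = entry
    return metadata


# Small demonstration using a throwaway 4-byte file as the "target".
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "a.so")
    with open(target, "wb") as f:
        f.write(b"\x7fELF")
    demo_metadata = build_action_metadata({target: "<hash_of_file_contents>"})
    print(json.dumps(demo_metadata, indent=2))
```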
We will generate the hash used in the /ac/ request using the following code:
    return SCons.Util.SignatureCollect(
        ['scons-metadata-version-%d' % self.metadata_version] +
        [t.get_cachedir_bsig() for t in task.targets])
The actual hash mechanism for SignatureCollect is TBD. See the Threat Modeling
section at the end for more details about hash mechanism choices.
This metadata_version identifier will start at 1 and will be incremented every
time we change the contents of the metadata. This allows us to change the
metadata format without colliding with old consumers.
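To make the effect of the version prefix concrete, here is a standalone sketch. Since the real hash mechanism for SignatureCollect is TBD, signature_collect below is a hypothetical stand-in that simply hashes the joined signatures, and SHA-256 is an arbitrary choice for the demonstration.

```python
import hashlib


def signature_collect(signatures, hash_name="sha256"):
    """Hypothetical stand-in for SignatureCollect: hash the joined inputs."""
    h = hashlib.new(hash_name)
    h.update(", ".join(signatures).encode("utf-8"))
    return h.hexdigest()


def action_cache_key(metadata_version, target_bsigs):
    """Build the /ac/ key from the metadata version and the target bsigs."""
    return signature_collect(
        ["scons-metadata-version-%d" % metadata_version] + list(target_bsigs))


bsigs = ["<bsig_of_a.dll>", "<bsig_of_a.lib>"]  # placeholder build signatures
key_v1 = action_cache_key(1, bsigs)
key_v2 = action_cache_key(2, bsigs)
print(key_v1 != key_v2)
```

Because the version string is part of the hashed input, the same targets map to a different /ac/ key after a format change, so old and new clients never read each other's records.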
2.1.3 Changes Needed To Bazel Remote Cache Server
Currently the Bazel remote cache server only supports SHA-256 for requests
(e.g. GET http://bazel-cache.corp.int/cache/ac/<sha_256_hash>), while SCons by
default uses MD5. As part of this project, VMware will be contributing code to
the upstream Bazel remote cache server project to support MD5 and SHA-1. We
have received confirmation from the project maintainer that (1) it is
acceptable to do this and (2) no prefix is needed for these alternative hashing
formats. As a result, the requests SCons would make would be of the form
http://bazel-cache.corp.int/cache/ac/<md5_hash> or
http://bazel-cache.corp.int/cache/ac/<sha1_hash>. As mentioned before, see the
Threat Modeling section at the end of this page for more discussion on hash
formats.
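One reason a prefix may be unnecessary is that the hex digests of these three algorithms have different lengths (32, 40, and 64 characters). The sketch below builds request URLs in the form shown above and classifies a digest by length. Whether the server actually distinguishes hash types this way is an assumption on our part, and both function names are hypothetical.

```python
# Hex digest lengths for the three algorithms discussed above.
DIGEST_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}


def guess_hash_type(digest):
    """Classify a hex digest by its length (illustrative assumption)."""
    try:
        return DIGEST_LENGTHS[len(digest)]
    except KeyError:
        raise ValueError("unrecognized digest length: %d" % len(digest))


def cache_url(base_url, section, digest):
    """Build a URL such as <base_url>/ac/<digest> with no hash-type prefix."""
    if section not in ("ac", "cas"):
        raise ValueError("unknown section: %r" % section)
    return "%s/%s/%s" % (base_url.rstrip("/"), section, digest)


example_md5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of the empty string
url = cache_url("http://bazel-cache.corp.int/cache/", "ac", example_md5)
print(url)
print(guess_hash_type(example_md5))
```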
2.2 SCons Integration
2.2.1 New Command-Line Flags
We propose the introduction of the following flags to SCons to control remote
caching behavior:
1. --remote-cache-url
a. URL of remote cache server
b. Default: empty string
2. --remote-cache-fetch-enabled
a. When True, fetches out-of-date tasks from remote cache before compilation
b. Default: False
c. Requires: --remote-cache-url
3. --remote-cache-push-enabled
a. When True, pushes tasks to remote cache after compilation
b. Default: False
c. Requires: --remote-cache-url
4. --remote-cache-connections
a. Specifies the number of threads that the RemoteCache class will maintain
to dispatch asynchronous cache requests. Only applies if push and/or fetch is
asynchronous.
b. Default: 20
All parameters will also be settable via SetOption so that SConscripts can
override dynamically.
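The flag semantics above, including the "Requires" constraints, can be sketched with the standard argparse module. This is not how SCons registers its options internally; it is only a standalone illustration, and treating the two enable flags as simple on/off switches is an assumption.

```python
import argparse


def parse_remote_cache_options(argv):
    """Parse the four proposed flags and enforce the URL requirement."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--remote-cache-url", default="")
    parser.add_argument("--remote-cache-fetch-enabled", action="store_true")
    parser.add_argument("--remote-cache-push-enabled", action="store_true")
    parser.add_argument("--remote-cache-connections", type=int, default=20)
    opts = parser.parse_args(argv)
    wants_cache = (opts.remote_cache_fetch_enabled or
                   opts.remote_cache_push_enabled)
    if wants_cache and not opts.remote_cache_url:
        # parser.error() prints a usage message and raises SystemExit.
        parser.error("fetch/push require --remote-cache-url")
    return opts


opts = parse_remote_cache_options([
    "--remote-cache-url", "http://bazel-cache.corp.int/cache",
    "--remote-cache-fetch-enabled",
])
print(opts.remote_cache_fetch_enabled, opts.remote_cache_push_enabled,
      opts.remote_cache_connections)
```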
2.2.2 RemoteCache Module
The RemoteCache module owns all network communication. It requires that the
following modules be available:
1. urllib3
2. concurrent.futures (specifically ThreadPoolExecutor)
This class is responsible for doing remote cache push and fetch. It is entirely
platform-agnostic and is intended to have no platform-specific code. The
interface is as follows:
    def raise_if_not_supported():
        """Raises an exception if the required libraries are not available."""

    class RemoteCache:
        def async_enabled(self):
            """Returns True if asynchronous fetch and/or push are enabled."""

        def set_fetch_response_queue(self, queue):
            """
            Sets the queue used to report cache fetch results if fetching
            is enabled.
            """

        def fetch_task(self, task):
            """
            Dispatches a request to a helper thread to fetch a task from the
            remote cache.
            Returns True if we submitted the task to the thread pool and False
            in all other cases.
            """

        def push_task(self, task):
            """
            Dispatches a request to a helper thread to push a task to the
            remote cache.
            """

        def close(self):
            """
            Releases any resources that this class acquired (e.g. the
            ThreadPoolExecutor).
            Note: This function may be removed if we move ownership of the
            ThreadPoolExecutor to the consumer for testability purposes.
            """
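To show how the fetch side of this interface could fit together, here is a minimal sketch built on ThreadPoolExecutor and a result queue. The class name RemoteCacheSketch, the injected fetch_fn callable (standing in for the real urllib3 requests), and the (task, hit) tuple format on the queue are all assumptions made for illustration.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class RemoteCacheSketch:
    def __init__(self, fetch_fn, connections=20):
        self._fetch_fn = fetch_fn  # hypothetical: task -> True on cache hit
        self._executor = ThreadPoolExecutor(max_workers=connections)
        self._fetch_response_queue = None

    def set_fetch_response_queue(self, q):
        self._fetch_response_queue = q

    def fetch_task(self, task):
        # Without a result queue there is nowhere to report, so decline.
        if self._fetch_response_queue is None:
            return False
        self._executor.submit(self._fetch_worker, task)
        return True

    def _fetch_worker(self, task):
        try:
            hit = bool(self._fetch_fn(task))
        except Exception:
            hit = False  # treat any failure as a miss so the build proceeds
        self._fetch_response_queue.put((task, hit))

    def close(self):
        # Waits for outstanding requests, then releases the worker threads.
        self._executor.shutdown(wait=True)


results = queue.Queue()
cache = RemoteCacheSketch(fetch_fn=lambda task: task == "a.o")
cache.set_fetch_response_queue(results)
cache.fetch_task("a.o")
cache.fetch_task("b.o")
cache.close()
fetched = sorted(results.get() for _ in range(2))
print(fetched)
```

Reporting through a queue rather than returning a future keeps the scheduler's main loop in control of when results are consumed.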
2.2.3 New Job Scheduler For Fetches
As mentioned earlier, we will introduce a new job scheduler to be used when
fetches are enabled. This scheduler is needed for two main reasons:
1. The existing Parallel scheduler is not efficient because it waits on jobs to
be done when it should be fetching more tasks (see SCons pull request 3386 for
more discussion)
2. The existing Parallel scheduler manages only one Queue, the job queue, while
the new scheduler will need to manage two Queues: the job queue and the cache
fetch result queue
The goal of this scheduler is to most efficiently manage the following three
responsibilities:
1. Scanning for ready tasks using taskmaster.next_task()
2. Processing asynchronous cache fetch results
3. Processing action-related job results
The existing Parallel scheduler performs steps #1 and #3 above, but wastes time
waiting on #3. The new scheduler is expected to be more efficient because it
waits for queue results (steps #2 and #3 above) only when it expects all other
sources of work to be exhausted. For example, it would wait on cache fetch
results only if the last run of taskmaster.next_task() returned None and we
have no active jobs. Or as another example, it would wait on action-related job
results only if the last run of taskmaster.next_task() returned None and we
have no active asynchronous cache fetches.
The behavior of this job scheduler can be summarized using the following
pseudocode:
    class ParallelRemoteCache(Parallel):
        def start(self):
            while True:
                if we expect to have tasks left to scan:
                    get the next task to execute
                    if task is found and it is out of date:
                        try to fetch task asynchronously using the RemoteCache class
                if we don't have a task, there are no active jobs, and there are no pending cache fetches:
                    build is done!
                for each pending job that is done:
                    process the result of the job
                for each cache fetch that is done:
                    if it is a cache miss:
                        send task to job scheduler thread pool
                    else:
                        mark task as executed
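The loop above can be exercised with a small, self-contained simulation. This is a sketch under simplifying assumptions, not the proposed implementation: run_build is a hypothetical function, every task is treated as ready and out of date, the "remote cache" is just a set of task names, and building a task does nothing but report completion.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_build(tasks, cached, num_jobs=2, num_fetchers=4):
    pending = iter(tasks)            # stands in for taskmaster.next_task()
    scan_exhausted = False
    job_results, fetch_results = queue.Queue(), queue.Queue()
    active_jobs = active_fetches = 0
    fetched, built = [], []

    def fetch(task):                 # pretend cache lookup
        fetch_results.put((task, task in cached))

    def build(task):                 # pretend action execution
        job_results.put(task)

    with ThreadPoolExecutor(num_fetchers) as fetchers, \
            ThreadPoolExecutor(num_jobs) as jobs:
        while True:
            if not scan_exhausted:
                task = next(pending, None)
                if task is None:
                    scan_exhausted = True
                else:
                    fetchers.submit(fetch, task)
                    active_fetches += 1
            if scan_exhausted and active_jobs == 0 and active_fetches == 0:
                break                # build is done
            # Block on a queue only when every other source of work is idle.
            # If both jobs and fetches are active, this simple sketch polls.
            wait_jobs = scan_exhausted and active_fetches == 0
            wait_fetches = scan_exhausted and active_jobs == 0
            while active_jobs:
                try:
                    done = job_results.get(block=wait_jobs)
                except queue.Empty:
                    break
                built.append(done)
                active_jobs -= 1
                wait_jobs = False    # only the first get may block
            while active_fetches:
                try:
                    task, hit = fetch_results.get(block=wait_fetches)
                except queue.Empty:
                    break
                active_fetches -= 1
                wait_fetches = False
                if hit:
                    fetched.append(task)       # mark task as executed
                else:
                    jobs.submit(build, task)   # cache miss: build it
                    active_jobs += 1
    return sorted(fetched), sorted(built)


outcome = run_build(["a.o", "b.o", "c.o", "d.o"], cached={"a.o", "c.o"})
print(outcome)
```

With the inputs shown, a.o and c.o are served from the pretend cache and b.o and d.o are built locally.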
3 Threat Modeling
3.1 SCons MD5 Hash Usage and Man-In-The-Middle Attacks
SCons MD5 hashes are not cryptographically secure, so attention must be paid to
areas in which there are real attack surfaces. For threat modeling purposes,
the Bazel remote cache assumes the following:
1. Entities you may not trust (e.g. your coworkers) can perform GET and HEAD
requests to the cache server
2. Only trusted entities (e.g. official build farms, Jenkins, or Travis) can
perform PUT requests to the cache server
These assumptions make us less concerned about cache poisoning. However, the
Bazel remote cache can be configured to use http, so there is a legitimate
attack surface related to man-in-the-middle attacks.
The Bazel remote cache requires that /ac/ and /cas/ requests use the same
scheme, so unfortunately we can't do https requests to /ac/ and http requests
to /cas/. With that limitation in mind, I believe that SCons MD5 hash usage
does not open us up to any new attack surfaces. That is:
1. If an organization uses an http transport, a man-in-the-middle can modify /ac/
results and there is nothing that we can do to prevent it.
2. If an organization uses an https transport, a man-in-the-middle cannot modify
any results so we do not worry about intentional nefarious hash collision.
