Welcome to the Treadmill Testbed Book!
This book is the primary documentation of the Treadmill Testbed project, software and hardware. It is divided up into the following main chapters:
- The User Guide: start here if you are new to Treadmill and you want to run interactive development sessions or use it to run some automated Continuous Integration workloads.
- Documentation of the public treadmill.ci deployment: this chapter documents individual deployments (sites) that are connected to the main Treadmill Switchboard instance hosted at `swx.treadmill.ci`. You can find the available hardware resources and documentation of the individual setups there.
- The Operator's Guide: this chapter contains guides and other useful information for operators of the Treadmill Switchboard or site deployments.
- Internals: documentation of the components that make up Treadmill, such as the switchboard, puppet, or supervisor crates.
Terminology
Treadmill is a distributed system composed of multiple different components. While individual deployments may differ and feature different sets of components, this guide establishes the following general terminology:
The Treadmill testbed / the Treadmill system
Describes the overall Treadmill system, including all deployments and central components. Excludes external actors, such as users or platforms that interact with the Treadmill system.
Device Under Test / DUT
A chip, board, or other device that can be programmed, debugged, interacted with, or otherwise controlled by users or the Treadmill testbed. Typically, DUTs will be development boards that feature microcontrollers or SoCs, like a Nordic Semiconductor nRF52840DK board.
Site
A collection of DUTs and other shared, non-global infrastructure. A site has one or more DUTs, companion software and hardware per DUT, and also includes central software or hardware components shared among DUTs.
Deployment
A physical deployment of a part of a Treadmill system, hosting one or more DUTs. A deployment may consist of multiple Treadmill sites. A deployment is not a concept used in the Treadmill system itself, but used in these and other documents to denote parts of the Treadmill system that are in physically distinct locations and/or under differing administrative control.
Switchboard
The single, centralized controller of a Treadmill testbed. A single Switchboard instance coordinates and orchestrates workloads across multiple Treadmill sites. It also implements central authentication and authorization mechanisms and is the authoritative data source for many of these subsystems.
Supervisor
A supervisor is responsible for managing interactions with a DUT and runs on site-local infrastructure. Examples are QEMU supervisors that manage virtual machines connected to a DUT, or Netboot Supervisors that exercise control over other hosts connected to a DUT. Supervisors connect to and are managed by the Switchboard.
Host
A device or environment exposed to users of the Treadmill system, connected to a DUT. Hosts can be virtual machines, dedicated hardware hosts, containers, or other environments. Hosts run software that is able to interact with and control a DUT. Each Host is managed by a Supervisor.
User Guide
Integrating Treadmill with GitHub Actions
By integrating Treadmill into your GitHub Actions workflows, you can automatically test your code on real hardware while retaining the convenience of GitHub Actions' reusable action snippets, declarative configuration, and integration with GitHub repositories. This guide describes how to
- wire up your repository such that it can automatically launch Treadmill jobs for some Actions CI workflow,
- attach a GitHub Actions runner on the Treadmill host to your repository, and
- define a workflow that launches a set of Treadmill jobs to test different parts of your code on a set of different hardware boards.
Just-in-time GitHub Actions Runners
GitHub Actions jobs usually execute on hosted runners provided by GitHub. These runners can be selected by using the appropriate `runs-on` selector, for instance by specifying a label such as `ubuntu-latest`. However, these runners do not have access to hardware targets.
In contrast, a Treadmill job provides access to an ephemeral host environment, running a user-supplied image, which itself has access to a hardware target. This host can be used for interactive sessions or, when supplied with an appropriate image, automatically execute some software on startup.
We can use this abstraction that Treadmill provides to run a GitHub Actions self-hosted runner. This software can execute regular GitHub Actions workflows on a user-supplied compute resource instead of a GitHub-provided hosted runner. However, these runners typically execute workflows directly on the machine without strong isolation (without a container or VM) and are long-lived: once registered, they can run multiple workflows until they are deregistered. We must therefore ensure two things: first, that workflows cannot accidentally or maliciously influence future workflow runs, and second, that each GitHub workflow runs on exactly the Treadmill host it is supposed to execute on.
Treadmill's ephemeral job environments alleviate the first concern: Treadmill jobs always run from a clean, ephemeral image and thus provide a reasonable degree of isolation between jobs. We can ensure the second property using GitHub's concept of just-in-time runners: these Actions runners are registered using a single-use token and are able to execute exactly one job. This ensures that each GitHub Actions workflow runs in a fresh Treadmill job environment, and that this job is only able to execute exactly one workflow.
Automatically Launching a Treadmill Job & GitHub Actions Runner
Treadmill does not have a native GitHub Actions integration. Instead, we use a small Actions job running on a GitHub-hosted runner (called `test-prepare`) to prepare a new Treadmill job and register it as a self-hosted runner. A second job (called `test-execute`) then runs the actual test workloads on this runner.
We start by defining a workflow called `treadmill-ci` in our repository, under `.github/workflows/treadmill-ci.yml`. Its `test-prepare` job will first compile the Treadmill CLI client and proceed to log into the Treadmill testbed.
name: treadmill-ci
# You can customize these triggers to your preference:
on:
  pull_request: # Run CI for PRs on any branch
  merge_group: # Run CI for the GitHub merge queue
jobs:
  test-prepare:
    # Run this first step on GitHub's hosted infrastructure
    runs-on: ubuntu-latest
    # Expose a few values to the test-execute job:
    outputs:
      runner-id: ${{ steps.gh-actions-jit-runner-config.outputs.runner-id }}
      tml-job-id: ${{ steps.treadmill-job-launch.outputs.tml-job-id }}
    steps:
      # Required to compile the Treadmill CLI client:
      - uses: actions-rust-lang/setup-rust-toolchain@v1
      # Fetch the source of the Treadmill CLI client. We do not yet provide
      # pre-compiled binaries:
      - name: Checkout Treadmill repository
        uses: actions/checkout@v4
        with:
          repository: treadmill-tb/treadmill
          path: treadmill
      # This greatly speeds up future workflow runs:
      - name: Cache Treadmill CLI compilation artifacts
        id: cache-tml-cli
        uses: actions/cache@v4
        with:
          path: treadmill/target
          key: ${{ runner.os }}-tml-cli
      - name: Compile the Treadmill CLI binary
        run: |
          pushd treadmill
          cargo build --package tml-cli
          popd
          echo "$PWD/treadmill/target/debug" >> "$GITHUB_PATH"
Registering a Just-in-time GitHub Actions Runner
Next, we need to create a registration token for the just-in-time GitHub Actions runner that will run on a Treadmill host. Unfortunately, GitHub does not provide an easy way to do this from within a GitHub Actions workflow. Notably, the GitHub API token that is provided to GitHub Actions workflows by default does not have the required capabilities to create new just-in-time GitHub Actions runners. Instead, we have to first create a GitHub App with the required permissions and use its API token to register the runner.
To create a new GitHub App, navigate to your `Organization Settings` → `Developer Settings` → `GitHub Apps`. Here, you are able to create a new application. You can leave most fields blank, as we're not actually using any of the app's features apart from its API token.
Disable `Webhook` → `Active` and leave the `Webhook URL` and `Secret` empty. Under `Permissions`, the app requires only the `Repository permissions` → `Administration` to be set to `Access: Read and write`. Set `Where can this GitHub App be installed` → `Only on this account`.
With the app created, you should be presented with its settings page. We do not need to generate a client secret for the app. Instead, scroll down until you see the `Private keys` section and click `Generate a private key`. Your browser should download a `.pem` file after a short while.
We need to make this private key accessible to the GitHub Actions workflow. For this, navigate to your repository's `Settings` (not the organization settings) → `Secrets and variables` → `Actions`. Create a new Actions variable called `TREADMILL_GH_APP_CLIENT_ID` and set its contents to your application's client ID (from its settings page). Create a new Actions secret called `TREADMILL_GH_APP_PRIVATE_KEY` and copy the contents of the downloaded `.pem` file into the secret.
Finally, we need to install this application into the target repository. For this, navigate to your application settings (under `Organization Settings` → `Developer Settings` → `GitHub Apps` → your application name → `Edit`), select `Install App`, and click `Install` next to your organization. You can choose to only install the app for one repository.
With the app ready, we can extend the GitHub Actions workflow to obtain an API token that is able to create new just-in-time runners from within our workflow:
      - name: Generate a token to register new just-in-time runners
        id: generate-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ vars.TREADMILL_GH_APP_CLIENT_ID }}
          private-key: ${{ secrets.TREADMILL_GH_APP_PRIVATE_KEY }}
          owner: ${{ github.repository_owner }}
Finally, we can create a new just-in-time runner in a subsequent step:
      - name: Create GitHub just-in-time runner
        id: gh-actions-jit-runner-config
        env:
          GH_TOKEN: ${{ steps.generate-token.outputs.token }}
        run: |
          # Create a unique string that identifies this runner across all
          # workflow invocations and attempts in this repository:
          RUNNER_ID="tml-gh-actions-runner-${GITHUB_REPOSITORY_ID}-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"
          # Perform the API request to register the just-in-time runner.
          # Replace $YOUR_ORG/$YOUR_REPO with the owner and name of your repository:
          RUNNER_CONFIG_JSON="$(gh api \
            -H "Accept: application/vnd.github+json" \
            -H "X-GitHub-Api-Version: 2022-11-28" \
            /repos/$YOUR_ORG/$YOUR_REPO/actions/runners/generate-jitconfig \
            -f "name=$RUNNER_ID" \
            -F "runner_group_id=1" \
            -f "labels[]=$RUNNER_ID" \
            -f "work_folder=_work")"
          # The above returns a JSON object containing a base64-encoded
          # "jit config". We need to retain this value for starting the runner.
          # Provide it to subsequent steps as an output:
          echo "jitconfig=$(echo "$RUNNER_CONFIG_JSON" | jq -r '.encoded_jit_config')" >> "$GITHUB_OUTPUT"
          # The test-execute job will need to match on a specific
          # runner-label assigned to the self-hosted runner. Export our
          # runner-id here, which we've set as a label above:
          echo "runner-id=$RUNNER_ID" >> "$GITHUB_OUTPUT"
Next, we'll pass the `jitconfig` value into the Treadmill job's parameters and start a job that launches a GitHub Actions runner on boot.
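The registration request issued by the `gh api` call in the step above can be sketched in Python as follows. This is only an illustration of the request's shape and not part of the Treadmill tooling: the helper name `jit_runner_request` and the sample arguments are hypothetical, and the code builds the runner name, endpoint path, and JSON body without sending anything.

```python
def jit_runner_request(owner, repo, repository_id, run_id, run_attempt):
    """Build the runner name, API path, and body of a generate-jitconfig request.

    Hypothetical helper that mirrors the workflow step; it performs no network I/O.
    """
    # Unique per repository, workflow run, and attempt, as in the shell step.
    runner_id = f"tml-gh-actions-runner-{repository_id}-{run_id}-{run_attempt}"
    path = f"/repos/{owner}/{repo}/actions/runners/generate-jitconfig"
    body = {
        "name": runner_id,
        "runner_group_id": 1,
        # The runner ID doubles as a label, so that `runs-on` can select it.
        "labels": [runner_id],
        "work_folder": "_work",
    }
    return runner_id, path, body


# Sample values only; in a workflow these come from GITHUB_* variables.
runner_id, path, body = jit_runner_request("example-org", "example-repo", 1234, 5678, 1)
print(runner_id)
print(path)
print(body)
```

Using the runner ID both as the runner's name and as its only label is what later allows a job to target exactly this runner.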
Launching a Treadmill Job
With the runner registration token generated, we're ready to launch the Treadmill job that will ultimately host this runner. For this, we add another workflow step as follows:
      - name: Launch Treadmill job
        id: treadmill-job-launch
        env:
          TML_API_TOKEN: ${{ secrets.TREADMILL_API_TOKEN }}
          # A Treadmill GitHub Actions image which includes the self-hosted runner:
          IMAGE_ID: "d407b09b9f56c666d0d3350890e364ba16aad08b484f4ca1de19d42569cc79b1"
          DUT_BOARD: "nrf52840dk"
        run: |
          echo "Enqueueing Treadmill job:"
          # Manually create a JSON object that specifies the job parameters and
          # contains the registration token for the GitHub Actions runner.
          #
          # The Treadmill GitHub Actions images will search for this parameter
          # and use it to configure their included self-hosted runner:
          TML_JOB_PARAMETERS="{\
            \"gh-actions-runner-encoded-jit-config\": {\
              \"secret\": true, \"value\": \"${{ steps.gh-actions-jit-runner-config.outputs.jitconfig }}\"\
            }
          }"
          # Finally, run the `job enqueue` command. You can optionally specify
          # SSH keys for interactive debugging:
          TML_JOB_ID_JSON="$(tml job enqueue \
            "$IMAGE_ID" \
            --tag-config "board:$DUT_BOARD" \
            --parameters "$TML_JOB_PARAMETERS" \
          )"
          TML_JOB_ID="$(echo "$TML_JOB_ID_JSON" | jq -r .job_id)"
          echo "Enqueued Treadmill job with ID $TML_JOB_ID"
          # Pass the job IDs and other configuration data into the outputs of
          # this step, such that we can run test-execute job instances for each
          # Treadmill job we've started:
          echo "tml-job-id=\"$TML_JOB_ID\"" >> "$GITHUB_OUTPUT"
We provide another repository secret, called `TREADMILL_API_TOKEN`, to this step through the `TML_API_TOKEN` environment variable. The `tml` CLI client will detect this environment variable and use the API token to authenticate against the Switchboard API.
This step takes in the `jitconfig` output from the previous step and enqueues a new Treadmill job that is parameterized over this value. It is important to set the Treadmill image ID to an image which is configured to run a GitHub Actions self-hosted runner on bootup and performs the necessary configuration based on the `gh-actions-runner-encoded-jit-config` parameter.
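Hand-escaping JSON inside a shell string is easy to get wrong. As an illustration of the object that the step constructs, the Python sketch below builds the same parameters with a JSON library. Only the parameter name and the `secret`/`value` fields are taken from the step above; the function name `job_parameters` and the sample value are assumptions made for this example.

```python
import json


def job_parameters(encoded_jit_config: str) -> str:
    """Serialize the job parameters object carrying the encoded JIT config.

    Hypothetical helper; json.dumps handles quoting and escaping of the value.
    """
    params = {
        "gh-actions-runner-encoded-jit-config": {
            # Marked as secret, as in the workflow step.
            "secret": True,
            "value": encoded_jit_config,
        }
    }
    return json.dumps(params)


# Sample base64 string standing in for the real encoded JIT config.
print(job_parameters("ZXhhbXBsZQ=="))
```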
After this step is executed, the Treadmill testbed will launch this job on an appropriate host (selected by tag `board:$DUT_BOARD`) and the host will register a new GitHub Actions runner. We now define the part of the Actions workflow file that runs on the Treadmill host itself.
Running a GitHub Actions Job on the Treadmill Host
To run a job on our newly started host, we add another job definition to our workflow file. Importantly, this second `test-execute` job has a dependency on the first `test-prepare` job. It also selects the unique `RUNNER_ID` that we've generated above as its `runs-on` target. This ensures that this job will only be eligible on the Treadmill host that we've requested for it. We can then proceed to run regular steps, as we would in any other GitHub Actions workflow file:
  test-execute:
    needs: test-prepare
    runs-on: ${{ needs.test-prepare.outputs.runner-id }}
    steps:
      - name: Print Treadmill Job Context and Debug Information
        run: |
          echo "Treadmill job id: ${{ needs.test-prepare.outputs.tml-job-id }}"
          echo "GitHub Actions Runner ID: ${{ needs.test-prepare.outputs.runner-id }}"
          echo "Network configuration:"
          ip address
          echo "Attached USB devices:"
          lsusb
          echo "Parameters:"
          ls /run/tml/parameters
      - uses: actions/checkout@v4
      - uses: actions-rust-lang/setup-rust-toolchain@v1
      - name: Build the Tock kernel
        run: |
          pushd boards/nordic/nrf52840dk
          unset RUSTFLAGS
          make
          popd
Deployments
treadmill.ci Public Images
For now, we manually update the index of all files on the image server:
[root@sns31:/var/www/a.images.treadmill.ci]# find -type f ! -path './image.txt' ! -path './all.txt' > all.txt
To create a mirror of this image store based on this index, you can use the following bash snippet, which will create the Treadmill image store structure (`images/` and `blobs/`) in your current directory and download all files not already present:
$ wget -r -p -E -K -np -nH -nc --content-disposition --trust-server-names --no-http-keep-alive -i <(wget -O- -o/dev/null https://a.images.treadmill.ci/all.txt | sed 's|^\./|https://a.images.treadmill.ci/|')
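The `sed` expression in the command above rewrites each relative index entry into an absolute URL before handing the list to `wget`. A minimal Python sketch of that same transformation, for illustration only (the function name `index_to_urls` is an assumption, and nothing is downloaded):

```python
from typing import List

BASE_URL = "https://a.images.treadmill.ci/"


def index_to_urls(index_text: str, base_url: str = BASE_URL) -> List[str]:
    """Turn the lines of all.txt into absolute URLs, skipping blank lines."""
    urls = []
    for line in index_text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        # `find` prints paths relative to the current directory as "./...";
        # like the sed expression, only this leading prefix is replaced.
        if entry.startswith("./"):
            entry = base_url + entry[2:]
        urls.append(entry)
    return urls


sample_index = (
    "./images/61/6a/37/616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05\n"
    "./blobs/6c/82/47/6c8247e4440a4f9a691f67643c1d2adf87d48b6c475bd7b83599851cec785164\n"
)
for url in index_to_urls(sample_index):
    print(url)
```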
vm-ubuntu-2204-amd64-uefi
Versions:
Build Date | git Revision | Image ID |
---|---|---|
2024-10-13 | 68cfe43a22 | 616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05 |
2024-09-24 | f7f6a60239 | 4864215aff5840792f3f871cb74d0e74170b199406a56422612efa715e72e1a5 |
Build logs:
68cfe43a22
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/fcssv6py0hh2p4hfd9w5h9pl5d3ysz5p-treadmill-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/6c/
blobs/6c/82/
blobs/6c/82/47/
blobs/6c/82/47/6c8247e4440a4f9a691f67643c1d2adf87d48b6c475bd7b83599851cec785164
images/61/
images/61/6a/
images/61/6a/37/
images/61/6a/37/616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05
sent 941,459,348 bytes received 113 bytes 81,866,040.09 bytes/sec
total size is 941,229,060 speedup is 1.00
f7f6a60239
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
f7f6a6023970684ab56515fcdedf1b5792f368f7
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/1bjwlkjbxq7nal5sbll6snh9wc0ingbv-treadmill-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/
blobs/33/
blobs/33/31/
blobs/33/31/75/
blobs/33/31/75/33317569a76291991bb8dae68a08b2369221a229192eec1ad3227d38826da281
images/
images/48/
images/48/64/
images/48/64/21/
images/48/64/21/4864215aff5840792f3f871cb74d0e74170b199406a56422612efa715e72e1a5
sent 940,869,394 bytes received 113 bytes 89,606,619.71 bytes/sec
total size is 940,639,236 speedup is 1.00
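The rsync listings above suggest how the image store is laid out: each file sits below three directory levels named after the first six hexadecimal digits of its 64-digit identifier, taken two at a time. The Python sketch below derives such a path. This layout is inferred from the logs rather than from a specification, and the helper name `store_path` is hypothetical.

```python
HEX_DIGITS = set("0123456789abcdef")


def store_path(kind: str, identifier: str) -> str:
    """Return the store-relative path of an image or blob, as seen in the logs."""
    if kind not in ("images", "blobs"):
        raise ValueError("kind must be 'images' or 'blobs'")
    identifier = identifier.lower()
    if len(identifier) != 64 or not set(identifier) <= HEX_DIGITS:
        raise ValueError("expected a 64-digit hexadecimal identifier")
    # Three nested directories, each named after two leading hex digits.
    return "/".join(
        [kind, identifier[0:2], identifier[2:4], identifier[4:6], identifier]
    )


print(store_path("images", "616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05"))
```

For the image ID from the table above, this yields the same `images/61/6a/37/...` path that appears in the rsync output.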
vm-ubuntu-2204-amd64-uefi + GitHub Actions Runner
Versions:
Build Date | git Revision | Image ID |
---|---|---|
2024-10-13 | 68cfe43a22 | 9ac6e2f62fec7d41d81df9a3b2fc40f5b4efa3e94055ea43a83e29dc77b791ee |
2024-09-24 | f7f6a60239 | 0373bb7d728b36cb6083cfe12f27038b71972ceb90563b0037d4012df7b62bf4 |
Build logs:
68cfe43a22
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build gh-actions-overlay.nix
/nix/store/25p7sbadzw5rj7b1dz23zxacw0ri8nzr-image-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/06/
blobs/06/ff/
blobs/06/ff/9f/
blobs/06/ff/9f/06ff9fbb107733147c0ab2bd92efd4a2844b42c9ec60945d8e84de1b6194ed61
blobs/6c/82/47/6c8247e4440a4f9a691f67643c1d2adf87d48b6c475bd7b83599851cec785164
images/9a/
images/9a/c6/
images/9a/c6/e2/
images/9a/c6/e2/9ac6e2f62fec7d41d81df9a3b2fc40f5b4efa3e94055ea43a83e29dc77b791ee
sent 658,136,207 bytes received 214,956 bytes 77,453,078.00 bytes/sec
total size is 1,599,080,209 speedup is 2.43
f7f6a60239
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
f7f6a6023970684ab56515fcdedf1b5792f368f7
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build gh-actions-overlay.nix
/nix/store/yzn9rhawqslvl8y7b55sq6n19lhlcxrx-image-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/33/31/75/33317569a76291991bb8dae68a08b2369221a229192eec1ad3227d38826da281
blobs/9b/
blobs/9b/bc/
blobs/9b/bc/f6/
blobs/9b/bc/f6/9bbcf6d6a67886ac58b9d6cdbb87b49e1a14ebeb8b19b99279b3d73eacdf00b0
images/03/
images/03/73/
images/03/73/bb/
images/03/73/bb/0373bb7d728b36cb6083cfe12f27038b71972ceb90563b0037d4012df7b62bf4
sent 658,398,373 bytes received 214,879 bytes 69,327,710.74 bytes/sec
total size is 1,598,752,529 speedup is 2.43
netboot-raspberrypi-nbd
Versions:
Build Date | git Revision | Image ID |
---|---|---|
2024-10-30 | a4d1690d1e | a1b25b7cd0cbea2abd2ee472761180049f31c736095f81c16c65a5877e9f2c44 |
2024-10-15 | 6803d17a74 | 5db0bcba4ca3295c83d8cb0318651b78469b90cda9f124011c2bd15a0f1f8999 |
2024-10-13 | 68cfe43a22 | f0617619bfb9a459a42b70101af65ef6b8d34631955f1d46423674e9897f26fc |
2024-10-12 | 914501ec25 | 453facb39f3d786a3ab3075358665fca850025e5b342487066f7a5c5482bd8ab |
Build logs:
a4d1690d1e
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
a4d1690d1ef9c2e330a71237913279ab90ca545d
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/nakjliab1q6cd0l3f1v2zl3c65wghfl7-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/6a/
blobs/6a/d5/
blobs/6a/d5/57/
blobs/6a/d5/57/6ad557ac9249f743b56124e0157b01e5cf28fcd5c45222b48ced16804b17eb09
blobs/9d/
blobs/9d/30/
blobs/9d/30/51/
blobs/9d/30/51/9d30513e0dc24566abb2271f269e6509db9866d313a5e0445e3afe9029542947
images/a1/
images/a1/b2/
images/a1/b2/5b/
images/a1/b2/5b/a1b25b7cd0cbea2abd2ee472761180049f31c736095f81c16c65a5877e9f2c44
sent 2,163,215,797 bytes received 148 bytes 105,522,729.02 bytes/sec
total size is 2,162,687,123 speedup is 1.00
6803d17a74
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
6803d17a74a4158e80fc6bc6fe44c64543ff0d15
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -I nixpkgs=https://github.com/nixos/nixpkgs/archive/release-24.05.tar.gz -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/b4vwspja2w7zp8slajn4zb6xydz6bdp8-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/50/a3/39/50a339bb4ec10902d7bae426fe216a8008fca81fa82ce9a8036ebad998320c98
blobs/e4/4b/bd/e44bbd64b70c8afea5f704e8b6884f7d52bee81c75b84ac443bb77e45901acbf
images/5d/b0/bc/5db0bcba4ca3295c83d8cb0318651b78469b90cda9f124011c2bd15a0f1f8999
sent 217,190 bytes received 378,929 bytes 51,836.43 bytes/sec
total size is 2,155,740,307 speedup is 3,616.29
68cfe43a22
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/wc452qz6yp2fy7qdlk0sn71rbcsky45g-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/33/24/
blobs/33/24/52/
blobs/33/24/52/3324528e034d27c28f4b58b734aab3e0b041a1c57c044bcef1a3c552ff88665a
blobs/50/
blobs/50/16/
blobs/50/16/df/
blobs/50/16/df/5016df56e359098cb3c6e44bee77ee390c71e855908e4b0a528cbf4ba5d37f4f
images/f0/
images/f0/61/
images/f0/61/76/
images/f0/61/76/f0617619bfb9a459a42b70101af65ef6b8d34631955f1d46423674e9897f26fc
sent 2,159,544,890 bytes received 145 bytes 105,343,660.24 bytes/sec
total size is 2,159,017,107 speedup is 1.00
914501ec25
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
914501ec25617613d8bc4d5ca034438e3030acf3
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/8yhb8zc7n0dj1a1y9gc1n8l9w84firk8-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/1d/67/24/1d6724e19dee478cc8b6b6e09cd8d3ba415818aac605acf4a7679159f246dcbf
blobs/44/a2/5a/44a25acaf1e384ffd6926d613cca854563bc62ad6515e1645ac4151f51c55054
images/45/
images/45/3f/
images/45/3f/ac/
images/45/3f/ac/453facb39f3d786a3ab3075358665fca850025e5b342487066f7a5c5482bd8ab
sent 222,162 bytes received 385,681 bytes 52,855.91 bytes/sec
total size is 2,098,687,124 speedup is 3,452.68
netboot-raspberrypi-nbd + GitHub Actions Runner
Versions:
Build Date | git Revision | Image ID |
---|---|---|
2024-10-30 | a4d1690d1e | df24da6c7a03d87b1b6b55162383a9dfdf48a129b5f3e648748f0f9d11cdb470 |
2024-10-15 | 6803d17a74 | 1b6900eff30f37b6d012240f63aa77a22e20934e7f6ebf38e25310552dc08378 |
2024-10-13 | 68cfe43a22 | 5f4b61324c27472b5354cd11229a0936320148cd6e852fbf05e1b7ff5b4598e6 |
2024-09-24 | 914501ec25 | df8337148b0b3c63b400955b7ea49b202f34ecb111b61cd60c45a96076d9e31a |
Build logs:
a4d1690d1e
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main) [1]> git rev-parse HEAD
a4d1690d1ef9c2e330a71237913279ab90ca545d
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build gh-actions-runner-overlay.nix
/nix/store/v751jk869i22ppplffkrc0c5jvaqbivg-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/30/
blobs/30/1b/
blobs/30/1b/82/
blobs/30/1b/82/301b824fdf94fe658a389fdc6cf147e0ff4c1f06c4403a74d00331f6aebb1798
blobs/6a/d5/57/6ad557ac9249f743b56124e0157b01e5cf28fcd5c45222b48ced16804b17eb09
blobs/9d/30/51/9d30513e0dc24566abb2271f269e6509db9866d313a5e0445e3afe9029542947
images/df/24/
images/df/24/da/
images/df/24/da/df24da6c7a03d87b1b6b55162383a9dfdf48a129b5f3e648748f0f9d11cdb470
sent 470,751,703 bytes received 379,466 bytes 32,491,804.76 bytes/sec
total size is 2,633,105,217 speedup is 5.59
6803d17a74
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
6803d17a74a4158e80fc6bc6fe44c64543ff0d15
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -I nixpkgs=https://github.com/nixos/nixpkgs/archive/release-24.05.tar.gz gh-actions-runner-overlay.nix
/nix/store/148134wsj8h3jbaz6gn7dl1igywgg48a-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/50/a3/39/50a339bb4ec10902d7bae426fe216a8008fca81fa82ce9a8036ebad998320c98
blobs/e4/4b/bd/e44bbd64b70c8afea5f704e8b6884f7d52bee81c75b84ac443bb77e45901acbf
blobs/f8/
blobs/f8/d0/
blobs/f8/d0/61/
blobs/f8/d0/61/f8d06173c89ea48fb3c5214a7f16c3fb2c5964732602dcd230d535984d23e206
images/1b/
images/1b/69/
images/1b/69/00/
images/1b/69/00/1b6900eff30f37b6d012240f63aa77a22e20934e7f6ebf38e25310552dc08378
sent 470,030,344 bytes received 378,961 bytes 30,348,987.42 bytes/sec
total size is 2,625,437,505 speedup is 5.58
68cfe43a22
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build gh-actions-runner-overlay.nix
/nix/store/wcihc56rzaqhbvqj0amzza8qk6ss69sv-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/33/24/52/3324528e034d27c28f4b58b734aab3e0b041a1c57c044bcef1a3c552ff88665a
blobs/4f/
blobs/4f/5d/
blobs/4f/5d/5f/
blobs/4f/5d/5f/4f5d5fb9780430b4fa4b8747c74af7d60f8a4e1f5accb3cd9871d66bf674b8ca
blobs/50/16/df/5016df56e359098cb3c6e44bee77ee390c71e855908e4b0a528cbf4ba5d37f4f
images/5f/
images/5f/4b/
images/5f/4b/61/
images/5f/4b/61/5f4b61324c27472b5354cd11229a0936320148cd6e852fbf05e1b7ff5b4598e6
sent 469,637,168 bytes received 379,189 bytes 30,323,635.94 bytes/sec
total size is 2,628,321,089 speedup is 5.59
914501ec25
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
914501ec25617613d8bc4d5ca034438e3030acf3
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build gh-actions-runner-overlay.nix
/nix/store/i0mqkn0ygp5zn7d1fd10h0z5msqav7vf-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/1d/
blobs/1d/67/
blobs/1d/67/24/
blobs/1d/67/24/1d6724e19dee478cc8b6b6e09cd8d3ba415818aac605acf4a7679159f246dcbf
blobs/44/
blobs/44/a2/
blobs/44/a2/5a/
blobs/44/a2/5a/44a25acaf1e384ffd6926d613cca854563bc62ad6515e1645ac4151f51c55054
blobs/55/
blobs/55/57/
blobs/55/57/dc/
blobs/55/57/dc/5557dc4e01ee4e2b4698931332b38a754c55f9da9ff48c7de8d4728fdf9683d1
images/df/
images/df/83/
images/df/83/37/
images/df/83/37/df8337148b0b3c63b400955b7ea49b202f34ecb111b61cd60c45a96076d9e31a
sent 2,570,257,715 bytes received 183 bytes 100,794,427.37 bytes/sec
total size is 2,569,629,506 speedup is 1.00
treadmill.ci Sites
Sites:
Site pton-srv0
ID | Type | Board | Host | SSH Endpoint |
---|---|---|---|---|
0679be07... | Netboot (NBD) | Nordic Semiconductor nRF52840DK | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22006 |
0af84b36... | QEMU VM | Nordic Semiconductor nRF52840DK | | sns30.cs.princeton.edu:22030 |
1bdc10a7... | QEMU VM | Nordic Semiconductor nRF52840DK | | sns30.cs.princeton.edu:22026 |
25b97cf7... | QEMU VM | Nordic Semiconductor nRF52840DK | | sns30.cs.princeton.edu:22034 |
524aa422... | Netboot (NBD) | Nordic Semiconductor nRF52840DK | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22002 |
56f98833... | Netboot (NBD) | STMicroelectronics NUCLEO-F429ZI | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22014 |
64e5e94d... | Netboot (NBD) | Digilent Arty-A7 35T | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22018 |
8723bd6d... | Netboot (NBD) | Nordic Semiconductor nRF52840DK | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22010 |
8ff22e8e... | Netboot (NBD) | Nordic Semiconductor nRF52840DK | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22022 |
fb1384d5... | Netboot (NBD) | Nordic Semiconductor nRF52840DK Cluster (4 Boards) | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22038 |
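Each SSH endpoint in the table is a `host:port` pair. As a small illustration, the Python sketch below splits such an endpoint and builds the corresponding `ssh` argument list. The helper name `ssh_command` is hypothetical, and the login user name is deliberately left out because this chapter does not specify it.

```python
from typing import List


def ssh_command(endpoint: str) -> List[str]:
    """Build an ssh argument list for a host:port endpoint from the table."""
    host, separator, port = endpoint.rpartition(":")
    if not separator or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {endpoint!r}")
    # ssh takes a non-default port through its -p option.
    return ["ssh", "-p", port, host]


print(" ".join(ssh_command("sns30.cs.princeton.edu:22006")))
```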
Operator's Guide
Admin Snippets
Create a new Switchboard PostgreSQL Database
CREATE DATABASE "treadmill_switchboard" WITH OWNER "treadmill_switchboard" ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8';
Create a new user with admin privileges
This will prompt for username and email, automatically generate a password and dump an SQL transaction to insert the user & privilege assignment into the database:
nix-shell -p 'python3.withPackages (pypkgs: with pypkgs; [ argon2-cffi ])' --run 'python3 -c "import uuid; import secrets; import argon2; name = input(\"Name: \"); email = input(\"Email: \"); password = secrets.token_urlsafe(16); hashed = argon2.PasswordHasher().hash(password); print(\"Password:\", password); id = uuid.uuid4(); print(\"\n\nvvvvv CUT HERE vvvvv\n\nbegin;\"); print(f\"INSERT INTO tml_switchboard.users (user_id, name, email, password_hash, user_type, locked) VALUES ('"'"'{id}'"'"', '"'"'{name}'"'"', '"'"'{email}'"'"', '"'"'{hashed}'"'"', '"'"'system'"'"', false);\"); print(f\"INSERT INTO tml_switchboard.user_privileges (user_id, permission) VALUES ('"'"'{id}'"'"', '"'"'admin'"'"');\"); print(\"commit;\n\n^^^^^ CUT HERE ^^^^^\")"'
Example output:
Name: testificate
Email: foo@example.org
Password: V99gZIffbREGBCGLrfB54A
vvvvv CUT HERE vvvvv
begin;
INSERT INTO tml_switchboard.users (user_id, name, email, password_hash, user_type, locked) VALUES ('e1246bc8-c3b6-4ad7-9d13-a15a2b726a63', 'testificate', 'foo@example.org', '$argon2id$v=19$m=65536,t=3,p=4$Ih9TJgPYrJQFowXzS24Vgw$aGomGlTN1tugKS7HicqtaSBoQzfKVMkU/EOqBA8q1Dw', 'system', false);
INSERT INTO tml_switchboard.user_privileges (user_id, permission) VALUES ('e1246bc8-c3b6-4ad7-9d13-a15a2b726a63', 'admin');
commit;
^^^^^ CUT HERE ^^^^^
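The one-liner above is hard to read and pastes the entered name and email into the SQL text unescaped. The following Python sketch shows the same steps in a more readable form. It is an illustration rather than the project's tooling: the function names `sql_quote` and `admin_user_sql` are hypothetical, the password hasher is passed in as a callable, and quoting by doubling single quotes assumes PostgreSQL's default `standard_conforming_strings` setting.

```python
import secrets
import uuid


def sql_quote(value: str) -> str:
    """Render a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def admin_user_sql(name, email, hash_password, user_id=None, password=None):
    """Return (password, sql) for creating a user with the admin privilege.

    `hash_password` maps the plain-text password to its hash, for example
    argon2.PasswordHasher().hash from the argon2-cffi package.
    """
    user_id = str(user_id or uuid.uuid4())
    password = password or secrets.token_urlsafe(16)
    hashed = hash_password(password)
    statements = [
        "begin;",
        "INSERT INTO tml_switchboard.users "
        "(user_id, name, email, password_hash, user_type, locked) VALUES "
        f"({sql_quote(user_id)}, {sql_quote(name)}, {sql_quote(email)}, "
        f"{sql_quote(hashed)}, 'system', false);",
        "INSERT INTO tml_switchboard.user_privileges (user_id, permission) VALUES "
        f"({sql_quote(user_id)}, 'admin');",
        "commit;",
    ]
    return password, "\n".join(statements)


# Placeholder hasher for demonstration; substitute argon2.PasswordHasher().hash
# to obtain hashes in the same format as the one-liner above.
password, sql = admin_user_sql("testificate", "foo@example.org", lambda pw: "<hash>")
print("Password:", password)
print(sql)
```

Passing the hasher as an argument keeps the sketch runnable without third-party packages while leaving the choice of hash function to the caller.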
Make deployment configuration changes on the supervisor server & push locally
Assuming the Treadmill deployments repo is cloned at `/var/state/treadmill-deployments` on machine `tockci-pton-srv0`, we can make local edits to this repository on that machine and test them immediately:
[root@tockci-pton-srv0:/var/state/treadmill-deployments]# echo "hello world" > foo
[root@tockci-pton-srv0:/var/state/treadmill-deployments]# nixos-rebuild test # test the changes
Now, assuming that everything works, we want to commit these changes back to the deployments repository upstream, without giving the machine push access. For this, create a commit on the remote machine. We avoid persistently setting a Git committer name or email, as the machine may be shared amongst multiple admins:
[root@tockci-pton-srv0:/var/state/treadmill-deployments]# git \
-c user.name="Testificate" \
-c user.email="testificate@example.org" \
commit -m "Important changes"
[main 161743c] Important changes
1 file changed, 1 insertion(+)
create mode 100644 foo
Now, on your local machine, in the deployments repository, we can fetch this commit without setting up a git remote like so:
testificate@laptop treadmill-tb/deployments (main)> git fetch root@tockci-pton-srv0:/var/state/treadmill-deployments
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 266 bytes | 266.00 KiB/s, done.
From tockci-pton-srv0:/var/state/treadmill-deployments
* branch HEAD -> FETCH_HEAD
We can apply these fetched changes onto our local branch like so:
-
In case the changes apply cleanly:
testificate@laptop treadmill-tb/deployments (main)> git merge --ff-only FETCH_HEAD
Updating a0c7fd6..161743c
Fast-forward
foo | 1 +
1 file changed, 1 insertion(+)
create mode 100644 foo
-
In case the refs have diverged:
testificate@laptop treadmill-tb/deployments (main)> git rebase FETCH_HEAD
Successfully rebased and updated refs/heads/main.
testificate@laptop treadmill-tb/deployments (main)> git rebase origin/main
Successfully rebased and updated refs/heads/main.
In this case, the first rebase puts all the divergent commits on top of what we've fetched from the Treadmill supervisor machine, and the second inverts this: the machine commits will be applied on top of the changes in our push remote. Replace `origin/main` with your target branch as appropriate.
Push the changes to the upstream remote:
testificate@laptop treadmill-tb/deployments (main)> git push
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 16 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (7/7), 815 bytes | 815.00 KiB/s, done.
Total 7 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 1 local object.
To github.com:treadmill-tb/deployments.git
a0c7fd6..161743c main -> main
And finally, fetch the new history back onto the Treadmill supervisor machine:
[root@tockci-pton-srv0:/var/state/treadmill-deployments]# git pull --rebase
From https://github.com/treadmill-tb/deployments
a0c7fd6..161743c main -> origin/main
Already up to date.
This last step syncs the (rebased) history back onto the Treadmill supervisor machine.
Internals
Job Lifecycle
Treadmill jobs represent units of work schedulable on a supervisor. Each job is eligible to run on a set of supervisors, limited by a set of tag filter expressions governed by the job request, and the permissions of the user who scheduled the job.
From the point of their creation up to their successful completion or failure, jobs go through a set of state changes. A job's state is composed of two components: its execution state, and its exit status.
A job's execution state describes the current state of the schedulable unit of work and is controlled by the Switchboard (e.g., by assigning a job to a particular supervisor) and the Supervisor (e.g., by starting or terminating a virtual machine). A job does not necessarily pass through every defined execution state; states may be arbitrarily skipped. All jobs must eventually end up in the `Terminated` state. Only some transitions between execution states are legal. When an illegal transition towards another execution state is attempted, the Switchboard may ignore this transition or attempt to terminate the given job. A job that has reached the `Terminated` state must not transition to any other state; the `Terminated` execution state is final.
The exit status describes the user-visible outcome that the job assumes once it reaches the `Terminated` execution state. It is controlled by the Supervisor and, for some statuses, by the Switchboard. A job's exit status may be set multiple times, but must not change once the job's execution state reaches `Terminated`. The exit status communicates whether the Treadmill system was able to successfully schedule the job, whether any Treadmill system-internal errors prevented its successful execution, and whether the user-defined job workload reported a success or error result. Only some transitions between exit statuses are legal. When an illegal transition towards another exit status is attempted, the previous exit status remains valid.
We describe these two components as Rust enums below.
Execution State
enum ExecutionState {
    /// A job object has been created, but it has not been assigned to a
    /// supervisor yet.
    ///
    /// This is the starting state for newly created jobs.
    Queued,

    /// A job object has been assigned to a particular supervisor.
    Scheduled,

    /// A job is scheduled on a particular supervisor and is starting or
    /// restarting.
    ///
    /// An `Initializing` job may itself report different sub-states which
    /// indicate progress while starting the job. These are for informational
    /// purposes only. Not all restarts of a job's host will re-enter the
    /// `Initializing` state.
    Initializing,

    /// The job is fully started and ready to execute user-defined workloads.
    ///
    /// A `Ready` job may itself report different sub-states which indicate
    /// progress or certain events, such as a soft-reboot of the job's host.
    /// These are for informational purposes only.
    Ready,

    /// The job has been requested to terminate.
    ///
    /// A `Terminating` job may itself report different sub-states which
    /// indicate progress of requesting a host shutdown, deallocating of
    /// resources, and other events.
    Terminating,

    /// The job has been terminated.
    ///
    /// This state is final. The job must not transition into any other
    /// execution states, and its exit status must not change.
    Terminated,
}
A transition into `Queued` may only be performed by the Switchboard. A transition into the `Initializing`, `Ready`, and `Terminating` states may only be performed by a Supervisor. The `Terminated` state can be reached by either
- an explicit state transition initiated by a Supervisor, or
- the Switchboard, when it observes that a Supervisor no longer reports executing a job that was once `Scheduled` on it. In this case, the exit status shall be set to `SupervisorDroppedJob`. This may happen in the case of Supervisor failures or restarts.
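
The rules above about which component may initiate a transition can be sketched as a small permission check. This is only an illustration, not the Switchboard's actual code: the `Actor` enum and the `may_enter` function are hypothetical names. The text above does not say who performs the transition into `Scheduled`; the sketch assumes it is the Switchboard, since the Switchboard is the component that assigns jobs to supervisors.

```rust
/// Execution states, mirroring the `ExecutionState` enum of this chapter.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExecutionState {
    Queued,
    Scheduled,
    Initializing,
    Ready,
    Terminating,
    Terminated,
}

/// Hypothetical: the component requesting an execution state transition.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Actor {
    Switchboard,
    Supervisor,
}

/// Returns whether `actor` may move a job into the `target` state.
fn may_enter(actor: Actor, target: ExecutionState) -> bool {
    match target {
        ExecutionState::Queued => actor == Actor::Switchboard,
        // Assumption: assigning a job to a supervisor is done by the
        // Switchboard. The rules above do not state this explicitly.
        ExecutionState::Scheduled => actor == Actor::Switchboard,
        ExecutionState::Initializing
        | ExecutionState::Ready
        | ExecutionState::Terminating => actor == Actor::Supervisor,
        // Reached either through an explicit Supervisor transition, or
        // through the Switchboard when a Supervisor dropped the job.
        ExecutionState::Terminated => true,
    }
}
```

For example, `may_enter(Actor::Supervisor, ExecutionState::Queued)` is `false`, while both actors may move a job into `Terminated`.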
Valid Transitions:
From ↓ / To → | Queued | Scheduled | Initializing | Ready | Terminating | Terminated |
---|---|---|---|---|---|---|
Queued | - | ✔ | ✔ | ✔ | ✔ | ✔ |
Scheduled | ✘ | - | ✔ | ✔ | ✔ | ✔ |
Initializing | ✘ | ✘ | - | ✔ | ✔ | ✔ |
Ready | ✘ | ✘ | ✔ | - | ✔ | ✔ |
Terminating | ✘ | ✘ | ✘ | ✘ | - | ✔ |
Terminated | ✘ | ✘ | ✘ | ✘ | ✘ | - |
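
The table above can be condensed into a short predicate: apart from the `Ready` to `Initializing` restart case, every legal transition moves strictly forward in the declaration order of the enum. The following is a minimal sketch under that reading; the function name `is_valid_transition` is hypothetical, and a transition from a state to itself (the `-` cells) is treated as not being a transition at all.

```rust
/// Execution states in lifecycle order, mirroring the enum of this chapter.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExecutionState {
    Queued,
    Scheduled,
    Initializing,
    Ready,
    Terminating,
    Terminated,
}

/// Returns whether the table marks `from` -> `to` as a valid transition.
fn is_valid_transition(from: ExecutionState, to: ExecutionState) -> bool {
    match (from, to) {
        // The only backwards edge: a `Ready` job may restart.
        (ExecutionState::Ready, ExecutionState::Initializing) => true,
        // Everything else must move strictly forward. The casts yield the
        // variants' positions in declaration order (0 through 5).
        _ => (to as u8) > (from as u8),
    }
}
```

Encoding the order through the enum's discriminants keeps the check compact, but it silently depends on the declaration order. An explicit `match` over all pairs would be more verbose and more robust against reordering.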
Exit Status
enum ExitStatus {
    /// There are no supervisors registered that this job can be scheduled on,
    /// considering the job's tag filter expression and the scheduling user's
    /// permissions.
    ///
    /// Jobs may enter this state either immediately, at the time of scheduling
    /// a job, or when no eligible supervisor is found within a timeout.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    SupervisorMatchError,

    /// There were eligible supervisors registered with the Switchboard, but
    /// the job could not be scheduled on one of them within a given timeout.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    QueueTimeout,

    /// An internal error occurred while scheduling or running this job on the
    /// supervisor. This state may optionally contain a message that contains
    /// further information on the error.
    ///
    /// This exit status may be set by both the Switchboard (e.g., when there is
    /// an error communicating with the Supervisor), or by the Supervisor.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    InternalSupervisorError,

    /// The Supervisor reports that the host failed to start.
    ///
    /// This may be due to an error in fetching the requested image, a resource
    /// that vanished (e.g., when trying to continue a previous job), failure
    /// to allocate sufficient resources, etc.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    SupervisorHostStartError,

    /// The job was canceled by a user.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    JobCanceled,

    /// The host itself reports that the user-defined workload executed
    /// successfully.
    ///
    /// This status may be reported through the Puppet process executing within
    /// the host, and may optionally contain additional user-supplied
    /// information.
    WorkloadFinishedSuccess,

    /// The host itself reports that the user-defined workload failed with an
    /// error.
    ///
    /// This status may be reported through the Puppet process executing within
    /// the host, and may optionally contain additional user-supplied
    /// information.
    WorkloadFinishedError,

    /// The host itself reports that the user-defined workload terminated,
    /// while indicating neither success nor failure.
    ///
    /// This status may be reported through the Puppet process executing within
    /// the host, and may optionally contain additional user-supplied
    /// information.
    WorkloadFinishedUnknown,

    /// The job vanished from its supervisor, without reaching the
    /// `Terminated` execution state first.
    ///
    /// This may be due to a supervisor crash or restart. This exit status
    /// can only be set by the Switchboard.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    SupervisorDroppedJob,

    /// The job timed out while dispatched.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    JobTimeout,
}
Valid Transitions:
From ↓ / To → | SupervisorMatchError | QueueTimeout | InternalSupervisorError | SupervisorHostStartError | JobCanceled | WorkloadFinished* | SupervisorDroppedJob | JobTimeout |
---|---|---|---|---|---|---|---|---|
SupervisorMatchError | - | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
QueueTimeout | ✘ | - | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
InternalSupervisorError | ✘ | ✘ | - | ✘ | ✘ | ✘ | ✘ | ✘ |
SupervisorHostStartError | ✘ | ✘ | ✘ | - | ✘ | ✘ | ✘ | ✘ |
JobCanceled | ✘ | ✘ | ✘ | ✘ | - | ✘ | ✘ | ✘ |
WorkloadFinished* | ✘ | ✘ | ✔ | ✔ | ✔ | - | ✔ | ✔ |
SupervisorDroppedJob | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | - | ✘ |
JobTimeout | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | - |
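
Read row by row, the table above says that only the `WorkloadFinished*` statuses can be overridden, and only by a subset of the final statuses. The sketch below encodes this, together with the rule that an illegal transition leaves the previous exit status in place. The names `can_override` and `apply_exit_status` are hypothetical. Two points are assumptions of this sketch, because the table does not cover them: a job without any exit status yet (modeled as `None`) accepts the first reported status, and one `WorkloadFinished*` status may not replace another.

```rust
/// Exit statuses, mirroring the `ExitStatus` enum of this chapter.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExitStatus {
    SupervisorMatchError,
    QueueTimeout,
    InternalSupervisorError,
    SupervisorHostStartError,
    JobCanceled,
    WorkloadFinishedSuccess,
    WorkloadFinishedError,
    WorkloadFinishedUnknown,
    SupervisorDroppedJob,
    JobTimeout,
}

/// Returns whether the table allows replacing `from` with `to`.
fn can_override(from: ExitStatus, to: ExitStatus) -> bool {
    use ExitStatus::*;
    // Only the non-final `WorkloadFinished*` statuses can be overridden.
    let from_is_workload = matches!(
        from,
        WorkloadFinishedSuccess | WorkloadFinishedError | WorkloadFinishedUnknown
    );
    // Assumption: replacing one `WorkloadFinished*` status with another
    // is rejected, as the table leaves that cell unspecified.
    from_is_workload
        && matches!(
            to,
            InternalSupervisorError
                | SupervisorHostStartError
                | JobCanceled
                | SupervisorDroppedJob
                | JobTimeout
        )
}

/// Applies a newly reported status. On an illegal transition, the
/// previous exit status remains valid and is returned unchanged.
fn apply_exit_status(current: Option<ExitStatus>, reported: ExitStatus) -> Option<ExitStatus> {
    match current {
        // Assumption: the first reported status is always accepted here.
        None => Some(reported),
        Some(prev) if can_override(prev, reported) => Some(reported),
        Some(prev) => Some(prev),
    }
}
```

With this sketch, a job that reported `WorkloadFinishedSuccess` and then times out ends up with `JobTimeout`, whereas a `JobCanceled` job keeps that status no matter what is reported afterwards.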
We further impose the following restrictions on transitions of exit statuses depending on the current execution state (note the flipped rows & columns for readability):
To Exit Status ↓ / In Execution State → | Queued | Scheduled | Initializing | Ready | Terminating | Terminated |
---|---|---|---|---|---|---|
SupervisorMatchError | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ |
QueueTimeout | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ |
InternalSupervisorError | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
SupervisorHostStartError | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
JobCanceled | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
WorkloadFinished* | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
SupervisorDroppedJob | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
JobTimeout | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
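
The restrictions in the table above fall into two groups: the scheduling failures (`SupervisorMatchError`, `QueueTimeout`) may only be set while a job is `Queued`, and every other exit status may only be set between `Scheduled` and `Terminating`. A minimal sketch of this check follows; the function name `may_set_exit_status` is hypothetical and not taken from the Treadmill code base.

```rust
/// Execution states, mirroring the `ExecutionState` enum of this chapter.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExecutionState {
    Queued,
    Scheduled,
    Initializing,
    Ready,
    Terminating,
    Terminated,
}

/// Exit statuses, mirroring the `ExitStatus` enum of this chapter.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExitStatus {
    SupervisorMatchError,
    QueueTimeout,
    InternalSupervisorError,
    SupervisorHostStartError,
    JobCanceled,
    WorkloadFinishedSuccess,
    WorkloadFinishedError,
    WorkloadFinishedUnknown,
    SupervisorDroppedJob,
    JobTimeout,
}

/// Returns whether `status` may be set while the job is in `state`.
fn may_set_exit_status(state: ExecutionState, status: ExitStatus) -> bool {
    match status {
        // Scheduling failures can only occur before a job is assigned.
        ExitStatus::SupervisorMatchError | ExitStatus::QueueTimeout => {
            state == ExecutionState::Queued
        }
        // All other statuses require an assigned, not yet terminated job.
        _ => matches!(
            state,
            ExecutionState::Scheduled
                | ExecutionState::Initializing
                | ExecutionState::Ready
                | ExecutionState::Terminating
        ),
    }
}
```

Note that no status may be set in the `Terminated` state, which matches the rule that the exit status must not change once a job is terminated.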