Welcome to the Treadmill Testbed Book!

This book is the primary documentation of the Treadmill Testbed project, software and hardware. It is divided up into the following main chapters:

  1. The User Guide: start here if you are new to Treadmill and you want to run interactive development sessions or use it to run some automated Continuous Integration workloads.

  2. Documentation of the public treadmill.ci deployment: this chapter documents individual deployments (sites) that are connected to the main Treadmill Switchboard instance hosted at swx.treadmill.ci. You can find the available hardware resources and documentation of the individual setups there.

  3. The Operator's Guide: This chapter contains guides and other useful information for operators of the Treadmill switchboard or site deployments.

  4. Internals: Documentation of the components that make up Treadmill, such as the switchboard, puppet, or supervisor crates.

Terminology

Treadmill is a distributed system composed of multiple components. While individual deployments may differ and feature different sets of components, this guide establishes the following general terminology:

The Treadmill testbed / the Treadmill system

Describes the overall Treadmill system, including all deployments and central components. Excludes external actors, such as users or platforms that interact with the Treadmill system.

Device Under Test / DUT

A chip, board, or other device that can be programmed, debugged, interacted with, or otherwise controlled by users or the Treadmill testbed. Typically, DUTs will be development boards that feature microcontrollers or SoCs, like a Nordic Semiconductor nRF52840DK board.

Site

A collection of DUTs and other shared, non-global infrastructure. A site has one or more DUTs, companion software and hardware per DUT, and also includes central software or hardware components shared among DUTs.

Deployment

A physical deployment of a part of a Treadmill system, hosting one or more DUTs. A deployment may consist of multiple Treadmill sites. A deployment is not a concept used in the Treadmill system itself, but used in these and other documents to denote parts of the Treadmill system that are in physically distinct locations and/or under differing administrative control.

Switchboard

The single, centralized controller of a Treadmill testbed. A single Switchboard instance coordinates and orchestrates workloads across multiple Treadmill sites. It also implements central authentication and authorization mechanisms and is the authoritative data source for many of these subsystems.

Supervisor

A supervisor is responsible for managing interactions with a DUT and runs on site-local infrastructure. Examples are QEMU supervisors that manage virtual machines connected to a DUT, or Netboot Supervisors that exercise control over other hosts connected to a DUT. Supervisors connect to and are managed by the Switchboard.

Host

A device or environment exposed to users of the Treadmill system, connected to a DUT. Hosts can be virtual machines, dedicated hardware hosts, containers, or other environments. Hosts run software that is able to interact with and control a DUT. Each Host is managed by a Supervisor.

User Guide

Integrating Treadmill with GitHub Actions

By integrating Treadmill into your GitHub Actions workflows you can automatically test your code on real hardware while retaining the convenience of GitHub Actions' reusable action snippets, declarative configuration, and integration with GitHub repositories. This guide describes how to

  • wire up your repository such that it can automatically launch Treadmill jobs for some Actions CI workflow,
  • attach a GitHub Actions runner on the Treadmill host to your repository, and
  • define a workflow that launches a set of Treadmill jobs to test different parts of your code on a set of different hardware boards.

Just-in-time GitHub Actions Runners

GitHub Actions jobs usually execute on hosted runners provided by GitHub. These runners can be selected by using the appropriate runs-on selector, for instance by specifying a label such as ubuntu-latest. However, these runners do not have access to hardware targets.

In contrast, a Treadmill job provides access to an ephemeral host environment, running a user-supplied image, which itself has access to a hardware target. This host can be used for interactive sessions or, when supplied with an appropriate image, automatically execute some software on startup.

We can use this abstraction to run a GitHub Actions self-hosted runner. This software executes regular GitHub Actions workflows on a user-supplied compute resource instead of a GitHub-provided hosted runner. However, these runners typically execute workflows directly on the machine without strong isolation (no container or VM) and are long-lived: once registered, they can run multiple workflows until they are deregistered. We must therefore ensure two things: that workflows cannot accidentally or maliciously influence future workflow runs, and that each GitHub workflow runs on exactly the Treadmill host it is supposed to execute on.

Treadmill's ephemeral job environments can alleviate the first concern: Treadmill jobs always run from a clean, ephemeral image and thus provide a reasonable degree of isolation between jobs. We can ensure the second property using GitHub's concept of just-in-time runners: these Actions runners can be registered using a single-use token and are able to execute exactly one job. This ensures that each GitHub actions workflow runs in a fresh Treadmill job environment, and this job is only able to execute exactly one workflow.

Automatically Launching a Treadmill Job & GitHub Actions Runner

Treadmill does not have a native GitHub actions integration. Instead, we can use a small Actions workflow job running on a GitHub hosted runner (called test-prepare) to prepare a new Treadmill job and register it as a self-hosted runner, to then launch actual test workloads (called test-execute) on this runner in a second step.

We start by defining a workflow called treadmill-ci in our repository, under .github/workflows/treadmill-ci.yml. Its test-prepare job will first compile the Treadmill CLI client and then proceed to log into the Treadmill testbed.

name: treadmill-ci

# You can customize these triggers to your preference:
on:
  pull_request: # Run CI for PRs on any branch
  merge_group: # Run CI for the GitHub merge queue

jobs:
  test-prepare:
    # Run this first step on GitHub's hosted infrastructure
    runs-on: ubuntu-latest

    # Expose a few values to the test-execute job:
    outputs:
      runner-id: ${{ steps.gh-actions-jit-runner-config.outputs.runner-id }}
      tml-job-id: ${{ steps.treadmill-launch-job.outputs.tml-job-id }}

    steps:
      # Required to compile the Treadmill CLI client:
      - uses: actions-rust-lang/setup-rust-toolchain@v1

      # Fetch the source of the Treadmill CLI client. We do not yet provide
      # pre-compiled binaries:
      - name: Checkout Treadmill repository
        uses: actions/checkout@v4
        with:
          repository: treadmill-tb/treadmill
          path: treadmill

      # This greatly speeds up future workflow runs:
      - name: Cache Treadmill CLI compilation artifacts
        id: cache-tml-cli
        uses: actions/cache@v4
        with:
          path: treadmill/target
          key: ${{ runner.os }}-tml-cli

      - name: Compile the Treadmill CLI binary
        run: |
          pushd treadmill
          cargo build --package tml-cli
          popd
          echo "$PWD/treadmill/target/debug" >> "$GITHUB_PATH"

We plan to provide pre-compiled CLI binaries and a GitHub workflow template that you can import in the future.

Registering a Just-in-time GitHub Actions Runner

Next, we need to create a registration token for the just-in-time GitHub Actions runner that will run on a Treadmill host. Unfortunately, GitHub does not provide an easy way to do this from within a GitHub Actions workflow. Notably, the GitHub API token that is provided to GitHub Actions workflows by default does not have the required capabilities to create new just-in-time GitHub Actions runners. Instead, we have to first create a GitHub App with the required permissions and use its API token to register the runner.

To create a new GitHub App, navigate to your Organization Settings → Developer Settings → GitHub Apps. Here, you can create a new application like the following. You can leave most fields blank, as we're not actually using any of the app's features apart from its API token:

[Screenshot: "Register a new GitHub App" form, with the "GitHub App name" set to "Treadmill GH Actions CI", the description set to "This app is used solely to create GitHub tokens with privileges to create ephemeral, just-in-time GitHub Actions runners." and the "Homepage URL" field set to "https://your-project.org"]

Disable Webhook → Active and leave the Webhook URL and Secret empty. Under Permissions, the app requires only Repository permissions → Administration to be set to Access: Read and write. Set Where can this GitHub App be installed? to Only on this account.

With the app created, you should be presented with its settings page. We do not need to generate a client secret for the app. Instead, scroll down until you see the Private keys section and click Generate a private key:

[Screenshot: "Private keys" section of the GitHub App settings, showing a large blue button labeled "Generate a private key"]

Your browser should download a .pem file after a short while:

[Screenshot: Firefox Downloads menu showing a tock-treadmill-gh-actions-ci.2024-09-20.private-key.pem file]

We need to make this private key accessible to the GitHub Actions workflow. For this, navigate to your repository's Settings (not the organization settings) → Secrets and variables → Actions. Create a new Actions variable called TREADMILL_GH_APP_CLIENT_ID and set its contents to your application's client ID (from its settings page). Create a new Actions secret called TREADMILL_GH_APP_PRIVATE_KEY and copy the contents of the downloaded .pem file into the secret:

[Screenshot: "Actions secrets / New secret" form, with the "Name" field set to TREADMILL_GH_APP_PRIVATE_KEY and the "Secret" field set to the contents of the downloaded file, starting with -----BEGIN RSA PRIVATE KEY-----]

Finally, we need to install this application into the target repository. For this, navigate to your application settings (under Organization Settings → Developer Settings → GitHub Apps → your application name → Edit), select Install App, and click Install next to your organization. You can choose to only install the app for one repository as shown below:

[Screenshot: "Install Tock Treadmill GH Actions CI" screen, with "Only select repositories" selected, and the tock/tock repository being selected as one of the repositories for which the app should be installed]

With the app ready, we can extend the GitHub actions workflow to obtain an API token that is able to create new just-in-time runners from within our workflow:

      - name: Generate a token to register new just-in-time runners
        id: generate-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ vars.TREADMILL_GH_APP_CLIENT_ID }}
          private-key: ${{ secrets.TREADMILL_GH_APP_PRIVATE_KEY }}
          owner: ${{ github.repository_owner }}

Finally, we can create a new just-in-time runner in a subsequent step:

      - name: Create GitHub just-in-time runner
        id: gh-actions-jit-runner-config
        env:
          GH_TOKEN: ${{ steps.generate-token.outputs.token }}
        run: |
          # Create a unique string that identifies this runner across all
          # workflow invocations and attempts in this repository:
          RUNNER_ID="tml-gh-actions-runner-${GITHUB_REPOSITORY_ID}-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"

          # Perform the API request to register the just-in-time runner.
          # Replace $YOUR_ORG/$YOUR_REPO with your repository's owner and name:
          RUNNER_CONFIG_JSON="$(gh api \
            -H "Accept: application/vnd.github+json" \
            -H "X-GitHub-Api-Version: 2022-11-28" \
            /repos/$YOUR_ORG/$YOUR_REPO/actions/runners/generate-jitconfig \
            -f "name=$RUNNER_ID" \
            -F "runner_group_id=1" \
            -f "labels[]=$RUNNER_ID" \
            -f "work_folder=_work")"

          # The above returns a JSON object containing a base64-encoded
          # "jit config". We need to retain this value for starting the runner.
          # Provide it to subsequent steps as an output:
          echo "jitconfig=$(echo "$RUNNER_CONFIG_JSON" | jq -r '.encoded_jit_config')" >> "$GITHUB_OUTPUT"

          # The test-execute job will need to match on a specific
          # runner label assigned to the self-hosted runner. Export our
          # runner-id here, which we've set as a label above:
          echo "runner-id=$RUNNER_ID" >> "$GITHUB_OUTPUT"
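
As a rough illustration of the two mechanisms this step relies on, the following sketch builds the runner name from GitHub's default GITHUB_* environment variables and records a step output by appending a name=value line to the file named by $GITHUB_OUTPUT. The helper names make_runner_id and set_step_output, as well as the fallback values used when running outside of a workflow, are assumptions made for this example and not part of Treadmill or GitHub Actions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a runner name from the repository ID, run ID and run attempt.
# The fallbacks (0, 0, 1) are made-up values for local experiments only.
make_runner_id() {
  printf 'tml-gh-actions-runner-%s-%s-%s\n' \
    "${GITHUB_REPOSITORY_ID:-0}" "${GITHUB_RUN_ID:-0}" "${GITHUB_RUN_ATTEMPT:-1}"
}

# Record a step output by appending "name=value" to the file that
# $GITHUB_OUTPUT points to (hypothetical helper for this sketch).
set_step_output() {
  local key="$1" value="$2"
  printf '%s=%s\n' "$key" "$value" >> "$GITHUB_OUTPUT"
}

# Outside of GitHub Actions, GITHUB_OUTPUT is unset: use a temporary file.
: "${GITHUB_OUTPUT:=$(mktemp)}"

set_step_output "runner-id" "$(make_runner_id)"
cat "$GITHUB_OUTPUT"
```

Simply echoing a name=value pair to standard output does not create a step output; the line has to end up in the $GITHUB_OUTPUT file for later steps and jobs to see it.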

Next, we'll pass this value into the Treadmill job's parameters and start a job that launches a GitHub actions runner on boot.

Launching a Treadmill Job

With the runner registration token generated, we're ready to launch the Treadmill job that will ultimately host this runner. For this, we add another workflow step as follows:

      - name: Launch Treadmill job
        id: treadmill-launch-job
        env:
          TML_API_TOKEN: ${{ secrets.TREADMILL_API_TOKEN }}
          # A Treadmill GitHub Actions image which includes the self-hosted runner:
          IMAGE_ID: "d407b09b9f56c666d0d3350890e364ba16aad08b484f4ca1de19d42569cc79b1"
          DUT_BOARD: "nrf52840dk"
        run: |
          echo "Enqueueing Treadmill job:"

          # Manually create a JSON object that specifies the job parameters and
          # contains the registration token for the GitHub Actions runner.
          #
          # The Treadmill GitHub Actions images will search for this parameter
          # and use it to configure their included self-hosted runner:
          TML_JOB_PARAMETERS="{\
            \"gh-actions-runner-encoded-jit-config\": {\
              \"secret\": true, \"value\": \"${{ steps.gh-actions-jit-runner-config.outputs.jitconfig }}\"\
            }
          }"

          # Finally, run the `job enqueue` command. You can optionally specify
          # SSH keys for interactive debugging:
          TML_JOB_ID_JSON="$(tml job enqueue \
            "$IMAGE_ID" \
            --tag-config "board:$DUT_BOARD" \
            --parameters "$TML_JOB_PARAMETERS" \
          )"

          TML_JOB_ID="$(echo "$TML_JOB_ID_JSON" | jq -r .job_id)"
          echo "Enqueued Treadmill job with ID $TML_JOB_ID"

          # Pass the job IDs and other configuration data into the outputs of
          # this step, such that we can run test-execute job instances for each
          # Treadmill job we've started:
          echo "tml-job-id=\"$TML_JOB_ID\"" >> "$GITHUB_OUTPUT"

We provide another repository secret, called TREADMILL_API_TOKEN, to this step through the TML_API_TOKEN environment variable. The tml CLI client will detect this environment variable and use the API token to authenticate against the Switchboard API.

As of now, the only way to create a long-lived API token useful for such workflows is by manually editing the database. We plan to create an API for managing API tokens in the future.

This step takes in the jitconfig output from the previous step and enqueues a new Treadmill job that is parameterized over this value. It is important to set the Treadmill image ID to an image which is configured to run a GitHub Actions self-hosted runner on bootup and performs the necessary configuration based on the gh-actions-runner-encoded-jit-config parameter.
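
The parameters object can also be assembled by a small helper instead of hand-escaped quotes. The sketch below is one possible way to do this, assuming the encoded jit config only contains characters from the standard base64 alphabet, so that it can be placed in a JSON string without escaping. The function name build_job_parameters is hypothetical; only the parameter name and the "secret"/"value" fields are taken from the workflow step above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the job-parameters JSON object for a given encoded jit config.
build_job_parameters() {
  local jitconfig="$1"

  # Only accept characters from the standard base64 alphabet, so that the
  # value can be embedded in a JSON string without any escaping.
  if [[ ! "$jitconfig" =~ ^[A-Za-z0-9+/=]+$ ]]; then
    echo "error: jit config is not plain base64" >&2
    return 1
  fi

  printf '{"gh-actions-runner-encoded-jit-config": {"secret": true, "value": "%s"}}\n' \
    "$jitconfig"
}

# Example invocation with a dummy value (the base64 encoding of "hello"):
build_job_parameters "aGVsbG8="
```

In a workflow, the result could be captured with TML_JOB_PARAMETERS="$(build_job_parameters "$JITCONFIG")", where JITCONFIG is an environment variable you would populate from the previous step's jitconfig output.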

After this step is executed, the Treadmill testbed will launch this job on an appropriate host (selected by tag board:$DUT_BOARD) and the host will register a new GitHub actions runner. We now define the part of the Actions workflow file that runs on the Treadmill host itself.

Running a GitHub Actions Job on the Treadmill Host

To run a job on our newly started host, we add another job definition to our workflow file. Importantly, this second test-execute job has a dependency on the first test-prepare job. It also selects the unique RUNNER_ID that we've generated above as its runs-on target. This ensures that this job is only eligible to run on the Treadmill host that we've requested for it. We can then proceed to run regular steps, as we would in any other GitHub Actions workflow file:

  test-execute:
    needs: test-prepare
    runs-on: ${{ needs.test-prepare.outputs.runner-id }}

    steps:
      - name: Print Treadmill Job Context and Debug Information
        run: |
          echo "Treadmill job id: ${{ needs.test-prepare.outputs.tml-job-id }}"
          echo "GitHub Actions Runner ID: ${{ needs.test-prepare.outputs.runner-id }}"
          echo "Network configuration:"
          ip address
          echo "Attached USB devices:"
          lsusb
          echo "Parameters:"
          ls /run/tml/parameters

      - uses: actions/checkout@v4

      - uses: actions-rust-lang/setup-rust-toolchain@v1

      - name: Build the Tock kernel
        run: |
          pushd boards/nordic/nrf52840dk
          unset RUSTFLAGS
          make
          popd

Deployments

treadmill.ci Public Images

For now, we manually update the index of all files on the image server:

[root@sns31:/var/www/a.images.treadmill.ci]# find -type f ! -path './image.txt' ! -path './all.txt' > all.txt

To create a mirror of this image store based on this index, you can use the following bash snippet, which will create the Treadmill image store structure (images/ and blobs/) in your current directory and download all files not already present:

$ wget -r -p -E -K -np -nH -nc --content-disposition --trust-server-names --no-http-keep-alive -i <(wget -O- -o/dev/null https://a.images.treadmill.ci/all.txt | sed 's|^\./|https://a.images.treadmill.ci/|')
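
To make the rewriting step in the command above more concrete, the following sketch shows how a single line of all.txt is turned into a download URL, and how the store-relative path of an image appears to be derived from its ID. The directory layout (three two-character prefix directories followed by the full ID) is inferred from the rsync logs in this chapter rather than from a specification, and the function names are made up for this example:

```shell
#!/usr/bin/env bash
set -euo pipefail

IMAGE_STORE_URL="https://a.images.treadmill.ci"

# Turn one line of all.txt (for example "./blobs/6c/82/47/<id>") into a URL,
# using the same sed expression as the mirror command.
index_line_to_url() {
  sed "s|^\./|${IMAGE_STORE_URL}/|" <<< "$1"
}

# Derive the store-relative path of an image from its 64-character hex ID.
# Assumption: images live under images/<aa>/<bb>/<cc>/<full id>.
image_store_path() {
  local id="$1"
  printf 'images/%s/%s/%s/%s\n' "${id:0:2}" "${id:2:2}" "${id:4:2}" "$id"
}

# Example: print the presumed URL of one of the images listed in this chapter.
id="616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05"
index_line_to_url "./$(image_store_path "$id")"
```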

vm-ubuntu-2204-amd64-uefi

Versions:

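Build Date | git Revision | Image ID
---------- | ------------ | ----------------------------------------------------------------
2024-10-13 | 68cfe43a22   | 616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05
2024-09-24 | f7f6a60239   | 4864215aff5840792f3f871cb74d0e74170b199406a56422612efa715e72e1a5

Build logs: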

68cfe43a22

leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/fcssv6py0hh2p4hfd9w5h9pl5d3ysz5p-treadmill-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/6c/
blobs/6c/82/
blobs/6c/82/47/
blobs/6c/82/47/6c8247e4440a4f9a691f67643c1d2adf87d48b6c475bd7b83599851cec785164
images/61/
images/61/6a/
images/61/6a/37/
images/61/6a/37/616a372120b0afce9310a07c2e3b4c897b9cbccdfec4cf01ccbcca82c156ee05

sent 941,459,348 bytes  received 113 bytes  81,866,040.09 bytes/sec
total size is 941,229,060  speedup is 1.00

f7f6a60239

leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
f7f6a6023970684ab56515fcdedf1b5792f368f7
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/1bjwlkjbxq7nal5sbll6snh9wc0ingbv-treadmill-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/
blobs/33/
blobs/33/31/
blobs/33/31/75/
blobs/33/31/75/33317569a76291991bb8dae68a08b2369221a229192eec1ad3227d38826da281
images/
images/48/
images/48/64/
images/48/64/21/
images/48/64/21/4864215aff5840792f3f871cb74d0e74170b199406a56422612efa715e72e1a5

sent 940,869,394 bytes  received 113 bytes  89,606,619.71 bytes/sec
total size is 940,639,236  speedup is 1.00

vm-ubuntu-2204-amd64-uefi + GitHub Actions Runner

Versions:

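Build Date | git Revision | Image ID
---------- | ------------ | ----------------------------------------------------------------
2024-10-13 | 68cfe43a22   | 9ac6e2f62fec7d41d81df9a3b2fc40f5b4efa3e94055ea43a83e29dc77b791ee
2024-09-24 | f7f6a60239   | 0373bb7d728b36cb6083cfe12f27038b71972ceb90563b0037d4012df7b62bf4

Build logs: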

68cfe43a22

leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build gh-actions-overlay.nix
/nix/store/25p7sbadzw5rj7b1dz23zxacw0ri8nzr-image-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/06/
blobs/06/ff/
blobs/06/ff/9f/
blobs/06/ff/9f/06ff9fbb107733147c0ab2bd92efd4a2844b42c9ec60945d8e84de1b6194ed61
blobs/6c/82/47/6c8247e4440a4f9a691f67643c1d2adf87d48b6c475bd7b83599851cec785164
images/9a/
images/9a/c6/
images/9a/c6/e2/
images/9a/c6/e2/9ac6e2f62fec7d41d81df9a3b2fc40f5b4efa3e94055ea43a83e29dc77b791ee

sent 658,136,207 bytes  received 214,956 bytes  77,453,078.00 bytes/sec
total size is 1,599,080,209  speedup is 2.43

f7f6a60239

leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> git rev-parse HEAD
f7f6a6023970684ab56515fcdedf1b5792f368f7
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> nix-build gh-actions-overlay.nix
/nix/store/yzn9rhawqslvl8y7b55sq6n19lhlcxrx-image-store
leons@caesium ~/p/t/i/vm-ubuntu-2204-amd64-uefi (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/33/31/75/33317569a76291991bb8dae68a08b2369221a229192eec1ad3227d38826da281
blobs/9b/
blobs/9b/bc/
blobs/9b/bc/f6/
blobs/9b/bc/f6/9bbcf6d6a67886ac58b9d6cdbb87b49e1a14ebeb8b19b99279b3d73eacdf00b0
images/03/
images/03/73/
images/03/73/bb/
images/03/73/bb/0373bb7d728b36cb6083cfe12f27038b71972ceb90563b0037d4012df7b62bf4

sent 658,398,373 bytes  received 214,879 bytes  69,327,710.74 bytes/sec
total size is 1,598,752,529  speedup is 2.43

netboot-raspberrypi-nbd

Versions:

Build Date | git Revision | Image ID
---------- | ------------ | ----------------------------------------------------------------
2024-10-30 | a4d1690d1e   | a1b25b7cd0cbea2abd2ee472761180049f31c736095f81c16c65a5877e9f2c44
2024-10-15 | 6803d17a74   | 5db0bcba4ca3295c83d8cb0318651b78469b90cda9f124011c2bd15a0f1f8999
2024-10-13 | 68cfe43a22   | f0617619bfb9a459a42b70101af65ef6b8d34631955f1d46423674e9897f26fc
2024-10-12 | 914501ec25   | 453facb39f3d786a3ab3075358665fca850025e5b342487066f7a5c5482bd8ab

Build logs:

a4d1690d1e

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
a4d1690d1ef9c2e330a71237913279ab90ca545d
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/nakjliab1q6cd0l3f1v2zl3c65wghfl7-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/6a/
blobs/6a/d5/
blobs/6a/d5/57/
blobs/6a/d5/57/6ad557ac9249f743b56124e0157b01e5cf28fcd5c45222b48ced16804b17eb09
blobs/9d/
blobs/9d/30/
blobs/9d/30/51/
blobs/9d/30/51/9d30513e0dc24566abb2271f269e6509db9866d313a5e0445e3afe9029542947
images/a1/
images/a1/b2/
images/a1/b2/5b/
images/a1/b2/5b/a1b25b7cd0cbea2abd2ee472761180049f31c736095f81c16c65a5877e9f2c44

sent 2,163,215,797 bytes  received 148 bytes  105,522,729.02 bytes/sec
total size is 2,162,687,123  speedup is 1.00

6803d17a74

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
6803d17a74a4158e80fc6bc6fe44c64543ff0d15
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -I nixpkgs=https://github.com/nixos/nixpkgs/archive/release-24.05.tar.gz -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/b4vwspja2w7zp8slajn4zb6xydz6bdp8-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/50/a3/39/50a339bb4ec10902d7bae426fe216a8008fca81fa82ce9a8036ebad998320c98
blobs/e4/4b/bd/e44bbd64b70c8afea5f704e8b6884f7d52bee81c75b84ac443bb77e45901acbf
images/5d/b0/bc/5db0bcba4ca3295c83d8cb0318651b78469b90cda9f124011c2bd15a0f1f8999

sent 217,190 bytes  received 378,929 bytes  51,836.43 bytes/sec
total size is 2,155,740,307  speedup is 3,616.29

68cfe43a22

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/wc452qz6yp2fy7qdlk0sn71rbcsky45g-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/33/24/
blobs/33/24/52/
blobs/33/24/52/3324528e034d27c28f4b58b734aab3e0b041a1c57c044bcef1a3c552ff88665a
blobs/50/
blobs/50/16/
blobs/50/16/df/
blobs/50/16/df/5016df56e359098cb3c6e44bee77ee390c71e855908e4b0a528cbf4ba5d37f4f
images/f0/
images/f0/61/
images/f0/61/76/
images/f0/61/76/f0617619bfb9a459a42b70101af65ef6b8d34631955f1d46423674e9897f26fc

sent 2,159,544,890 bytes  received 145 bytes  105,343,660.24 bytes/sec
total size is 2,159,017,107  speedup is 1.00

914501ec25

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
914501ec25617613d8bc4d5ca034438e3030acf3
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -E 'with import <nixpkgs> {}; callPackage ./default.nix {}'
/nix/store/8yhb8zc7n0dj1a1y9gc1n8l9w84firk8-treadmill-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/1d/67/24/1d6724e19dee478cc8b6b6e09cd8d3ba415818aac605acf4a7679159f246dcbf
blobs/44/a2/5a/44a25acaf1e384ffd6926d613cca854563bc62ad6515e1645ac4151f51c55054
images/45/
images/45/3f/
images/45/3f/ac/
images/45/3f/ac/453facb39f3d786a3ab3075358665fca850025e5b342487066f7a5c5482bd8ab
sent 222,162 bytes  received 385,681 bytes  52,855.91 bytes/sec
total size is 2,098,687,124  speedup is 3,452.68

netboot-raspberrypi-nbd + GitHub Actions Runner

Versions:

Build Date | git Revision | Image ID
---------- | ------------ | ----------------------------------------------------------------
2024-10-30 | a4d1690d1e   | df24da6c7a03d87b1b6b55162383a9dfdf48a129b5f3e648748f0f9d11cdb470
2024-10-15 | 6803d17a74   | 1b6900eff30f37b6d012240f63aa77a22e20934e7f6ebf38e25310552dc08378
2024-10-13 | 68cfe43a22   | 5f4b61324c27472b5354cd11229a0936320148cd6e852fbf05e1b7ff5b4598e6
2024-09-24 | 914501ec25   | df8337148b0b3c63b400955b7ea49b202f34ecb111b61cd60c45a96076d9e31a

Build logs:

a4d1690d1e

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main) [1]> git rev-parse HEAD
a4d1690d1ef9c2e330a71237913279ab90ca545d
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build gh-actions-runner-overlay.nix
/nix/store/v751jk869i22ppplffkrc0c5jvaqbivg-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/30/
blobs/30/1b/
blobs/30/1b/82/
blobs/30/1b/82/301b824fdf94fe658a389fdc6cf147e0ff4c1f06c4403a74d00331f6aebb1798
blobs/6a/d5/57/6ad557ac9249f743b56124e0157b01e5cf28fcd5c45222b48ced16804b17eb09
blobs/9d/30/51/9d30513e0dc24566abb2271f269e6509db9866d313a5e0445e3afe9029542947
images/df/24/
images/df/24/da/
images/df/24/da/df24da6c7a03d87b1b6b55162383a9dfdf48a129b5f3e648748f0f9d11cdb470

sent 470,751,703 bytes  received 379,466 bytes  32,491,804.76 bytes/sec
total size is 2,633,105,217  speedup is 5.59

6803d17a74

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
6803d17a74a4158e80fc6bc6fe44c64543ff0d15
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build -I nixpkgs=https://github.com/nixos/nixpkgs/archive/release-24.05.tar.gz gh-actions-runner-overlay.nix
/nix/store/148134wsj8h3jbaz6gn7dl1igywgg48a-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/50/a3/39/50a339bb4ec10902d7bae426fe216a8008fca81fa82ce9a8036ebad998320c98
blobs/e4/4b/bd/e44bbd64b70c8afea5f704e8b6884f7d52bee81c75b84ac443bb77e45901acbf
blobs/f8/
blobs/f8/d0/
blobs/f8/d0/61/
blobs/f8/d0/61/f8d06173c89ea48fb3c5214a7f16c3fb2c5964732602dcd230d535984d23e206
images/1b/
images/1b/69/
images/1b/69/00/
images/1b/69/00/1b6900eff30f37b6d012240f63aa77a22e20934e7f6ebf38e25310552dc08378

sent 470,030,344 bytes  received 378,961 bytes  30,348,987.42 bytes/sec
total size is 2,625,437,505  speedup is 5.58

68cfe43a22

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
68cfe43a225bf83bba4fe3fe11723bda7da9c45f
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build gh-actions-runner-overlay.nix
/nix/store/wcihc56rzaqhbvqj0amzza8qk6ss69sv-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/33/24/52/3324528e034d27c28f4b58b734aab3e0b041a1c57c044bcef1a3c552ff88665a
blobs/4f/
blobs/4f/5d/
blobs/4f/5d/5f/
blobs/4f/5d/5f/4f5d5fb9780430b4fa4b8747c74af7d60f8a4e1f5accb3cd9871d66bf674b8ca
blobs/50/16/df/5016df56e359098cb3c6e44bee77ee390c71e855908e4b0a528cbf4ba5d37f4f
images/5f/
images/5f/4b/
images/5f/4b/61/
images/5f/4b/61/5f4b61324c27472b5354cd11229a0936320148cd6e852fbf05e1b7ff5b4598e6

sent 469,637,168 bytes  received 379,189 bytes  30,323,635.94 bytes/sec
total size is 2,628,321,089  speedup is 5.59

914501ec25

leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> git rev-parse HEAD
914501ec25617613d8bc4d5ca034438e3030acf3
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> nix-build gh-actions-runner-overlay.nix
/nix/store/i0mqkn0ygp5zn7d1fd10h0z5msqav7vf-image-store
leons@caesium ~/p/t/i/netboot-raspberrypi-nbd (main)> rsync -rv -L result/ leons@sns31.cs.princeton.edu:/var/www/a.images.treadmill.ci/
sending incremental file list
image.txt
blobs/1d/
blobs/1d/67/
blobs/1d/67/24/
blobs/1d/67/24/1d6724e19dee478cc8b6b6e09cd8d3ba415818aac605acf4a7679159f246dcbf
blobs/44/
blobs/44/a2/
blobs/44/a2/5a/
blobs/44/a2/5a/44a25acaf1e384ffd6926d613cca854563bc62ad6515e1645ac4151f51c55054
blobs/55/
blobs/55/57/
blobs/55/57/dc/
blobs/55/57/dc/5557dc4e01ee4e2b4698931332b38a754c55f9da9ff48c7de8d4728fdf9683d1
images/df/
images/df/83/
images/df/83/37/
images/df/83/37/df8337148b0b3c63b400955b7ea49b202f34ecb111b61cd60c45a96076d9e31a
sent 2,570,257,715 bytes  received 183 bytes  100,794,427.37 bytes/sec
total size is 2,569,629,506  speedup is 1.00

treadmill.ci Sites

Sites:

Site pton-srv0

ID          | Type          | Board                                              | Host               | SSH Endpoint
----------- | ------------- | -------------------------------------------------- | ------------------ | ----------------------------
0679be07... | Netboot (NBD) | Nordic Semiconductor nRF52840DK                    | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22006
0af84b36... | QEMU VM       | Nordic Semiconductor nRF52840DK                    |                    | sns30.cs.princeton.edu:22030
1bdc10a7... | QEMU VM       | Nordic Semiconductor nRF52840DK                    |                    | sns30.cs.princeton.edu:22026
25b97cf7... | QEMU VM       | Nordic Semiconductor nRF52840DK                    |                    | sns30.cs.princeton.edu:22034
524aa422... | Netboot (NBD) | Nordic Semiconductor nRF52840DK                    | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22002
56f98833... | Netboot (NBD) | STMicroelectronics NUCLEO-F429ZI                   | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22014
64e5e94d... | Netboot (NBD) | Digilent Arty-A7 35T                               | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22018
8723bd6d... | Netboot (NBD) | Nordic Semiconductor nRF52840DK                    | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22010
8ff22e8e... | Netboot (NBD) | Nordic Semiconductor nRF52840DK                    | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22022
fb1384d5... | Netboot (NBD) | Nordic Semiconductor nRF52840DK Cluster (4 Boards) | Raspberry Pi 5 8GB | sns30.cs.princeton.edu:22038

Operator's Guide

Admin Snippets

Create a new Switchboard PostgreSQL Database

CREATE DATABASE "treadmill_switchboard" WITH OWNER "treadmill_switchboard" ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8';

Create a new user with admin privileges

This will prompt for username and email, automatically generate a password and dump an SQL transaction to insert the user & privilege assignment into the database:

nix-shell -p 'python3.withPackages (pypkgs: with pypkgs; [ argon2-cffi ])' --run 'python3 -c "import uuid; import secrets; import argon2; name = input(\"Name: \"); email = input(\"Email: \"); password = secrets.token_urlsafe(16); hashed = argon2.PasswordHasher().hash(password); print(\"Password:\", password); id = uuid.uuid4(); print(\"\n\nvvvvv CUT HERE vvvvv\n\nbegin;\"); print(f\"INSERT INTO tml_switchboard.users (user_id, name, email, password_hash, user_type, locked) VALUES ('"'"'{id}'"'"', '"'"'{name}'"'"', '"'"'{email}'"'"', '"'"'{hashed}'"'"', '"'"'system'"'"', false);\"); print(f\"INSERT INTO tml_switchboard.user_privileges (user_id, permission) VALUES ('"'"'{id}'"'"', '"'"'admin'"'"');\"); print(\"commit;\n\n^^^^^ CUT HERE ^^^^^\")"'

Example output:

Name: testificate
Email: foo@example.org
Password: V99gZIffbREGBCGLrfB54A


vvvvv CUT HERE vvvvv

begin;
INSERT INTO tml_switchboard.users (user_id, name, email, password_hash, user_type, locked) VALUES ('e1246bc8-c3b6-4ad7-9d13-a15a2b726a63', 'testificate', 'foo@example.org', '$argon2id$v=19$m=65536,t=3,p=4$Ih9TJgPYrJQFowXzS24Vgw$aGomGlTN1tugKS7HicqtaSBoQzfKVMkU/EOqBA8q1Dw', 'system', false);
INSERT INTO tml_switchboard.user_privileges (user_id, permission) VALUES ('e1246bc8-c3b6-4ad7-9d13-a15a2b726a63', 'admin');
commit;

^^^^^ CUT HERE ^^^^^

Make deployment configuration changes on the supervisor server & push locally

Assuming the Treadmill deployments repo is cloned at /var/state/treadmill-deployments on machine tockci-pton-srv0, we can make local edits to this repository on that machine and test them immediately:

[root@tockci-pton-srv0:/var/state/treadmill-deployments]# echo "hello world" > foo

[root@tockci-pton-srv0:/var/state/treadmill-deployments]# nixos-rebuild test # test the changes

Now, assuming that everything works, we want to commit these changes back to the upstream deployments repository without giving the machine push access. For this, stage the changes and create a commit on the remote machine. We avoid persistently setting a Git committer name or email, as the machine may be shared among multiple admins:

[root@tockci-pton-srv0:/var/state/treadmill-deployments]# git add foo

[root@tockci-pton-srv0:/var/state/treadmill-deployments]# git \
  -c user.name="Testificate" \
  -c user.email="testificate@example.org" \
  commit -m "Important changes"
[main 161743c] Important changes
 1 file changed, 1 insertion(+)
 create mode 100644 foo

Now, on your local machine, in the deployments repository, we can fetch this commit without setting up a git remote like so:

testificate@laptop treadmill-tb/deployments (main)> git fetch root@tockci-pton-srv0:/var/state/treadmill-deployments
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 266 bytes | 266.00 KiB/s, done.
From tockci-pton-srv0:/var/state/treadmill-deployments
 * branch            HEAD       -> FETCH_HEAD

We can apply these fetched changes onto our local branch like so:

  • In case the changes apply cleanly:

    testificate@laptop treadmill-tb/deployments (main)> git merge --ff-only FETCH_HEAD
    Updating a0c7fd6..161743c
    Fast-forward
     foo | 1 +
     1 file changed, 1 insertion(+)
     create mode 100644 foo
    
  • In case the refs have diverged:

    testificate@laptop treadmill-tb/deployments (main)> git rebase FETCH_HEAD
    Successfully rebased and updated refs/heads/main.
    testificate@laptop treadmill-tb/deployments (main)> git rebase origin/main
    Successfully rebased and updated refs/heads/main.
    

    In this case, the first rebase puts all the divergent commits on top of what we've fetched from the Treadmill supervisor machine, and the second inverts this: the machine commits will be applied on top of the changes in our push remote. Replace origin/main with your target branch as appropriate.

Push the changes to the upstream remote:

testificate@laptop treadmill-tb/deployments (main)> git push
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 16 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (7/7), 815 bytes | 815.00 KiB/s, done.
Total 7 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 1 local object.
To github.com:treadmill-tb/deployments.git
   a0c7fd6..161743c  main -> main

And finally, fetch the new history back onto the Treadmill supervisor machine:

[root@tockci-pton-srv0:/var/state/treadmill-deployments]# git pull --rebase
From https://github.com/treadmill-tb/deployments
   a0c7fd6..161743c  main       -> origin/main
Already up to date.

This last step syncs the (possibly rebased) history back onto the Treadmill supervisor machine, so that its checkout matches the upstream repository again.

Internals

Job Lifecycle

Treadmill jobs represent units of work schedulable on a supervisor. Each job is eligible to run on a set of supervisors, limited by a set of tag filter expressions governed by the job request, and the permissions of the user who scheduled the job.
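The eligibility rule above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Supervisor` and `JobRequest` types, the tag names, and in particular the reduction of a tag filter expression to a plain list of tags that must all be present. The filter expressions of a real deployment may be more expressive, and user permissions are modelled here simply as a set of supervisor IDs the user may schedule on.

```rust
use std::collections::HashSet;

/// Hypothetical view of a registered supervisor: an ID plus its tags.
pub struct Supervisor {
    pub id: &'static str,
    pub tags: HashSet<&'static str>,
}

/// Hypothetical job request. ASSUMPTION: the tag filter expression is
/// simplified to "all of these tags must be present on the supervisor".
pub struct JobRequest {
    pub required_tags: Vec<&'static str>,
}

/// Returns the supervisors a job may run on: those the scheduling user is
/// permitted to use and whose tags satisfy the (simplified) tag filter.
pub fn eligible_supervisors<'a>(
    supervisors: &'a [Supervisor],
    request: &JobRequest,
    permitted_ids: &HashSet<&str>,
) -> Vec<&'a Supervisor> {
    supervisors
        .iter()
        // Limit by the permissions of the user who scheduled the job.
        .filter(|s| permitted_ids.contains(s.id))
        // Limit by the tag filter of the job request.
        .filter(|s| request.required_tags.iter().all(|t| s.tags.contains(t)))
        .collect()
}
```

With three supervisors, of which two carry the requested tag and two are permitted for the user, only the supervisor that satisfies both conditions is returned. An empty result corresponds to the situation in which no supervisor is eligible for the job.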

From the point of their creation up to their successful completion or failure, jobs go through a set of state changes. A job's state is composed of two components: its execution state, and its exit status.

A job's execution state describes the current state of the schedulable unit of work and is controlled by the Switchboard (e.g., by assigning a job to a particular supervisor), and the Supervisor (e.g., by starting or terminating a Virtual Machine). A job does not necessarily transition through all defined execution states; states may be skipped. All jobs must eventually end up in the Terminated state. Only some transitions between execution states are legal. When an illegal transition towards another execution state is attempted, the Switchboard may ignore this transition, or attempt to terminate the given job. A job that has reached the Terminated state must not transition to other states; a Terminated execution state is final.

The exit status describes the user-visible state that the job assumes once it reaches the Terminated execution state. It is set primarily by the Supervisor, although some statuses can be set by the Switchboard. A job's exit status may be set multiple times, but must not change once the job's execution state reaches Terminated. The exit status communicates whether the Treadmill system was able to successfully schedule the job, whether any Treadmill-internal errors prevented its successful execution, and whether the user-defined job workload reported a success or error result. Only some transitions between exit statuses are legal. When an illegal transition towards another exit status is attempted, the previous exit status remains valid.

We describe these two components as Rust enums below.

Execution State

enum ExecutionState {
    /// A job object has been created, but it has not been assigned to a
    /// supervisor yet.
    ///
    /// This is the starting state for newly created jobs.
    Queued,

    /// A job object has been assigned to a particular supervisor.
    Scheduled,

    /// A job is scheduled on a particular supervisor and is starting or
    /// restarting.
    ///
    /// An `Initializing` job may itself report different sub-states which
    /// indicate progress while starting the job. These are for informational
    /// purposes only. Not all restarts of a job's host will re-enter the
    /// `Initializing` state.
    Initializing,

    /// The job is fully started and ready to execute user-defined workloads.
    ///
    /// A `Ready` job may itself report different sub-states which indicate
    /// progress or certain events, such as a soft-reboot of the job's host.
    /// These are for informational purposes only.
    Ready,

    /// The job has been requested to terminate.
    ///
    /// A `Terminating` job may itself report different sub-states which
    /// indicate progress of requesting a host shutdown, deallocating of
    /// resources, and other events.
    Terminating,

    /// The job has been terminated.
    ///
    /// This state is final. The job must not transition into any other
    /// execution states, and its exit status must not change.
    Terminated,
}

Only the Switchboard may transition a job into the Queued state. Only a Supervisor may transition a job into the Initializing, Ready, or Terminating states. The Terminated state can be reached in either of two ways:

  • a Supervisor initiates an explicit state transition, or
  • the Switchboard observes that a Supervisor no longer reports executing a job that was once Scheduled on it. In this case, the exit status shall be set to SupervisorDroppedJob. This may happen when a Supervisor fails or restarts.
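The rules on who may perform which transition can be sketched as a small permission check. The `Actor` type and the `may_enter` function are hypothetical names introduced for this illustration and are not part of the Treadmill crates. The rule for `Scheduled` is an assumption: the text does not state it explicitly, but the Switchboard is described as the component that assigns jobs to supervisors.

```rust
/// Execution states, mirroring the enum in this chapter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Queued,
    Scheduled,
    Initializing,
    Ready,
    Terminating,
    Terminated,
}

/// The component requesting a transition (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Actor {
    Switchboard,
    Supervisor,
}

/// Returns whether `actor` may move a job *into* `target`.
pub fn may_enter(actor: Actor, target: ExecutionState) -> bool {
    use ExecutionState::*;
    match target {
        // Only the Switchboard may put a job into the queue.
        Queued => actor == Actor::Switchboard,
        // ASSUMPTION: inferred from the Switchboard assigning jobs to
        // supervisors; not stated explicitly in the text.
        Scheduled => actor == Actor::Switchboard,
        // Host lifecycle states are reported by the Supervisor.
        Initializing | Ready | Terminating => actor == Actor::Supervisor,
        // Either a Supervisor reports termination explicitly, or the
        // Switchboard notices that the Supervisor dropped the job.
        Terminated => true,
    }
}
```

This check only covers who may request a target state. Whether the transition from the job's current state is legal is a separate question.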

Valid Transitions:

From ↓ \ To →  | Q'd | Sched | Init | Ready | Term-ing | Term'd
Queued         |  -  |       |      |       |          |
Scheduled      |     |   -   |      |       |          |
Initializing   |     |       |  -   |       |          |
Ready          |     |       |      |   -   |          |
Terminating    |     |       |      |       |    -     |
Terminated     |     |       |      |       |          |   -
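As a minimal sketch of how a transition request could be handled, the following code encodes only two rules stated in the text: `Terminated` is final, and an illegal transition may be ignored. The `Job` and `Outcome` types are hypothetical. The legality check is passed in as a closure, because this sketch does not attempt to reproduce the full transition matrix.

```rust
/// Execution states, mirroring the enum in this chapter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Queued,
    Scheduled,
    Initializing,
    Ready,
    Terminating,
    Terminated,
}

/// Result of a transition request (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Applied,
    Ignored,
}

/// Hypothetical job record that tracks only the execution state.
pub struct Job {
    pub state: ExecutionState,
}

impl Job {
    /// Newly created jobs start out as `Queued`.
    pub fn new() -> Self {
        Job { state: ExecutionState::Queued }
    }

    /// Applies a transition if it is legal, and ignores it otherwise.
    /// `is_legal(from, to)` stands in for the transition matrix.
    pub fn request_transition(
        &mut self,
        target: ExecutionState,
        is_legal: impl Fn(ExecutionState, ExecutionState) -> bool,
    ) -> Outcome {
        // `Terminated` is final, regardless of what the matrix says.
        if self.state == ExecutionState::Terminated || !is_legal(self.state, target) {
            return Outcome::Ignored;
        }
        self.state = target;
        Outcome::Applied
    }
}
```

Ignoring the request is only one of the two reactions the text allows; a Switchboard may instead attempt to terminate the job. Because states may be skipped, a job can move directly from `Queued` to `Ready` if the supplied matrix permits it.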

Exit Status

enum ExitStatus {
    /// There are no supervisors registered that this job can be scheduled on,
    /// considering the job's tag filter expression and the scheduling user's
    /// permissions.
    ///
    /// Jobs may enter this state either immediately, at the time of scheduling
    /// a job, or when no eligible supervisor is found within a timeout.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    SupervisorMatchError,

    /// There were eligible supervisors registered with the Switchboard, but
    /// the job could not be scheduled on one of them within a given timeout.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    QueueTimeout,

    /// An internal error occurred while scheduling or running this job on the
    /// supervisor. This state may optionally contain a message that contains
    /// further information on the error.
    ///
    /// This exit status may be set by both the Switchboard (e.g., when there is
    /// an error communicating with the Supervisor), or by the Supervisor.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    InternalSupervisorError,

    /// The Supervisor reports that the host failed to start.
    ///
    /// This may be due to an error in fetching the requested image, a resource
    /// that vanished (e.g., when trying to continue a previous job), failure
    /// to allocate sufficient resources, etc.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    SupervisorHostStartError,

    /// The job was canceled by a user.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    JobCanceled,

    /// The host itself reports that the user-defined workload executed
    /// successfully.
    ///
    /// This status may be reported through the Puppet process executing within
    /// the host, and may optionally contain additional user-supplied
    /// information.
    WorkloadFinishedSuccess,

    /// The host itself reports that the user-defined workload failed with an
    /// error.
    ///
    /// This status may be reported through the Puppet process executing within
    /// the host, and may optionally contain additional user-supplied
    /// information.
    WorkloadFinishedError,

    /// The host itself reports that the user-defined workload terminated,
    /// while indicating neither success nor failure.
    ///
    /// This status may be reported through the Puppet process executing within
    /// the host, and may optionally contain additional user-supplied
    /// information.
    WorkloadFinishedUnknown,

    /// The job vanished from its supervisor, without reaching the
    /// `Terminated` execution state first.
    ///
    /// This may be due to a supervisor crash or restart. This exit status
    /// can only be set by the Switchboard.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    SupervisorDroppedJob,

    /// The job timed out while dispatched.
    ///
    /// **This exit status is final.** No subsequently reported exit status may
    /// override this status.
    JobTimeout,
}

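Valid Transitions:

From ↓ \ To →            | SupMF | QTime | IntSupE | SupHSE | JobC | WFS* | SupJDrop | JobTimeout
SupervisorMatchError     |   -   |       |         |        |      |      |          |
QueueTimeout             |       |   -   |         |        |      |      |          |
InternalSupervisorError  |       |       |    -    |        |      |      |          |
SupervisorHostStartError |       |       |         |   -    |      |      |          |
JobCanceled              |       |       |         |        |  -   |      |          |
WorkloadFinished*        |       |       |         |        |      |  -   |          |
SupervisorDroppedJob     |       |       |         |        |      |      |    -     |
JobTimeout               |       |       |         |        |      |      |          |     -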
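The finality rules from the doc comments of `ExitStatus` can be sketched as follows. The `is_final` method and the `report_exit_status` function are hypothetical names used for illustration. The sketch assumes that a non-final status (`WorkloadFinished*`) can be overridden by any later report, which is a simplification: the full transition matrix may be more restrictive.

```rust
/// Exit statuses, mirroring the enum in this chapter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitStatus {
    SupervisorMatchError,
    QueueTimeout,
    InternalSupervisorError,
    SupervisorHostStartError,
    JobCanceled,
    WorkloadFinishedSuccess,
    WorkloadFinishedError,
    WorkloadFinishedUnknown,
    SupervisorDroppedJob,
    JobTimeout,
}

impl ExitStatus {
    /// All statuses except the `WorkloadFinished*` family are documented
    /// as final.
    pub fn is_final(self) -> bool {
        !matches!(
            self,
            Self::WorkloadFinishedSuccess
                | Self::WorkloadFinishedError
                | Self::WorkloadFinishedUnknown
        )
    }
}

/// Returns the exit status that is valid after `reported` arrives.
///
/// ASSUMPTION: any report may replace a missing or non-final status.
pub fn report_exit_status(
    current: Option<ExitStatus>,
    reported: ExitStatus,
    job_terminated: bool,
) -> Option<ExitStatus> {
    match current {
        // Once the job is `Terminated`, the exit status must not change.
        _ if job_terminated => current,
        // A final status cannot be overridden; the previous one stays valid.
        Some(status) if status.is_final() => current,
        // No status yet, or a non-final one: accept the new report.
        _ => Some(reported),
    }
}
```

For example, a job that first reports `WorkloadFinishedSuccess` and then runs into `JobTimeout` ends up with `JobTimeout`, and any later report leaves that status in place.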

We further impose the following restrictions on transitions of exit statuses depending on the current execution state (note the flipped rows & columns for readability):

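To Exit Status ↓ \ In Execution State → | Q'd | Sched | Init | Ready | Term-ing | Term'd
SupervisorMatchError                    |     |       |      |       |          |
QueueTimeout                            |     |       |      |       |          |
InternalSupervisorError                 |     |       |      |       |          |
SupervisorHostStartError                |     |       |      |       |          |
JobCanceled                             |     |       |      |       |          |
WorkloadFinished*                       |     |       |      |       |          |
SupervisorDroppedJob                    |     |       |      |       |          |
JobTimeout                              |     |       |      |       |          |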