Documentation

How ProgressPals works.

A short technical tour. The architecture, the full CLI, the OpenAI-compatible endpoint, and exactly what is encrypted on the way through your pals.

  • Commands: 10
  • Endpoint: localhost:11434/v1
  • Encryption: AES-256 · HKDF
  • Largest model: Llama 3.1 405B
Quickstart

From zero to swarm
in three commands.

Install the CLI, run init, create a swarm for the model you want, then send the invite link to your pals. That is the whole setup.

  • 1. Install the Python package and detect your hardware.
  • 2. Pick a model. The CLI claims the layers your machine can hold and prints an invite link.
  • 3. Share the link. Each pal runs pals join and starts contributing.
quickstart
$ pip install progresspals
$ pals init
✓ Detected NVIDIA RTX 4090 · 24 GB VRAM · 64 GB RAM
✓ Config written to ~/.progresspals/config.json (0600)
$ pals create --model meta-llama/Llama-3.1-70B
✓ Swarm created · this pal holds layers 0–14
⟶ Invite: progresspals.com/join/9k4j2z
(single-use · encrypted · regenerable)
Architecture

Pipeline parallelism, across your pals.

A modern open-source model is a tall stack of transformer layers — eighty, a hundred and twenty-six, or more. The full weights are far too large to fit in any single consumer GPU.

Distributed inference splits the stack. Each pal holds a contiguous slice. Your input flows through the chain: every pal computes only its own layers, hands the activations to the next pal, and so on until the final output emerges.

That is the win: a model whose weights total 200 GB can run across a team whose machines each hold only a fraction of that, as long as the pals between them cover every layer.

input prompt → alice@studio (layers 0–14) → encrypted activations → ben@office (layers 15–47) → encrypted activations → casey@rig (layers 48–86) → streamed response
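
A minimal sketch of the idea in plain Python — the model, names, and layer split below are illustrative, not the ProgressPals implementation:

# Illustrative sketch of pipeline parallelism: each "pal" holds a contiguous
# slice of layers and passes activations to the next pal in the chain.

def make_layer(i: int):
    # Stand-in for a transformer block: here, a tiny element-wise transform.
    return lambda x: [v + i * 0.001 for v in x]

class Pal:
    def __init__(self, name: str, layers: list):
        self.name = name
        self.layers = layers

    def forward(self, activations: list) -> list:
        # Run only this pal's slice of the stack.
        for layer in self.layers:
            activations = layer(activations)
        return activations

# 87 layers (indices 0–86) split across three pals, as in the flow above.
all_layers = [make_layer(i) for i in range(87)]
chain = [
    Pal("alice@studio", all_layers[0:15]),   # layers 0–14
    Pal("ben@office",   all_layers[15:48]),  # layers 15–47
    Pal("casey@rig",    all_layers[48:87]),  # layers 48–86
]

activations = [0.1, 0.2, 0.3]  # stand-in for the embedded input prompt
for pal in chain:
    activations = pal.forward(activations)  # in the real swarm this hop is encrypted
print(activations)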
Lifecycle

The five-command flow.

01

Detect your hardware.

Inspects VRAM, RAM and GPU type. Writes a local config at ~/.progresspals/config.json with 0600 file permissions. No network calls yet — no account required to install.

$ pals init
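
Roughly the shape of that step, sketched with common libraries (torch, psutil); the exact probes and config fields are assumptions, not the real pals init:

# Hedged sketch of hardware detection and a 0600 config write.
# Fields and probes are illustrative, not ProgressPals internals.
import json, os
import psutil
import torch

info = {"ram_gb": round(psutil.virtual_memory().total / 2**30)}
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    info["gpu"] = props.name
    info["vram_gb"] = round(props.total_memory / 2**30)

path = os.path.expanduser("~/.progresspals/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(info, f, indent=2)
os.chmod(path, 0o600)  # readable and writable by the owner only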
02

Start a private swarm.

Claims the model layers your machine can hold. Generates a single-use invite token and prints a link. Default cap is 50 pals; configurable with --max-peers.

$ pals create --model <huggingface-id>
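
Back-of-the-envelope arithmetic for how many layers a pal can claim; every number and the formula are assumptions for illustration, not the scheduler's actual logic:

# Rough illustration of layer claiming: VRAM divided by per-layer weight size.
# All figures are hypothetical, not ProgressPals measurements.
total_layers = 80                 # a hypothetical 80-layer model
model_bytes = 140 * 2**30         # ~140 GB of fp16 weights
bytes_per_layer = model_bytes / total_layers

vram_bytes = 24 * 2**30           # e.g. a 24 GB consumer GPU
headroom = 0.8                    # leave room for activations and the KV cache

claimable = int(vram_bytes * headroom // bytes_per_layer)
print(f"this pal can hold ~{claimable} of {total_layers} layers")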
03

Your pals join.

The CLI verifies the invite token, joins a private DHT namespace keyed to the swarm ID, and downloads only its assigned layer slice — much smaller than the full model.

$ pals join <link>
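
A hedged sketch of the join step's shape — keying a private namespace to the swarm ID and estimating the size of one layer slice. The hashing scheme and figures are assumptions for illustration only:

# Illustrative only: not the ProgressPals join protocol.
import hashlib

swarm_id = "9k4j2z"                                   # from the invite link
namespace = hashlib.sha256(f"progresspals/{swarm_id}".encode()).hexdigest()
print("DHT namespace:", namespace[:16], "…")

total_layers, my_layers = 80, range(15, 31)           # this pal's assigned slice
model_gb = 140                                        # full weights (hypothetical)
slice_gb = model_gb * len(my_layers) / total_layers
print(f"downloading layers {my_layers.start}–{my_layers.stop - 1} ≈ {slice_gb:.0f} GB")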
04

Inference flows through the chain.

Any pal can submit a request. Activations move through the chain one pal at a time, each running only its own layers. Output streams back token by token.

$ pals run "<prompt>"
05

Expose the swarm to your tools.

Starts a local HTTP server at http://localhost:11434/v1 that speaks the OpenAI wire format with SSE streaming. Every tool that already talks to OpenAI now talks to your swarm.

$ pals serve
CLI reference

Every command pals can run.

Ten commands, grouped by what they actually do. No daemons, no dashboards, no shadow CLI surface.

Setup

$ pals init

Detect local hardware and write a 0600 config to ~/.progresspals/config.json.

$ pals create --model <id> [--max-peers 50]

Start a private swarm for the given model. Default cap is 50 pals.

Join & inspect

$ pals join <invite-link>

Join a swarm using a single-use invite. Auto-downloads your layer slice.

$ pals status

Show this node's status and the health of every swarm it belongs to.

$ pals list

List every swarm you are a member of, with its model.

$ pals peers

List pals in the current swarm: online, offline, layers held, region, contribution.

Admin

$ pals kick <peer-id>

Creator only. Remove a pal from the swarm.

$ pals invite

Creator only. Generate a fresh single-use invite token.

Inference

$ pals run [--swarm <id>] "<prompt>"

Run inference. Defaults to the most recently used swarm.

$ pals serve [--port 11434]

Start the local OpenAI-compatible HTTP endpoint.

The endpoint

One server, every OpenAI tool.

pals serve exposes the swarm as a standard OpenAI-compatible HTTP server at http://localhost:11434/v1. The wire format is identical — same request, same response, same SSE streaming — so nothing in your stack has to know the difference.

Before · OpenAI

import os
from openai import OpenAI

client = OpenAI(
  base_url="https://api.openai.com/v1",
  api_key=os.environ["OPENAI_API_KEY"],
)

client.chat.completions.create(
  model="gpt-4o",
  messages=[...],
  stream=True,
)

After · ProgressPals

from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:11434/v1",
  api_key="any-string",
)

client.chat.completions.create(
  model="meta-llama/Llama-3.1-405B-Instruct",
  messages=[...],
  stream=True,
)

Endpoints exposed

POST /v1/chat/completions

SSE streaming · OpenAI chat shape

POST /v1/completions

Legacy completion · SSE streaming

GET /v1/models

Returns the swarm's current model
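
A short usage sketch of those endpoints through the standard openai client, assuming the swarm is already serving on the default port; the model name matches the example above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="any-string")

# Stream tokens as they come back through the swarm (SSE under the hood).
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "Explain pipeline parallelism in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# The models endpoint reports what the swarm is currently serving.
for model in client.models.list():
    print(model.id)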

Security

What is encrypted, what is stored, what is not.

Per-swarm AES-256 key

Each swarm has a 256-bit AES key derived from the invite token via HKDF. The key is computed client-side and never leaves member machines.
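
A minimal sketch of that derivation with the pyca/cryptography package; the salt, info label, and the hashing of the key for server-side verification are assumptions for illustration:

# Sketch of deriving a per-swarm AES-256 key from the invite token via HKDF.
# Labels and salt handling are illustrative, not the exact ProgressPals scheme.
import hashlib
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

invite_token = b"9k4j2z-example-invite-token"

swarm_key = HKDF(
    algorithm=hashes.SHA256(),
    length=32,                      # 256-bit AES key
    salt=None,
    info=b"progresspals/swarm-key",
).derive(invite_token)

# Only a hash of the key would ever be stored server-side, for invite verification.
key_hash = hashlib.sha256(swarm_key).hexdigest()
print(len(swarm_key) * 8, "bit key · hash:", key_hash[:16], "…")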

Encrypted activations

Activation tensors are encrypted before being sent to the next pal in the chain and decrypted on arrival. Anyone in between sees ciphertext.

Supabase stores only a hash

Supabase holds accounts, swarm metadata, the member list, and a hash of the encryption key for invite verification. Not prompts, not weights, not activations, not the key itself.

Per-hop integrity via the auth tag

Because activations travel inside AES-GCM, a pal returning garbage would have to forge a valid 128-bit auth tag without the swarm key. They can't — corruption is detected before the next layer runs and the request reroutes.
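
To make that concrete, a minimal AES-256-GCM round trip with the pyca/cryptography package: the ciphertext carries a 16-byte auth tag, so a flipped byte fails decryption instead of silently corrupting the next pal's layers. The nonce handling here is simplified for the example and is not the per-hop protocol itself:

# Illustration of per-hop encryption and integrity with AES-256-GCM.
import os
from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

swarm_key = AESGCM.generate_key(bit_length=256)   # in practice: the HKDF-derived key
aead = AESGCM(swarm_key)

activations = b"serialized activation tensor bytes"
nonce = os.urandom(12)
ciphertext = aead.encrypt(nonce, activations, None)  # appends the 128-bit auth tag

# The next pal in the chain decrypts; tag verification happens here.
assert aead.decrypt(nonce, ciphertext, None) == activations

# A tampered hop cannot forge a valid tag without the swarm key.
tampered = bytes([ciphertext[0] ^ 0xFF]) + ciphertext[1:]
try:
    aead.decrypt(nonce, tampered, None)
except InvalidTag:
    print("corruption detected before the next layer runs")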

Honest about the trust model

The first pal in your chain decrypts your input to run their layers — that is how transformer inference works at all, and no amount of cryptography changes it without a hardware enclave. The simple rule is therefore the right rule: only invite pals you would trust to see your prompts.

Hardware

What you bring to the swarm.

Operating system

Linux or macOS

Standard Python 3 environment. No special drivers beyond what your GPU already needs.

Compute

A consumer GPU

VRAM is the limiter — more VRAM, more layers per pal. NVIDIA boxes run fastest. Apple Silicon (M1+) joins and contributes too, just at lower per-pal throughput. CPU-only joins technically work, slowly.

Configuration

Auto-balanced

You do not assign layers by hand. pals init reads each pal's hardware, and the coordinator distributes layers to keep the chain even.
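
A toy version of that balancing step — layers allocated in proportion to reported VRAM. The allocation rule and numbers are assumptions for illustration, not the coordinator's actual algorithm:

# Illustrative proportional layer split across pals by reported VRAM.
def split_layers(total_layers: int, vram_by_pal: dict) -> dict:
    total_vram = sum(vram_by_pal.values())
    assignments, start = {}, 0
    pals = list(vram_by_pal.items())
    for i, (pal, vram) in enumerate(pals):
        if i == len(pals) - 1:
            count = total_layers - start          # last pal takes the remainder
        else:
            count = round(total_layers * vram / total_vram)
        assignments[pal] = range(start, start + count)
        start += count
    return assignments

print(split_layers(87, {"pal-a": 24, "pal-b": 48, "pal-c": 48}))
# pal-a gets 17 layers, pal-b and pal-c get 35 each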

That is the whole product.

Ten commands, one local endpoint, encrypted activations, and the pals you actually trust.