How ProgressPals works.
A short technical tour. The architecture, the full CLI, the OpenAI-compatible endpoint, and exactly what is encrypted on the way through your pals.
From zero to swarm in three commands.
Install the CLI, run init, create a swarm for the model you want, then send the invite link to your pals. That is the whole setup.
1. Install the Python package and detect your hardware.
2. Pick a model. The CLI claims the layers your machine can hold and prints an invite link.
3. Share the link. Each pal runs pals join and starts contributing.
Pipeline parallelism, across your pals.
A modern open-source model is a tall stack of transformer layers — eighty, a hundred and twenty-six, more. The full weights are far too large to fit in any single consumer GPU.
Distributed inference splits the stack. Each pal holds a contiguous slice. Your input flows through the chain: every pal computes only its own layers, hands the activations to the next pal, and so on until the final output emerges.
That is the whole win: a model whose weights total 200 GB can run across a team whose machines each hold only a fraction of it, as long as the pals together cover every layer.
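In miniature, the chain looks like this. A toy sketch: Pal and run_pipeline are illustrative names, the "layers" are plain arithmetic stand-ins for transformer blocks, and the network hop between pals is just a function call.

```python
from typing import Callable, List

class Pal:
    """One machine holding a contiguous slice of the layer stack."""
    def __init__(self, name: str, layers: List[Callable]):
        self.name = name
        self.layers = layers

    def forward(self, activations):
        # Each pal runs only its own layers.
        for layer in self.layers:
            activations = layer(activations)
        return activations

def run_pipeline(pals: List[Pal], x):
    # Activations hop pal to pal until the final output emerges.
    for pal in pals:
        x = pal.forward(x)
    return x

# Six "layers" (simple additions) split across three pals of two layers each.
layers = [lambda v, k=k: v + k for k in range(6)]
pals = [Pal(f"pal{i}", layers[2 * i : 2 * i + 2]) for i in range(3)]
print(run_pipeline(pals, 0))  # 0+0+1+2+3+4+5 = 15
```

No single pal ever holds all six layers, yet the chain computes the full stack.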
The five-command flow.
Detect your hardware.
Inspects VRAM, RAM and GPU type. Writes a local config at ~/.progresspals/config.json with 0600 file permissions. No network calls yet — no account required to install.
Start a private swarm.
Claims the model layers your machine can hold. Generates a single-use invite token and prints a link. Default cap is 50 pals; configurable with --max-peers.
Your pals join.
The CLI verifies the invite token, joins a private DHT namespace keyed to the swarm ID, and downloads only its assigned layer slice — much smaller than the full model.
Inference flows through the chain.
Any pal can submit a request. Activations move through the chain one pal at a time, each running only its own layers. Output streams back token by token.
Expose the swarm to your tools.
Starts a local HTTP server at http://localhost:11434/v1 that speaks the OpenAI wire format with SSE streaming. Every tool that already talks to OpenAI now talks to your swarm.
Every command pals can run.
Ten commands, grouped by what they actually do. No daemons, no dashboards, no shadow CLI surface.
Setup
$ pals init
Detect local hardware and write a 0600 config to ~/.progresspals/config.json.

$ pals create --model <id> [--max-peers 50]
Start a private swarm for the given model. Default cap is 50 pals.
Join & inspect
$ pals join <invite-link>
Join a swarm using a single-use invite. Auto-downloads your layer slice.

$ pals status
Show this node's status and the health of every swarm it belongs to.

$ pals list
List every swarm you are a member of, with its model.

$ pals peers
List pals in the current swarm: online, offline, layers held, region, contribution.
Admin
$ pals kick <peer-id>
Creator only. Remove a pal from the swarm.

$ pals invite
Creator only. Generate a fresh single-use invite token.
Inference
$ pals run [--swarm <id>] "<prompt>"
Run inference. Defaults to the most recently used swarm.

$ pals serve [--port 11434]
Start the local OpenAI-compatible HTTP endpoint.
One server, every OpenAI tool.
pals serve exposes the swarm as a standard OpenAI-compatible HTTP server at http://localhost:11434/v1. The wire format is identical — same request, same response, same SSE streaming — so nothing in your stack has to know the difference.
Before · OpenAI
from openai import OpenAI

client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key=OPENAI_API_KEY,
)
client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True,
)
After · ProgressPals
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="any-string",
)
client.chat.completions.create(
    model="meta-llama/Llama-3.1-405B-Instruct",
    messages=[...],
    stream=True,
)
Endpoints exposed
POST /v1/chat/completions · OpenAI chat shape · SSE streaming
POST /v1/completions · Legacy completion · SSE streaming
GET /v1/models · Returns the swarm's current model
What is encrypted, what is stored, what is not.
Per-swarm AES-256 key
Each swarm has a 256-bit AES key derived from the invite token via HKDF. The key is computed client-side and never leaves member machines.
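A minimal sketch of that derivation using only the standard library. The salt and info labels here are assumptions for illustration; the document does not pin down ProgressPals' exact HKDF parameters.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF per RFC 5869 with SHA-256."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                            # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# Illustrative inputs, not the real parameters.
invite_token = b"single-use-invite-token"
swarm_key = hkdf_sha256(invite_token, salt=b"swarm-id", info=b"progresspals-v1")

# Only a hash of the key ever leaves the machine, for invite verification.
stored_hash = hashlib.sha256(swarm_key).hexdigest()
assert len(swarm_key) == 32           # 256-bit AES key
assert stored_hash != swarm_key.hex() # the server never sees the key itself
```

Because HKDF is deterministic, every member who holds the invite token computes the identical key client-side, and nothing secret has to travel.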
Encrypted activations
Activation tensors are encrypted before being sent to the next pal in the chain and decrypted on arrival. Anyone in between sees ciphertext.
Supabase stores only a hash
Supabase holds accounts, swarm metadata, the member list, and a hash of the encryption key for invite verification. Not prompts, not weights, not activations, not the key itself.
Per-hop integrity via the auth tag
Because activations travel inside AES-GCM, a pal returning garbage would have to forge a valid 128-bit auth tag without the swarm key. They can't — corruption is detected before the next layer runs and the request reroutes.
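A sketch of that failure mode, using the third-party cryptography package's AESGCM primitive (an assumption about tooling; the document does not name a library, and the nonce handling here is simplified).

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

swarm_key = os.urandom(32)       # stands in for the HKDF-derived swarm key
aead = AESGCM(swarm_key)

activations = bytes(range(16))   # stand-in for a serialized activation tensor
nonce = os.urandom(12)
ciphertext = aead.encrypt(nonce, activations, None)  # appends a 128-bit tag

# Honest pal: decrypts cleanly.
assert aead.decrypt(nonce, ciphertext, None) == activations

# Faulty or malicious pal flips one byte: the tag check fails before
# the next layer ever runs.
tampered = bytes([ciphertext[0] ^ 0xFF]) + ciphertext[1:]
try:
    aead.decrypt(nonce, tampered, None)
except InvalidTag:
    print("corruption detected, rerouting request")
```

The point is that forging a valid tag requires the swarm key, so garbage surfaces as an InvalidTag error instead of silently poisoning downstream layers.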
Honest about the trust model
The first pal in your chain decrypts your input to run their layers — that is how transformer inference works at all, and no amount of cryptography changes it without a hardware enclave. The simple rule is therefore the right rule: only invite pals you would trust to see your prompts.
What you bring to the swarm.
Linux or macOS
Standard Python 3 environment. No special drivers beyond what your GPU already needs.
A consumer GPU
VRAM is the limiter — more VRAM, more layers per pal. NVIDIA boxes run fastest. Apple Silicon (M1+) joins and contributes too, just at lower per-pal throughput. CPU-only joins technically work, slowly.
Auto-balanced
You do not assign layers by hand. pals init reads each pal's hardware, and the coordinator distributes layers to keep the chain even.
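The balancing idea can be sketched as a proportional split. This is a hypothetical policy with an illustrative function name; the real coordinator's algorithm is not specified here.

```python
from typing import Dict, List

def assign_layers(num_layers: int, vram_gb: Dict[str, int]) -> Dict[str, List[int]]:
    """Split num_layers into contiguous slices, proportional to each pal's VRAM."""
    total = sum(vram_gb.values())
    pals = list(vram_gb)
    plan, start = {}, 0
    for i, pal in enumerate(pals):
        if i == len(pals) - 1:
            count = num_layers - start  # last pal absorbs any rounding slack
        else:
            count = round(num_layers * vram_gb[pal] / total)
        plan[pal] = list(range(start, start + count))
        start += count
    return plan

# A 126-layer model across three pals: the 24 GB card carries twice
# the layers of each 12 GB card.
plan = assign_layers(126, {"alice": 24, "bob": 12, "carol": 12})
```

Keeping each slice contiguous matters: activations then make exactly one network hop per pal instead of bouncing back and forth.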
That is the whole product.
Ten commands, one local endpoint, encrypted activations, and the pals you actually trust.