New · Private distributed inference is here

Private AI clusters pooled over the internet, on your hardware.
No VPS, no API fees.

Run massive open-source models no single machine could host. Send one invite link, everyone joins, everyone shares the compute.

$ pip install progresspals

~ — pals create
$ pals create --model meta-llama/Llama-3.1-405B-Instruct --max-peers 8
✓ Detected hardware: 24 GB VRAM, 64 GB RAM
✓ Swarm created · this peer holds layers 0–14
⟶ Invite link: progresspals.com/join/9k4j2z
(single-use · encrypted · regenerable)
$ pals peers
alice@studio layers 0–14 ↓ 18 t/s
ben@office layers 15–47 ↓ 14 t/s
casey@rig layers 48–86 ↓ 22 t/s
─ 3 peers online · 87/126 layers covered · waiting on 5 more
How it works

Three commands. Real distributed inference.

No coordination overhead, no Kubernetes, no public swarms. Just your people, your hardware, your model.

01

Create a swarm

Pick a model. The CLI detects your hardware, claims the layers you can hold, and generates a single-use invite link.

$ pals create --model meta-llama/Llama-3.1-70B
02

Invite your team

Share one link. Each peer joins, downloads only their assigned slice of the model, and contributes their GPU/RAM.

$ pals join progresspals.com/join/9k4j2z
03

Run it like OpenAI

Start the local OpenAI-compatible endpoint. Point Cursor, Aider, Continue, or any SDK at it. Inference flows through the chain.

$ pals serve --port 11434
The killer feature

A drop-in replacement for OpenAI — running on your team's hardware.

pals serve exposes the swarm as a local OpenAI-compatible endpoint at http://localhost:11434/v1. Any tool that speaks the OpenAI API works unchanged — point it at your endpoint and it codes, chats, and reasons through your private cluster.

Cursor · Aider · Continue.dev · Cline · Roo Code · n8n · OpenAI SDK · LangChain · Custom scripts

POST /v1/chat/completions · POST /v1/completions · GET /v1/models · SSE streaming

~ — pals serve + Cursor
$ pals serve --port 11434
✓ OpenAI-compatible endpoint live
http://localhost:11434/v1
routing through 6 peers · streaming enabled
# In Cursor: Settings → OpenAI Base URL
→ http://localhost:11434/v1
→ model: meta-llama/Llama-3.1-405B-Instruct
$ curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"...","messages":[...]}'
{ "id": "cmpl-9k4j2z", "choices": [...
For AI agents

Plug it into the tools your team
already uses.

Because the swarm exposes a standard OpenAI-compatible endpoint, anything in your agent stack — coding harnesses, gateways, frameworks — just works.

In your editor

Coding agents

Your team's swarm becomes the brain inside the IDE. Point the agent at the local endpoint and it codes, edits, and refactors against your shared cluster.

Cursor · Cline · Roo Code · Continue.dev · Aider · Zed
# Cursor → Settings → Models → Custom OpenAI Base URL
http://localhost:11434/v1

# Aider
aider --openai-api-base http://localhost:11434/v1 \
      --openai-api-key any-string
Self-hosted gateways

Personal AI agents

Self-hosted agents and assistants that already speak the OpenAI API. Swap the provider URL for the swarm and they run on your team's hardware instead of someone else's GPUs.

OpenClaw · Open WebUI · Open Hands · Plandex · LiteLLM proxy
# Most gateways read these standard env vars:
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=any-string
Build your own

Agent frameworks & SDKs

Stack your own agents on top. Anything built on the OpenAI SDK accepts a base_url override — your swarm becomes the model layer underneath multi-agent orchestration, RAG, evals, anything.

LangChain · LlamaIndex · AutoGen · CrewAI · Vercel AI SDK · OpenAI SDK
from openai import OpenAI

# Point the standard OpenAI SDK at the swarm's local endpoint.
# The SDK requires a key, but the endpoint accepts any string.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="any-string",
)

client.chat.completions.create(
    model="meta-llama/Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "..."}],
)

Every tool listed accepts a custom OpenAI base URL. If yours does, it will too — there is no special integration, just the standard /v1/chat/completions contract with SSE streaming.
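
The streaming half of that contract is plain SSE. As an illustrative sketch (the function name and chunk handling here are ours, but the `data:` / `[DONE]` framing is the standard OpenAI streaming format), a client can reassemble streamed text like this:

```python
import json

def collect_sse_content(lines):
    """Accumulate assistant text from OpenAI-style SSE 'data:' lines."""
    text = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # SSE comments and keep-alive blanks carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI's end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)
```

In practice the OpenAI SDK and every tool above do this for you; the sketch just shows there is no magic in the wire format.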

Features

Built for teams that want
their own models, privately.

Everything you need to stand up a serious cluster with people you trust — without renting a single GPU.

Invite-only swarms

No public discovery, no random peers. Single-use tokens, regenerable, expiring. Only people you invite can join.

OpenAI-compatible endpoint

`pals serve` exposes /v1/chat/completions, /v1/completions, /v1/models with SSE streaming. Cursor, Aider, Continue work unchanged.

Encrypted activations

Per-swarm AES-256 key derived from the invite token via HKDF. Activation tensors are encrypted before leaving each peer.

Pipeline parallelism

Built on Petals. Each peer holds a contiguous slice of the model. The coordinator balances layer assignment automatically.
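
The balancing idea can be sketched in a few lines of Python (illustrative only, not the coordinator's actual code): give each peer a contiguous layer range proportional to its VRAM.

```python
def assign_layers(total_layers, peer_vram_gb):
    """Split layers into contiguous ranges proportional to each peer's VRAM.

    Sketch only; the real coordinator also accounts for quantization,
    RAM offload, and peers joining or leaving mid-session.
    """
    total_vram = sum(peer_vram_gb.values())
    peers = list(peer_vram_gb.items())
    ranges, start = {}, 0
    for i, (peer, vram) in enumerate(peers):
        if i == len(peers) - 1:
            count = total_layers - start  # last peer takes the remainder
        else:
            count = round(total_layers * vram / total_vram)
        ranges[peer] = (start, start + count - 1)
        start += count
    return ranges
```

With equal 24 GB peers and a 126-layer model, each peer ends up with a 42-layer contiguous slice; uneven VRAM yields uneven slices like the `0–14` / `15–47` / `48–86` split in the example above.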

Run 405B on consumer GPUs

Llama 405B, Mixtral 8x22B, Falcon 180B, BLOOM 176B — models no single machine can hold. Spread them across 4, 8, 20 peers.

Member controls

Live peer list with status, layers held, throughput. Kick peers, regenerate invites, cap swarm size — all from the CLI or TUI.

Supported models

Llama 405B on a few laptops.

Anything Petals supports out of the box, your swarm can run. New architectures land as upstream Petals adds them.

HuggingFace model IDs work directly — just pass --model to pals create.

Llama

Meta
  • 2 70B
  • 3 8B
  • 3 70B
  • 3.1 8B
  • 3.1 70B
  • 3.1 405B
  • 3.3 70B

Mixtral

Mistral
  • 8x7B
  • 8x22B

Falcon

TII
  • 40B
  • 180B

BLOOM

BigScience
  • 176B
Security · honest disclosure

We tell you exactly what the trust model is.

Only invite people you trust.

We are not a public network. There is no swarm discovery, no stranger prompts, no content moderation queue. Your swarm is exactly the people you sent the link to.

Activations are encrypted in transit.

We derive a 256-bit AES key from your invite token via HKDF. Tensors are encrypted before leaving a peer and decrypted on arrival. The key never leaves member machines — Supabase only stores a hash.
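
For the curious, the HKDF construction (RFC 5869) is small enough to sketch with Python's standard library. This is an illustration of the algorithm, not our key-derivation code; the salt and info values shown are placeholders:

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF (extract-then-expand) with SHA-256."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # HKDF-Extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # HKDF-Expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# Placeholder inputs for illustration; the real salt/info values are internal.
key = hkdf_sha256(b"invite-token-9k4j2z", salt=b"swarm-salt", info=b"activation-key")
# len(key) == 32 bytes, i.e. a 256-bit AES key
```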

What we do not pretend to solve.

P2P inference exposes IP addresses to other swarm members. The first peer in the chain sees decrypted inputs. We sandbox computation where the OS allows it, but this is not a hardware enclave. Use a VPN if the threat model demands it.

Pricing

Free. The whole thing.

No paid tier yet. We will add one based on what teams actually ask for — not before.

Free
$0 / month

Everything. No card. No usage caps beyond the 50-peer hard ceiling.

  • Create private swarms up to 50 peers
  • Single-use invite tokens, regenerable
  • Encrypted activations (AES-256, HKDF-derived)
  • Member list, kick, throughput stats
  • Full CLI access (pals init / create / join / run)
  • OpenAI-compatible local endpoint (pals serve)
  • TUI dashboard (Textual)
  • Supabase-backed auth and invite verification
FAQ

Questions teams ask
before they install.

If yours is not here, the answer is probably either in how it works or in the trust model.

What is ProgressPals?

Private, peer-to-peer AI inference. You and a small group of trusted people pool your hardware over the internet to run large open-source models that no single machine could host on its own. One CLI, one invite link, one local OpenAI-compatible endpoint.

What models can my swarm run?

Llama 2, 3, 3.1 and 3.3 up to 405B, Mixtral 8x7B and 8x22B, Falcon 40B and 180B, BLOOM 176B. Pass any supported HuggingFace model ID to pals create --model.

Can my team use it with Cursor, Aider, or our agent framework?

Yes. pals serve exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Point Cursor, Cline, Roo Code, Continue, Aider, Zed, OpenClaw, Open WebUI, n8n, LangChain, LlamaIndex, AutoGen, CrewAI, the Vercel AI SDK, or anything that uses the OpenAI SDK directly at it — no code changes.

Who can see my prompts?

The first peer in your chain decrypts your input to run their layers — that is inherent to how pipelined transformer inference works. Activations between subsequent peers are encrypted with a per-swarm AES-256 key derived from the invite token. The trust model is simple and honest: only invite people you would trust to see your prompts.

Why private swarms only?

Public AI swarms create content moderation queues, expose users to stranger prompts, and pile on legal liability. Removing public swarms removes all three. You only compute on (and decrypt inputs from) people you actually invited.

Do I need a GPU?

Strongly recommended. Each peer's contribution scales with how many model layers their VRAM can hold. CPU-only peers can technically join, but throughput will be slow enough that you probably want at least one consumer GPU per peer.

Does it work on Apple Silicon (M1, M2, M3, M4)?

Yes. Apple Silicon Macs can join any swarm and contribute layers — the install needs one extra build step the CLI handles for you. Per-peer throughput is lower than on equivalent NVIDIA hardware in v1 (we use PyTorch's Metal path), so a Mac is best as one peer in a mixed swarm or as a client running pals serve. A Mac-native MLX backend is on the roadmap.

How many peers do I need for a big model?

It depends on the model and how aggressively it is quantized, but the rule is intuitive: more layers in the model, or less VRAM per peer, means more peers. The CLI detects each peer's hardware on join and the coordinator auto-assigns layer ranges to balance the chain.
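
As a back-of-envelope sketch (every number here is a rough illustrative assumption, not a measured figure), you can estimate the floor from model size, quantization, and usable VRAM per peer:

```python
import math

def peers_needed(params_billions, bits_per_weight, usable_vram_gb):
    """Floor estimate: quantized weight size divided by usable VRAM per peer.

    Ignores KV cache, activation memory, and runtime overhead, so treat
    the result as a minimum, not a guarantee.
    """
    weight_gb = params_billions * bits_per_weight / 8  # 1B params ~= 1 GB at 8-bit
    return math.ceil(weight_gb / usable_vram_gb)

# e.g. a 405B model quantized to 4-bit, peers with ~20 GB of usable VRAM each:
print(peers_needed(405, 4, 20))  # → 11
```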

Is it really free?

Yes. No paid tier yet. We will add one when we have real signal from teams about what is worth charging for — not before.

Start your first swarm
in under five minutes.

Linux and macOS. Works with any consumer GPU Petals supports.

install
$ pip install progresspals
$ pals init
✓ Detected: NVIDIA RTX 4090 · 24 GB VRAM · 64 GB RAM
✓ Config written to ~/.progresspals/config.json (0600)
$ pals create --model meta-llama/Llama-3.1-70B
⟶ progresspals.com/join/9k4j2z