Argus
Self-hostable ML experiment monitoring — batches, jobs, GPU/CPU resources, reruns, hyperopt sweeps. Real-time dashboard. Multi-user. One container.
The name comes from Argus Panoptes (Ἄργος Πανόπτης) — the hundred-eyed giant of Greek mythology, tasked with watching over what others could not see. A fitting name for a tool whose job is to keep every eye open so you don't have to.
Argus is an open-source experiment tracker for teams running long ML training
jobs across one or many machines. You instrument your training script with a
two-line Reporter SDK call, and Argus shows you batches, jobs, loss curves,
and live GPU/CPU telemetry in a Vue 3 dashboard.
Find what you need
- :rocket: Getting started
  Install Argus with Docker, run it for the first time, send your first event.
- :books: User guide
  Dashboard, batches, jobs, the matrix view, sharing, notifications, settings.
- :snake: SDK reference
  The Reporter API, Hydra/Lightning/Keras callbacks, the event schema.
- :wrench: Operations
  Docker, configuration, runtime admin settings, the agent, database, retention.
- :building_construction: Architecture
  How the SDK, backend, frontend, and agent fit together.
- :handshake: Contributing
  Repo layout, dev loops, style, how to add a route / migration / integration.
Snapshot
| Component | Stack |
| --- | --- |
| Backend | FastAPI · async SQLAlchemy 2.0 · Alembic · Python ≥3.10 |
| Frontend | Vue 3 · TypeScript · Vite · Pinia · Ant Design Vue · ECharts |
| SDK | argus-reporter on PyPI; Lightning + Keras callbacks |
| Database | SQLite (default, WAL mode) or PostgreSQL via async driver |
| Auth | Email + password (argon2id), GitHub OAuth, JWT dual-key rotation |
| Realtime | One multiplexed Server-Sent Events connection per page |
| Deployment | Single Docker image, optional nginx reverse proxy |
| License | Apache-2.0 |
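The Auth row mentions JWT dual-key rotation. As a rough illustration of the general idea, and not of how Argus itself implements it (this page does not show that), the sketch below signs HS256 tokens with the Python standard library and verifies them against a list of keys. During a rotation window the verifier accepts both the current and the previous key; once the window closes, only the current key remains. The function names `sign` and `verify` and the key values are hypothetical, and expiry and claim checks are left out for brevity.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """URL-safe base64 without padding, as used in compact JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _signature(signing_input: bytes, key: bytes) -> str:
    """HMAC-SHA256 over the header.payload string, base64url-encoded."""
    return _b64url(hmac.new(key, signing_input, hashlib.sha256).digest())


def sign(payload: dict, key: bytes) -> str:
    """Build a compact HS256 token for `payload`, signed with `key`."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode("ascii")
    return f"{header}.{body}.{_signature(signing_input, key)}"


def verify(token: str, keys: list[bytes]) -> bool:
    """Return True if any key in `keys` produced the token's signature."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    header, body, sig = parts
    signing_input = f"{header}.{body}".encode("utf-8")
    for key in keys:
        expected = _signature(signing_input, key).encode("ascii")
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(expected, sig.encode("utf-8")):
            return True
    return False


# Hypothetical keys: tokens issued before the rotation used `previous_key`.
previous_key, current_key = b"previous-secret", b"current-secret"
old_token = sign({"sub": "alice"}, previous_key)
new_token = sign({"sub": "alice"}, current_key)
```

With both keys configured, `verify(old_token, [current_key, previous_key])` accepts the older token, so users are not logged out the moment the key changes. Dropping `previous_key` from the list ends the window and the older token is rejected.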
A 60-second feel
from argus import Reporter

# `...` marks values your training loop supplies.
with Reporter("my-run",
              experiment_type="forecast",
              source_project="my-paper",
              n_total=1,
              monitor_url="http://localhost:8000",
              token="em_live_…") as r:
    with r.job("run-1", model="patchtst", dataset="etth1") as job:
        for epoch in range(50):
            # Report per-epoch training and validation values.
            job.epoch(epoch,
                      train_loss=..., val_loss=..., lr=..., batch_time_ms=...,
                      val_mse=..., val_rmse=..., val_mae=...,
                      val_r2=..., val_pcc=...)
        # Report the job's final metrics once training is done.
        job.metrics({
            "MSE": ..., "RMSE": ...,
            "MAE": ..., "R2": ...,
            "PCC": ...,
        })
That's the whole integration. Heartbeats, GPU snapshots, stop-signal polling, and idempotent retry-with-spill are all handled by the SDK.
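To make "idempotent retry-with-spill" more concrete, here is a minimal sketch of the general pattern, assuming nothing about the SDK's internals: each event gets an id once, delivery is retried a few times, undeliverable events are appended to a local JSONL file, and a later replay resends them with the same id so the receiver can deduplicate. The names `SpillingSender`, `FakeServer`, `event_id`, and `spill.jsonl` are hypothetical and are not part of the argus-reporter API.

```python
import json
import os
import tempfile
import uuid
from typing import Callable


class SpillingSender:
    """Send events with retries; park undeliverable ones in a JSONL file."""

    def __init__(self, send: Callable[[dict], None], spill_path: str,
                 max_attempts: int = 3) -> None:
        self.send = send
        self.spill_path = spill_path
        self.max_attempts = max_attempts

    def _try_send(self, event: dict) -> bool:
        for _ in range(self.max_attempts):
            try:
                self.send(event)
                return True
            except ConnectionError:
                continue  # a real client would sleep with backoff here
        return False

    def emit(self, kind: str, data: dict) -> bool:
        # The id is assigned once, so every retry and replay carries the
        # same id and the receiver can drop duplicates.
        event = {"event_id": str(uuid.uuid4()), "kind": kind, "data": data}
        if self._try_send(event):
            return True
        with open(self.spill_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return False

    def replay(self) -> int:
        """Resend spilled events; keep only the ones that still fail."""
        if not os.path.exists(self.spill_path):
            return 0
        with open(self.spill_path, encoding="utf-8") as fh:
            pending = [json.loads(line) for line in fh if line.strip()]
        failed = [e for e in pending if not self._try_send(e)]
        with open(self.spill_path, "w", encoding="utf-8") as fh:
            for e in failed:
                fh.write(json.dumps(e) + "\n")
        return len(pending) - len(failed)


class FakeServer:
    """Stand-in receiver that deduplicates on event_id."""

    def __init__(self) -> None:
        self.up = False
        self.seen: dict[str, dict] = {}

    def receive(self, event: dict) -> None:
        if not self.up:
            raise ConnectionError("server unreachable")
        self.seen[event["event_id"]] = event


def demo() -> tuple[bool, int, int]:
    server = FakeServer()
    with tempfile.TemporaryDirectory() as tmp:
        sender = SpillingSender(server.receive, os.path.join(tmp, "spill.jsonl"))
        delivered = sender.emit("epoch", {"epoch": 0, "train_loss": 0.42})
        server.up = True  # the server comes back
        replayed = sender.replay()
    return delivered, replayed, len(server.seen)
```

Calling `demo()` returns `(False, 1, 1)`: the first send fails and is spilled, the replay delivers it, and the receiver ends up with exactly one stored event. Training never blocks on the monitoring server in this pattern, at the cost of events arriving late when the network is down.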
Where to next
- If you have nothing running yet → Installation.
- If the server is up and you want to push events → Connect a training job.
- If you are deploying in production → Operations and Admin settings.