
Argus

Self-hostable ML experiment monitoring — batches, jobs, GPU/CPU resources, reruns, hyperopt sweeps. Real-time dashboard. Multi-user. One container.

The name comes from Argus Panoptes (Ἄργος Πανόπτης) — the hundred-eyed giant of Greek mythology, tasked with watching over what others could not see. A fitting name for a tool whose job is to keep every eye open so you don't have to.

Argus is an open-source experiment tracker for teams running long ML training jobs across one or many machines. You instrument your training script with a two-line Reporter SDK call, and Argus shows you batches, jobs, loss curves, and live GPU/CPU telemetry in a Vue 3 dashboard.

Find what you need

  • Getting started

    Install Argus with Docker, run for the first time, send your first event.

  • User guide

    Dashboard, batches, jobs, the matrix view, sharing, notifications, settings.

  • SDK reference

    The Reporter API, Hydra/Lightning/Keras callbacks, the event schema.

  • Operations

    Docker, configuration, runtime admin settings, the agent, database, retention.

  • Architecture

    How the SDK, backend, frontend, and agent fit together.

  • Contributing

    Repo layout, dev loops, style, how to add a route / migration / integration.

Snapshot

Backend: FastAPI · async SQLAlchemy 2.0 · Alembic · Python ≥3.10
Frontend: Vue 3 · TypeScript · Vite · Pinia · Ant Design Vue · ECharts
SDK: argus-reporter on PyPI; Lightning + Keras callbacks
Database: SQLite (default, WAL mode) or PostgreSQL via async driver
Auth: Email + password (argon2id), GitHub OAuth, JWT dual-key rotation
Realtime: One multiplexed Server-Sent Events connection per page
Deployment: Single Docker image, optional nginx reverse proxy
License: Apache-2.0
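
As an illustration of the Realtime entry, the sketch below shows how several logical channels can share one Server-Sent Events stream: each event carries an `event:` name, and the client routes the `data:` payload by that name. The event names (`job_epoch`, `resource`) and the `Demux` helper are hypothetical, chosen only for this example; the actual Argus event schema is described in the SDK reference.

```python
def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []          # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event; dispatch only if data was seen.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue                     # comment line, commonly used as keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]        # a single leading space is stripped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)       # multiple data lines are joined with "\n"


class Demux:
    """Route events from one stream to per-name handlers (hypothetical helper)."""

    def __init__(self):
        self.handlers = {}

    def on(self, name, handler):
        self.handlers.setdefault(name, []).append(handler)

    def feed(self, lines):
        for name, data in parse_sse(lines):
            for handler in self.handlers.get(name, ()):
                handler(data)
```

A page would register one handler per widget (loss chart, GPU panel, and so on) and feed them all from the same connection, rather than opening one connection per widget.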

A 60-second feel

from argus import Reporter

with Reporter("my-run",
              experiment_type="forecast",
              source_project="my-paper",
              n_total=1,
              monitor_url="http://localhost:8000",
              token="em_live_…") as r:
    with r.job("run-1", model="patchtst", dataset="etth1") as job:
        for epoch in range(50):
            job.epoch(epoch,
                      train_loss=..., val_loss=..., lr=..., batch_time_ms=...,
                      val_mse=..., val_rmse=..., val_mae=...,
                      val_r2=..., val_pcc=...)
        job.metrics({
            "MSE": ..., "RMSE": ...,
            "MAE": ..., "R2":   ...,
            "PCC": ...,
        })

That's the whole integration. Heartbeats, GPU snapshots, stop-signal polling, and idempotent retry-with-spill are all handled by the SDK.
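
To give a feel for what "idempotent retry-with-spill" can mean in practice, here is a minimal sketch. It is not the argus-reporter implementation: the `SpillingSender` class, the `event_id` field, and the JSON-lines spill file are assumptions made for illustration. The idea is that each event gets a client-generated id so the server can discard duplicates, and events that cannot be delivered after a few attempts are written to local disk and replayed later.

```python
import json
import uuid
from pathlib import Path


class SpillingSender:
    """Hypothetical illustration of retry-with-spill; not the SDK's code."""

    def __init__(self, send, spill_path, max_attempts=3):
        self.send = send                    # callable(event: dict); raises OSError on failure
        self.spill_path = Path(spill_path)  # JSON-lines file for undelivered events
        self.max_attempts = max_attempts

    def emit(self, payload):
        # A stable id makes re-delivery safe: the receiver can deduplicate on it.
        event = {"event_id": str(uuid.uuid4()), **payload}
        if not self._try_send(event):
            with self.spill_path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(event) + "\n")
        return event["event_id"]

    def _try_send(self, event):
        # A real client would sleep with backoff between attempts; omitted here.
        for _ in range(self.max_attempts):
            try:
                self.send(event)
                return True
            except OSError:
                continue
        return False

    def flush_spill(self):
        """Replay spilled events; keep those that still fail. Returns count sent."""
        if not self.spill_path.exists():
            return 0
        text = self.spill_path.read_text(encoding="utf-8")
        lines = [ln for ln in text.splitlines() if ln]
        remaining = [ln for ln in lines if not self._try_send(json.loads(ln))]
        self.spill_path.write_text(
            "".join(ln + "\n" for ln in remaining), encoding="utf-8"
        )
        return len(lines) - len(remaining)
```

Because the spilled event keeps its original `event_id`, replaying it after a partial failure cannot create a second record on a server that deduplicates by id. That property is what makes the retry idempotent.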

Where to next

  • If you have nothing running yet → Installation.

  • If the server is up and you want to push events → Connect a training job.

  • If you are deploying in production → Operations and Admin settings.