Connect a training job¶
Once the platform is running and you have an em_live_… SDK token, you have
several ways to push events. The vanilla Reporter is always the lowest-level
option; framework adapters wrap it.
Install¶
pip install argus-reporter # base SDK
pip install argus-reporter[lightning] # + PyTorch Lightning callback
pip install argus-reporter[keras] # + Keras callback
pip install argus-reporter[hydra] # + Hydra callback
pip install argus-reporter[all-integrations]
Python ≥3.10. Only required runtime dependency is requests.
Credentials¶
The SDK reads two env vars by default; you can also pass them as keyword args:
export ARGUS_URL=https://argus.example.com # NOTE: ARGUS_URL, not ARGUS_BASE_URL
export ARGUS_TOKEN=em_live_…
A spill directory (~/.argus-reporter/) is created on first use to buffer
events when the network is unavailable. Spill files are replayed on next
SDK start, even from a different process.
Vanilla loop¶
from argus import Reporter
with Reporter("my-run",
experiment_type="forecast",
source_project="my-paper",
n_total=1) as r:
with r.job("bert-base-lr3e-5", model="bert-base") as j:
for epoch in range(num_epochs):
if j.stopped: # platform stop button
break
train_loss, val_loss = train_one_epoch()
j.epoch(epoch, train_loss=train_loss, val_loss=val_loss)
j.metrics({"val_loss": val_loss}) # surfaced on job_done
Reporter opens a batch; each r.job opens a job inside that batch.
Multiple jobs in one batch is how you record a sweep that shares a single
training script invocation.
PyTorch Lightning¶
from argus.integrations.lightning import ArgusCallback
import pytorch_lightning as pl # or: import lightning as L
trainer = pl.Trainer(callbacks=[ArgusCallback(
project="my-paper",
job_id="bert-base",
model="bert-base-uncased", # optional metadata on job_start
dataset="glue/sst2", # optional metadata on job_start
argus_url="http://localhost:8000", # or rely on ARGUS_URL
token="em_live_…", # or rely on ARGUS_TOKEN
# Tell the adapter which keys in trainer.callback_metrics map to
# train_loss / val_loss. The first key that resolves to a finite
# float wins. Defaults shown:
train_loss_keys=("train_loss_epoch", "train_loss", "loss"),
val_loss_keys=("val_loss_epoch", "val_loss"),
# Final-epoch headline metrics — read once on on_train_end and
# passed through JobContext.metrics. Anything you self.log() in
# your LightningModule can land here.
final_metric_keys=("val_mse", "val_rmse", "val_mae",
"val_r2", "val_pcc"),
# Daemons default OFF for Lightning runs (vs True for raw Reporter).
heartbeat=False, stop_polling=False, resource_snapshot=False,
)])
trainer.fit(model, datamodule)
Constructor (kwargs only, from client/argus/integrations/lightning.py):
| Argument | Default | Notes |
|---|---|---|
project |
required | source_project on Reporter |
job_id |
required | identifier for this run |
model, dataset |
None | propagated into job_start |
argus_url, token |
env | fallback to ARGUS_URL / ARGUS_TOKEN |
experiment_type |
"lightning" |
forwarded to Reporter |
batch_prefix |
"lightning" |
prefix for the auto-generated batch id |
train_loss_keys |
("train_loss_epoch", "train_loss", "loss") |
first match wins |
val_loss_keys |
("val_loss_epoch", "val_loss") |
first match wins |
final_metric_keys |
() |
read at on_train_end and passed to JobContext.metrics |
heartbeat, stop_polling, resource_snapshot |
False |
opt-in |
auto_upload_dirs |
None | folders whose images/PDFs upload on clean exit |
Hook coverage:
| Lightning hook | Argus event |
|---|---|
on_train_start |
Reporter.__enter__ + JobContext.__enter__ (batch_start + job_start) |
on_train_epoch_end |
JobContext.epoch with train_loss, lr |
on_validation_epoch_end |
JobContext.epoch with val_loss |
on_train_end |
clean exit; final metrics flushed via JobContext.metrics |
on_exception |
failure exit (job_failed + batch_failed) |
The two epoch hooks are dedup'd against each other (Lightning 2.x fires
on_validation_epoch_end before on_train_epoch_end within the same
epoch — the adapter emits exactly one job_epoch per epoch). Works on
Lightning 1.9+ and 2.x — both the pytorch_lightning and
lightning.pytorch namespaces are tried.
Keras¶
from argus.integrations.keras import ArgusCallback
cb = ArgusCallback(
project="my-paper",
job_id="mnist_cnn",
model="cnn-32-32-64",
dataset="mnist",
argus_url="http://localhost:8000",
token="em_live_…",
train_loss_keys=("loss",), # default
val_loss_keys=("val_loss",), # default
final_metric_keys=("accuracy", "val_accuracy"), # read from last-epoch logs
)
model.fit(x, y, epochs=10, callbacks=[cb])
Constructor mirrors the Lightning shape; the only differences are
experiment_type / batch_prefix defaults ("keras") and the per-epoch
keys, which read from the per-epoch logs dict Keras passes to
on_epoch_end.
Hook coverage: on_train_begin opens the Reporter and JobContext;
on_epoch_end emits one job_epoch; on_train_end flushes final
metrics and closes both contexts cleanly. Keras has no
on_exception; the callback's destructor best-effort emits a failure
event, and you can call cb.report_failure(exc) from a try/except for
deterministic reporting.
Works with both Keras 3 (import keras) and the TF-bundled Keras 2.
Hydra¶
# configs/config.yaml
hydra:
callbacks:
argus:
_target_: argus.integrations.hydra.ArgusCallback
project: my-paper
experiment_type: forecast
main.py is unchanged. Set ARGUS_URL / ARGUS_TOKEN in the env and every
python main.py … (or -m for a sweep) emits one batch with N jobs.
The adapter handles single-run vs multirun automatically and stamps the
right Hydra job number as job_id. See Hydra callback
for full options (custom job ids, daemon toggles, Optuna study labels for
the Studies tab).
Crash-resume¶
For long sweeps that may crash mid-flight, use derive_batch_id to
guarantee the relaunched run lands on the same Batch row:
from argus import Reporter, derive_batch_id
batch_id = derive_batch_id("my-paper", "dam_forecast") # uses git rev-parse HEAD
with Reporter(source_project="my-paper",
experiment_type="forecast",
n_total=120,
batch_id=batch_id) as r:
...
Same checkout → same id → same Batch row on the backend. See Batch identity & resume.
What gets sent¶
- Heartbeat every 5 minutes (keeps the batch in running state).
- Stop-signal poll every 10 seconds (reads
/api/batches/{id}/stop-requested). - Resource snapshot every 30 seconds (CPU / RSS / GPU util / GPU mem).
- Job epoch when you call
job.epoch(...). - Log line when you call
job.log(...)(off until you call it).
All events carry a UUID event_id; the backend dedupes by it, so retried
posts after a network hiccup are safe.
Troubleshooting¶
| Symptom | Likely cause |
|---|---|
| 401 Unauthorized | Token revoked, or you copied the prefix-only display |
| Batch stuck running | Heartbeat skipped; backend marks it stalled after 15 min (ARGUS_STALL_TIMEOUT_MIN) |
ARGUS_DISABLE=1 set |
SDK no-ops; check the env in the runner |
ARGUS_URL not picked up |
Older code may have used MONITOR_URL — that name is no longer recognised |
For deeper SDK reference see Reporter API.