
Reporter API

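The argus package exposes two public classes plus a set of module-level helper functions. Everything else is an implementation detail and may change between minor versions.

Quick start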

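from argus import Reporter

# The first positional argument is batch_prefix; the keywords go into batch_start.
with Reporter("my-run",
              experiment_type="forecast",
              source_project="demo",
              n_total=2) as r:
    # Each job gets its own context, which emits the job_* events.
    with r.job("j1", model="patchtst", dataset="etth1") as j:
        for ep in range(50):
            j.epoch(ep, train_loss=0.5, val_loss=0.6)
            # Leave the loop once a stop has been requested.
            if j.stopped:
                break
        j.metrics({"MSE": 0.21})  # surfaced when the job context exits
        j.upload("outputs/run/visualizations")  # default glob: **/*.png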

The outer with block emits batch_start and batch_done (or batch_failed on an exception) and starts three daemon threads. The inner with r.job(...) block emits the corresponding job_* events.

Reporter

argus.Reporter

Context manager for one experiment batch.

On __enter__ it emits batch_start and starts the heartbeat, stop-poll, and resource-snapshot daemon threads. On __exit__ it emits batch_done (or batch_failed if an exception bubbled up), stops the daemons, and drains the underlying event queue.
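
The enter/exit sequence described above can be illustrated with a toy context manager. This is a minimal sketch, not the Reporter implementation: MiniBatch and its list-based sink are hypothetical, the daemon threads and the event queue are left out, and it assumes the user's exception is re-raised after batch_failed is recorded.

```python
from typing import List


class MiniBatch:
    """Toy stand-in for the enter/exit event sequence (not the real Reporter)."""

    def __init__(self, sink: List[str]) -> None:
        self.sink = sink  # hypothetical event sink; the real SDK uses a queue

    def __enter__(self) -> "MiniBatch":
        self.sink.append("batch_start")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Pick the terminal event from whether an exception bubbled up.
        self.sink.append("batch_failed" if exc_type is not None else "batch_done")
        return False  # assumption: the user's exception is not swallowed


def run(fail: bool) -> List[str]:
    """Run an empty batch, optionally failing inside it, and return the events."""
    events: List[str] = []
    try:
        with MiniBatch(events):
            if fail:
                raise RuntimeError("boom")
    except RuntimeError:
        pass
    return events
```

Calling run(False) and run(True) shows the two terminal events side by side.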

Parameters:

Name Type Description Default
batch_prefix str

Prefix for the auto-generated batch id ("<prefix>-<12 hex>").

'batch'
experiment_type Optional[str]

Forwarded into the batch_start event.

None
source_project Optional[str]

Forwarded into the batch_start event.

None
command Optional[str]

Forwarded into the batch_start event.

None
n_total Optional[int]

Forwarded into the batch_start event.

None
heartbeat Union[bool, float]

Daemon-thread toggles. True enables with the default interval (300 / 10 / 30 s); False disables; a numeric value overrides the interval (in seconds).

True
stop_polling Union[bool, float]

Daemon-thread toggles. True enables with the default interval (300 / 10 / 30 s); False disables; a numeric value overrides the interval (in seconds).

True
resource_snapshot Union[bool, float]

Daemon-thread toggles. True enables with the default interval (300 / 10 / 30 s); False disables; a numeric value overrides the interval (in seconds).

True
monitor_url Optional[str]

Falls back to env ARGUS_URL or configs/monitor.yaml.

None
token Optional[str]

Falls back to env ARGUS_TOKEN.

None
auto_upload_dirs Optional[Iterable[Union[str, Path]]]

Optional list of directories whose matching files (by extension in {.png,.jpg,.pdf,.svg}) are uploaded as batch artifacts on clean exit. Default None = no auto upload.

None
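
The Union[bool, float] toggles (heartbeat, stop_polling, resource_snapshot) could be normalised along the following lines. This is a sketch under stated assumptions: resolve_interval is a hypothetical helper, not part of argus, and the handling of values such as 0 is not specified by this page.

```python
from typing import Optional, Union


def resolve_interval(value: Union[bool, float], default: float) -> Optional[float]:
    """Return the interval in seconds, or None when the daemon is disabled."""
    # bool is a subclass of int, so test for it before the numeric branch.
    if isinstance(value, bool):
        return default if value else None
    return float(value)
```

Checking for bool first matters: without it, True would be read as an interval of 1.0 seconds rather than "use the default".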

batch_id property

stopped property

True once the platform's stop button has fired (or local cancel).

job(job_id, *, model=None, dataset=None)

Return a JobContext, to be used as with r.job(...) as j:.

emit(event, **fields)

Direct emit — escape hatch for unusual events.

JobContext

argus.JobContext

Context manager for a single job inside a Reporter batch.

job_id property

stopped property

Inherits from the parent Reporter.

epoch(epoch, *, train_loss=None, val_loss=None, lr=None, batch_time_ms=None, **extra)

Emit one job_epoch event.

metrics(m)

Stash final metrics, surfaced when the job context exits.

log(message, level='INFO')

Emit a log_line event.

upload(path, *, glob='**/*.png')

Upload artifacts from a file or directory.

For a directory, glob selects which files to upload; default **/*.png. Files outside {.png,.jpg,.pdf,.svg} are skipped. Auto-skips when the monitor isn't reachable. Returns the number of files uploaded.
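
The two-stage selection described above (glob first, then the extension allow-list) might look like this. It is a sketch only: select_uploads is a hypothetical name, the exact case-sensitive suffix match is an assumption, and the upload itself and the reachability check are omitted.

```python
from pathlib import Path
from typing import List

ALLOWED_SUFFIXES = {".png", ".jpg", ".pdf", ".svg"}


def select_uploads(path: Path, glob: str = "**/*.png") -> List[Path]:
    """List the files an upload(path, glob=...) call would consider."""
    if path.is_file():
        candidates = [path]
    elif path.is_dir():
        # The glob narrows the directory walk; directories themselves are skipped.
        candidates = sorted(p for p in path.glob(glob) if p.is_file())
    else:
        candidates = []
    # Assumption: suffixes are matched exactly (case-sensitive).
    return [p for p in candidates if p.suffix in ALLOWED_SUFFIXES]
```

With the default glob only .png files are picked up; passing glob="**/*" widens the walk, but the allow-list still drops files such as .txt.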

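Resuming after a crash

Reporter accepts two new keyword arguments:

  • batch_id="…": sets the batch id explicitly (overrides the id auto-generated from batch_prefix).
  • resume_from="…": an alias for batch_id that signals the intent to resume.

A new exported function accompanies them, derive_batch_id(project, experiment_name, git_sha=None, *, prefix="bench"). It hashes the (project, experiment_name, git_sha) triple into a stable <prefix>-<16 hex> batch id. Restarting the same experiment from the same checkout lands on the same Batch row, because the backend's _handle_batch_start is idempotent for repeated batch_start events.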

from argus import Reporter, derive_batch_id

batch_id = derive_batch_id("my-bench", "dam_forecast")
with Reporter(batch_prefix="bench",
              source_project="my-bench",
              n_total=120,
              batch_id=batch_id) as r:
    ...

For the full workflow, see Batch identity and resume.

Module-level helper functions

argus.derive_batch_id(project, experiment_name, git_sha=None, *, prefix=_DEFAULT_PREFIX)

Return f"{prefix}-{16-hex}" derived from project + experiment + git.

Parameters:

Name Type Description Default
project str

Project / namespace name; usually cfg.monitor.project.

required
experiment_name str

The experiment-level identifier — for sibyl this is cfg.experiment_name (e.g. etth1_transformer); for a multi-experiment sweep, the launcher's batch tag.

required
git_sha Optional[str]

Caller-supplied commit hash. None (default) triggers a git rev-parse HEAD lookup; a string "" is treated as "no git" too. Pass an explicit string in tests for determinism.

None
prefix str

Leading token of the returned id (default "bench").

_DEFAULT_PREFIX

Returns:

Type Description
str

A reproducible id of the form "<prefix>-<16 hex>". Re-running the same command from the same checkout yields the same id, so the resumed events append to the existing Batch row instead of forking a new one.
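
To make the idea of a reproducible id concrete, here is a minimal sketch of hashing the triple and truncating to 16 hex characters. It is not the derive_batch_id source: sketch_batch_id is a hypothetical name, the NUL-joined serialisation is an assumption, and the git rev-parse HEAD lookup for git_sha=None is replaced by treating None like the empty string.

```python
import hashlib
from typing import Optional


def sketch_batch_id(project: str, experiment_name: str,
                    git_sha: Optional[str], prefix: str = "bench") -> str:
    """Hash the triple and keep the first 16 hex chars (64 bits)."""
    # Assumption: fields are joined with NUL so ("ab", "c") differs from ("a", "bc").
    key = "\0".join([project, experiment_name, git_sha or ""])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:16]}"
```

Because the output depends only on its inputs, two launches with the same project, experiment name, and commit produce the same id, while a new commit produces a different one.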

Notes

16 hex chars = 64 bits of SHA-256. That's plenty for collision avoidance across a single project's experiment space; we'd need ~4 billion concurrent batches before a clash becomes likely.
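
The "~4 billion" figure is the birthday bound for 64 bits, roughly 2^32 ids. The standard approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)) can be checked with a short helper; collision_probability is a name chosen for this illustration, and the formula assumes ids are uniformly distributed.

```python
import math


def collision_probability(n_ids: int, bits: int = 64) -> float:
    """Birthday approximation for at least one clash among n_ids random ids."""
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / 2.0 ** (bits + 1))
```

At n_ids = 2**32 the approximation gives about 0.39, and at one million ids it is below 1e-7.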

argus.new_batch_id(prefix='batch')

Return a fresh <prefix>-<12 hex> batch id.
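
For comparison with the derived ids, a fresh id of the <prefix>-<12 hex> form could be produced as below. This is only a sketch: sketch_new_batch_id is hypothetical, and using uuid4 as the randomness source is an assumption about how such an id might be generated.

```python
import uuid


def sketch_new_batch_id(prefix: str = "batch") -> str:
    """Return "<prefix>-<12 hex>" built from random bits."""
    # Assumption: uuid4 supplies the randomness; the SDK may use another source.
    return f"{prefix}-{uuid.uuid4().hex[:12]}"
```

Unlike a derived id, each call yields a different value, so it cannot be used to land on an existing Batch row.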

argus.set_batch_id(batch_id)

Set the process-wide current batch id (used by emit).

argus.get_batch_id()

Return the process-wide current batch id, or None.

argus.emit(event, **fields)

Module-level escape hatch — direct emit on the active reporter.

No-op when no Reporter is active. Safe to call from any thread; never raises.

argus.sub_env(template, **extra)

Tiny ${VAR} / $VAR substitution helper.

${ARGUS_URL} -> os.environ['ARGUS_URL']. Missing keys are left as the literal placeholder. Used by configs that want to interpolate env-var values without pulling in OmegaConf.
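
The behaviour described here is close to what the standard library's string.Template.safe_substitute provides, so a rough equivalent can be sketched with it. sketch_sub_env is a hypothetical name, not the argus function, and letting the extra keywords take precedence over the environment is an assumption.

```python
import os
from string import Template


def sketch_sub_env(template: str, **extra: str) -> str:
    """Expand ${VAR} / $VAR from os.environ, with extra taking precedence."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template).safe_substitute(os.environ, **extra)
```

One difference to keep in mind: Template treats $$ as an escaped dollar sign, which the helper documented above may or may not do.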

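Environment variables

| Variable | Purpose | Default |
| --- | --- | --- |
| ARGUS_URL | Platform base URL, e.g. http://localhost:8000 | unset → no-op mode |
| ARGUS_TOKEN | Reporter-scoped API token (em_live_…) | unset |
| ARGUS_DISABLE | Set to 1 / true to turn off event reporting entirely | 0 |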

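In no-op mode the SDK silently swallows every emit. This is exactly the behavior you want when leaving instrumentation in a shared training script: it will not crash on machines that have no dashboard.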
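
The decision between reporting and no-op mode can be pictured as a small predicate over the environment. This is a sketch, not SDK code: reporting_enabled is a hypothetical helper, the case-insensitive comparison is an assumption, and the monitor_url argument and configs/monitor.yaml fallback are ignored.

```python
from typing import Mapping


def reporting_enabled(env: Mapping[str, str]) -> bool:
    """True when events would be sent; False means no-op mode."""
    # Assumption: the flag is compared case-insensitively after trimming.
    if env.get("ARGUS_DISABLE", "0").strip().lower() in {"1", "true"}:
        return False
    # An unset (or empty) ARGUS_URL also means no-op mode.
    return bool(env.get("ARGUS_URL"))
```

In practice it would be called with os.environ; taking a plain mapping keeps the sketch easy to test.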

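Install extras

| Package | Contents |
| --- | --- |
| argus-reporter | Base SDK |
| argus-reporter[lightning] | + PyTorch Lightning callback |
| argus-reporter[keras] | + Keras callback |
| argus-reporter[hydra] | + Hydra callback |
| argus-reporter[all-integrations] | All of the above |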

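Error handling

No public method raises. Network errors are logged at DEBUG level and the event is queued for retry. If both the queue and the local spill file are full, events are dropped: training comes first, and the SDK must never block user code.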
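
The queue, then spill, then drop policy can be sketched with a bounded queue and non-blocking puts. Everything here is illustrative: DropOnFullBuffer is a hypothetical class, an in-memory list stands in for the spill file, and the sizes are arbitrary.

```python
import queue
from typing import Any, Dict, List


class DropOnFullBuffer:
    """Non-blocking event buffer: memory queue, then bounded spill, then drop."""

    def __init__(self, queue_size: int, spill_size: int) -> None:
        # Note: queue.Queue treats maxsize <= 0 as unbounded.
        self.q: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=queue_size)
        self.spill: List[Dict[str, Any]] = []  # stand-in for the spill file
        self.spill_size = spill_size
        self.dropped = 0

    def offer(self, event: Dict[str, Any]) -> bool:
        """Store the event if there is room; never block and never raise."""
        try:
            self.q.put_nowait(event)
            return True
        except queue.Full:
            pass
        if len(self.spill) < self.spill_size:
            self.spill.append(event)
            return True
        self.dropped += 1  # training continues; the event is lost
        return False
```

The key design choice is put_nowait: a blocking put would stall the training loop whenever the network is slow, which is exactly what the policy above rules out.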

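Thread safety

Reporter and JobContext are not thread-safe across user threads. The daemon threads they start are thread-safe with respect to one another. If you need to emit events from multiple worker threads, add your own lock or use the module-level emit(...) helper.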
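
One way to add your own lock is a thin wrapper that serialises calls into the job object. This is a sketch under stated assumptions: LockedJob and FakeJob are hypothetical, FakeJob only stands in for a JobContext so that the example runs without a monitor, and only epoch is wrapped.

```python
import threading
from typing import Any, List


class LockedJob:
    """Serialize calls into a job-like object that is not thread-safe."""

    def __init__(self, job: Any) -> None:
        self._job = job
        self._lock = threading.Lock()

    def epoch(self, epoch: int, **fields: Any) -> None:
        with self._lock:
            self._job.epoch(epoch, **fields)


class FakeJob:
    """Hypothetical stand-in for JobContext, used only for this demo."""

    def __init__(self) -> None:
        self.seen: List[int] = []

    def epoch(self, epoch: int, **fields: Any) -> None:
        self.seen.append(epoch)


def demo(n_threads: int = 4, per_thread: int = 50) -> List[int]:
    """Emit from several worker threads through one shared lock."""
    inner = FakeJob()
    job = LockedJob(inner)

    def worker(base: int) -> None:
        for i in range(per_thread):
            job.epoch(base + i, train_loss=0.0)

    threads = [threading.Thread(target=worker, args=(t * per_thread,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(inner.seen)
```

With a real job, the wrapper would be created once per JobContext and shared by all worker threads.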