cents eval
The golden sets ship at 302 premise fixtures and 380 sentiment fixtures
(>=6 per EVENT_TAG, prompt-injection negative controls, paired
same-headline-opposite-direction fixtures). Output includes bootstrap
confidence intervals by default (1000 samples, deterministic at seed=17):
text output renders them as F1: 0.85 [0.81, 0.89], and JSON output exposes
them under *_ci keys.
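
As a rough illustration of how a seeded bootstrap interval of this kind can be computed, here is a minimal sketch. It is not the harness's own code: the helper names `f1_score` and `bootstrap_ci`, the percentile method, the 95% level, and the toy fixture data are all assumptions made for the example. Only the sample count (1000) and the seed (17) come from the text above.

```python
import random


def f1_score(pairs):
    """Binary F1 over (predicted, expected) boolean pairs."""
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pairs, n_samples=1000, seed=17, alpha=0.05):
    """Percentile bootstrap interval for F1 (hypothetical helper).

    A private Random instance with a fixed seed makes the interval
    identical on every run.
    """
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    lo = stats[int((alpha / 2) * n_samples)]
    hi = stats[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return lo, hi


# Toy data (assumed): 17 true positives, 3 false positives,
# 3 false negatives, 17 true negatives.
pairs = (
    [(True, True)] * 17
    + [(True, False)] * 3
    + [(False, True)] * 3
    + [(False, False)] * 17
)
point = f1_score(pairs)
lo, hi = bootstrap_ci(pairs)
print(f"F1: {point:.2f} [{lo:.2f}, {hi:.2f}]")
```

Because the generator is seeded, rerunning the sketch on the same fixtures reproduces the same interval, which is the property the default seed is there to provide.
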
cents eval drift-check reads the trailing 7 days from
~/.cents/data/eval_history/ and fires an AlertType.MODEL_DRIFT alert
when today’s F1 has fallen more than 5 percentage points below the
trailing-7 median. This is cheap insurance against silent classifier
regressions when the upstream Haiku snapshot changes. Run it after
cents eval run --persist-history so that the trailing window includes
today’s data point.
cents eval run --persist-history && cents eval drift-check
Run the LLM eval harness against golden sets.
Synopsis
cents eval <subcommand> [OPTIONS] [ARGS]...
Subcommands
- `cents eval calibrate-thresholds`: Search the threshold grid for the pair that maximises band-accuracy.
- `cents eval drift-check`: Compare today’s premise_f1 to the trailing-window median; fire a MODEL_DRIFT alert on regression.
- `cents eval golden`: Inspect the golden fixture sets.
- `cents eval run`: Run the eval against the live Anthropic API.
cents eval calibrate-thresholds
Search the threshold grid for the pair that maximises band-accuracy.
Hits the live Anthropic API once to score every sentiment-golden fixture.
Tests bypass this by injecting synthetic fixtures into
calibrate_thresholds() directly.
Synopsis
cents eval calibrate-thresholds [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| `--output/-o [text\|json]` | text \| json | | |
| `--limit INTEGER` | integer | | Cap fixtures (handy for smoke-testing). |
| `--dry-run` | boolean | false | Print the recommended thresholds without writing thresholds.json. |
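
To make "search the threshold grid for the pair that maximises band-accuracy" concrete, here is a minimal sketch of an exhaustive pair search. Everything in it is an assumption for illustration: the band names, the score range, the grid, the fixture shape, and the helper names `band` and `search_threshold_pair`. It does not reproduce the signature or behaviour of the real `calibrate_thresholds()`.

```python
from itertools import product


def band(score, lo, hi):
    """Map a numeric score to a band using a (lo, hi) threshold pair.

    Band names are hypothetical placeholders.
    """
    if score <= lo:
        return "bearish"
    if score >= hi:
        return "bullish"
    return "neutral"


def search_threshold_pair(fixtures, grid):
    """Try every lo < hi pair on the grid; keep the first most accurate one."""
    best = None
    for lo, hi in product(grid, grid):
        if lo >= hi:
            continue
        correct = sum(band(score, lo, hi) == expected for score, expected in fixtures)
        accuracy = correct / len(fixtures)
        if best is None or accuracy > best[0]:
            best = (accuracy, lo, hi)
    return best


# Synthetic fixtures (assumed shape): (model score, expected band).
fixtures = [
    (-0.8, "bearish"),
    (-0.4, "bearish"),
    (-0.1, "neutral"),
    (0.0, "neutral"),
    (0.15, "neutral"),
    (0.5, "bullish"),
    (0.9, "bullish"),
]
grid = [x / 10 for x in range(-9, 10)]  # -0.9 .. 0.9 in steps of 0.1
accuracy, lo, hi = search_threshold_pair(fixtures, grid)
print(f"band-accuracy={accuracy:.2f} lo={lo} hi={hi}")
```

Feeding synthetic fixtures straight into the search function, as this sketch does, mirrors the way the tests are described as bypassing the live API call.
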
Example
cents eval calibrate-thresholds [OPTIONS]
cents eval drift-check
Compare today’s premise_f1 to the trailing-window median; fire a MODEL_DRIFT alert on regression.
Synopsis
cents eval drift-check [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| `--threshold-pp FLOAT` | float | 5.0 | Drift threshold in percentage points (default: 5). |
| `--window INTEGER` | integer | 7 | Trailing rows considered for the median (default: 7). |
| `--output/-o [text\|json]` | text \| json | | |
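
The comparison that `--threshold-pp` and `--window` control can be sketched roughly as below. This is an illustrative reading of the rule, not the shipped implementation: the function name `drift_check`, the plain list of F1 values, and the choice to include today's row in the median window are assumptions. The defaults (5.0 and 7) come from the options table.

```python
from statistics import median


def drift_check(f1_history, threshold_pp=5.0, window=7):
    """Return (drifted, drop_pp) for a chronological list of F1 values.

    Assumes values are in [0, 1], the last entry is today's, and the
    median is taken over the trailing `window` rows including today.
    """
    rows = f1_history[-window:]
    today = rows[-1]
    drop_pp = (median(rows) - today) * 100  # drop in percentage points
    return drop_pp > threshold_pp, drop_pp


# Hypothetical histories: one with a sharp drop today, one stable.
regressed = [0.86, 0.85, 0.87, 0.85, 0.86, 0.84, 0.78]
stable = [0.86, 0.85, 0.87, 0.85, 0.86, 0.84, 0.83]

drifted, drop_pp = drift_check(regressed)
stable_drifted, stable_drop_pp = drift_check(stable)
print(drifted, round(drop_pp, 1))
print(stable_drifted, round(stable_drop_pp, 1))
```

Using the median rather than the mean keeps a single earlier outlier in the window from shifting the reference point much.
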
Example
cents eval drift-check [OPTIONS]
cents eval golden
Inspect the golden fixture sets.
Synopsis
cents eval golden
Example
cents eval golden
cents eval run
Run the eval against the live Anthropic API.
Synopsis
cents eval run [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| `--set [premise\|sentiment\|all]` | premise \| sentiment \| all | | |
| `--limit INTEGER` | integer | | Cap fixtures per set (handy for smoke-testing). |
| `--output/-o [text\|json]` | text \| json | | |
| `--gate` | boolean | false | Compare to baseline.json; exit non-zero on regression beyond tolerance. |
| `--baseline-f1 FLOAT` | float | | Override baseline F1 for one-off gating (otherwise reads baseline.json). |
| `--baseline-brier FLOAT` | float | | Override baseline Brier for one-off gating (otherwise reads baseline.json). |
| `--tolerance-pp FLOAT` | float | 5.0 | Allowed metric drop in percentage points before `--gate` fails (default 5). |
| `--persist-baseline` | boolean | false | Write today’s metrics to baseline.json and stamp locked_at. |
| `--persist-history` | boolean | false | Append today’s metrics to ~/.cents/data/eval_history/YYYY-MM-DD.jsonl. |
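
One plausible reading of the `--gate` rule is sketched below: fail when a metric has regressed past the baseline by more than the tolerance. The dictionary keys `f1` and `brier`, the function name `gate_failures`, and the treatment of Brier score (lower is better, so a rise beyond tolerance counts as the regression) are assumptions. The actual baseline.json schema is not shown on this page.

```python
def gate_failures(current, baseline, tolerance_pp=5.0):
    """Return human-readable failures; an empty list means the gate passes.

    Assumes metrics are fractions in [0, 1]. F1 regresses by falling,
    Brier by rising (an assumption about how the gate treats it).
    """
    tol = tolerance_pp / 100
    failures = []
    if current["f1"] < baseline["f1"] - tol:
        failures.append(f"f1 {current['f1']:.3f} < baseline {baseline['f1']:.3f} - {tol:.2f}")
    if current["brier"] > baseline["brier"] + tol:
        failures.append(f"brier {current['brier']:.3f} > baseline {baseline['brier']:.3f} + {tol:.2f}")
    return failures


# Hypothetical metrics for illustration only.
baseline = {"f1": 0.85, "brier": 0.18}
failing = gate_failures({"f1": 0.79, "brier": 0.19}, baseline)
passing = gate_failures({"f1": 0.83, "brier": 0.19}, baseline)

exit_code = 1 if failing else 0  # non-zero exit on regression
print(failing, exit_code)
print(passing)
```

Under this reading, `--baseline-f1` and `--baseline-brier` would simply replace the corresponding baseline values for a single run before the comparison is made.
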
Example
cents eval run [OPTIONS]