cents eval

The golden sets ship with 302 premise fixtures and 380 sentiment fixtures (at least 6 per EVENT_TAG, prompt-injection negative controls, and paired same-headline-opposite-direction fixtures). Output includes bootstrap CIs by default (1000 samples, deterministic at seed=17): text output renders as F1: 0.85 [0.81, 0.89], and JSON output exposes *_ci keys.
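
To make the bootstrap CI concrete, here is a minimal sketch of a seeded percentile bootstrap over F1. It is an illustration only, not the harness's actual code: the binary-label F1, the percentile method, the 95% level, and the names `f1_score` and `bootstrap_ci` are all assumptions. Only the 1000 samples and seed=17 come from the text above.

```python
import random


def f1_score(pairs):
    """F1 over (gold, predicted) boolean pairs; 0.0 when undefined."""
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pairs, n_samples=1000, seed=17, alpha=0.05):
    """Percentile bootstrap CI (assumed method); a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    n = len(pairs)
    # Resample the fixtures with replacement and recompute F1 each time.
    stats = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    lo = stats[int((alpha / 2) * n_samples)]
    hi = stats[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return lo, hi


# Synthetic results: 40 TP, 8 FN, 6 FP, 46 TN (made-up numbers).
pairs = (
    [(True, True)] * 40
    + [(True, False)] * 8
    + [(False, True)] * 6
    + [(False, False)] * 46
)
point = f1_score(pairs)
lo, hi = bootstrap_ci(pairs)
print(f"F1: {point:.2f} [{lo:.2f}, {hi:.2f}]")
```

Because the random generator is seeded, two runs over the same fixtures produce the same interval, which is what makes CI values comparable across days.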

cents eval drift-check reads the trailing 7 days from ~/.cents/data/eval_history/ and fires an AlertType.MODEL_DRIFT alert when today’s F1 has fallen more than 5pp below the trailing-7 median. This is cheap insurance against silent classifier regressions when the upstream Haiku snapshot changes. Run it after cents eval run --persist-history so that the trailing window includes today’s data point.

Terminal window
cents eval run --persist-history && cents eval drift-check
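
The drift rule described above can be sketched in a few lines. This is a hedged illustration, not the tool's implementation: it assumes F1 is stored as a fraction in [0, 1], that the last history row is today's value (as when `--persist-history` ran first), and the function name `drift_check` and its return shape are hypothetical.

```python
from statistics import median


def drift_check(f1_history, threshold_pp=5.0, window=7):
    """Return (drifted, trailing_median, drop_pp) for the newest row.

    f1_history is assumed to be ordered oldest to newest, with today's
    value last, and to hold F1 as a fraction between 0 and 1.
    """
    trailing = f1_history[-window:]
    if not trailing:
        raise ValueError("no eval history to compare against")
    today = trailing[-1]
    baseline = median(trailing)
    # Convert the fractional drop to percentage points.
    drop_pp = (baseline - today) * 100.0
    return drop_pp > threshold_pp, baseline, drop_pp


# Made-up history: six stable days, then a sharp drop today.
regressed = [0.86, 0.85, 0.87, 0.86, 0.85, 0.86, 0.78]
stable = [0.86, 0.85, 0.87, 0.86, 0.85, 0.86, 0.84]

print(drift_check(regressed))
print(drift_check(stable))
```

Using the median rather than the mean keeps a single earlier outlier day from shifting the reference point.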

Run the LLM eval harness against golden sets.

Terminal window
cents eval <subcommand> [OPTIONS] [ARGS]...
  • cents eval calibrate-thresholds — Search the threshold grid for the pair that maximises band-accuracy.
  • cents eval drift-check — Compare today’s premise_f1 to the trailing-window median; fire a MODEL_DRIFT alert on regression.
  • cents eval golden — Inspect the golden fixture sets.
  • cents eval run — Run the eval against the live Anthropic API.

Search the threshold grid for the pair that maximises band-accuracy.

Hits the live Anthropic API once to score every sentiment-golden fixture. Tests bypass this by injecting synthetic fixtures into calibrate_thresholds() directly.
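
As a rough picture of what a threshold grid search looks like, the sketch below scores every (low, high) pair on a grid against synthetic fixtures and keeps the pair with the highest band-accuracy. Everything here is assumed for illustration: the three band names, the score range, the fixture shape, and the helper names `band`, `band_accuracy`, and `search_thresholds`. The real `calibrate_thresholds()` signature is not shown in this page.

```python
from itertools import product


def band(score, lo, hi):
    """Map a sentiment score to a band using two cut points (assumed scheme)."""
    if score <= lo:
        return "negative"
    if score >= hi:
        return "positive"
    return "neutral"


def band_accuracy(fixtures, lo, hi):
    """Fraction of (score, expected_band) fixtures that land in the expected band."""
    hits = sum(1 for score, expected in fixtures if band(score, lo, hi) == expected)
    return hits / len(fixtures)


def search_thresholds(fixtures, grid):
    """Exhaustively try every lo < hi pair; the first best pair wins ties."""
    best = None
    for lo, hi in product(grid, repeat=2):
        if lo >= hi:
            continue
        acc = band_accuracy(fixtures, lo, hi)
        if best is None or acc > best[0]:
            best = (acc, lo, hi)
    return best


# Synthetic fixtures (made up), in the spirit of the injected test fixtures.
fixtures = [
    (-0.8, "negative"), (-0.5, "negative"),
    (-0.1, "neutral"), (0.0, "neutral"), (0.2, "neutral"),
    (0.6, "positive"), (0.9, "positive"),
]
grid = [-0.6, -0.3, 0.0, 0.3, 0.6]
best = search_thresholds(fixtures, grid)
print(best)
```

Scoring each fixture once and then sweeping the grid offline is what keeps the search down to a single pass over the API.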

Synopsis

Terminal window
cents eval calibrate-thresholds [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--output/-o [text\|json]` | [text \| json] | | |
| `--limit INTEGER` | integer | | Cap fixtures (handy for smoke-testing). |
| `--dry-run` | boolean | false | Print the recommended thresholds without writing thresholds.json. |

Example

Terminal window
cents eval calibrate-thresholds --limit 20 --dry-run

Compare today’s premise_f1 to the trailing-window median; fire a MODEL_DRIFT alert on regression.

Synopsis

Terminal window
cents eval drift-check [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--threshold-pp FLOAT` | float | 5.0 | Drift threshold in percentage points (default: 5). |
| `--window INTEGER` | integer | 7 | Trailing rows considered for the median (default: 7). |
| `--output/-o [text\|json]` | [text \| json] | | |

Example

Terminal window
cents eval drift-check --threshold-pp 3 --window 14 --output json

Inspect the golden fixture sets.

Synopsis

Terminal window
cents eval golden

Example

Terminal window
cents eval golden

Run the eval against the live Anthropic API.

Synopsis

Terminal window
cents eval run [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--set [premise\|sentiment\|all]` | [premise \| sentiment \| all] | | |
| `--limit INTEGER` | integer | | Cap fixtures per set (handy for smoke-testing). |
| `--output/-o [text\|json]` | [text \| json] | | |
| `--gate` | boolean | false | Compare to baseline.json; exit non-zero on regression beyond tolerance. |
| `--baseline-f1 FLOAT` | float | | Override baseline F1 for one-off gating (otherwise reads baseline.json). |
| `--baseline-brier FLOAT` | float | | Override baseline Brier for one-off gating (otherwise reads baseline.json). |
| `--tolerance-pp FLOAT` | float | 5.0 | Allowed metric drop in percentage points before `--gate` fails (default 5). |
| `--persist-baseline` | boolean | false | Write today’s metrics to baseline.json and stamp locked_at. |
| `--persist-history` | boolean | false | Append today’s metrics to ~/.cents/data/eval_history/YYYY-MM-DD.jsonl. |
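
The gating options above boil down to comparing today's metrics with a baseline and failing when a metric regresses by more than the tolerance. The sketch below is one plausible reading, not the harness's code: the dictionary keys `f1` and `brier`, the function name `gate`, and the treatment of Brier as lower-is-better (so a rise counts as the regression) are assumptions.

```python
def gate(current, baseline, tolerance_pp=5.0):
    """Return a list of human-readable failures; an empty list means the gate passes.

    Both arguments are assumed to be dicts with fractional "f1" and "brier" values.
    """
    failures = []

    # F1: higher is better, so a regression is a drop.
    f1_drop_pp = (baseline["f1"] - current["f1"]) * 100.0
    if f1_drop_pp > tolerance_pp:
        failures.append(f"F1 dropped {f1_drop_pp:.1f}pp (tolerance {tolerance_pp}pp)")

    # Brier: lower is better, so a regression is a rise (assumed convention).
    brier_rise_pp = (current["brier"] - baseline["brier"]) * 100.0
    if brier_rise_pp > tolerance_pp:
        failures.append(f"Brier rose {brier_rise_pp:.1f}pp (tolerance {tolerance_pp}pp)")

    return failures


# Made-up metrics for illustration.
baseline = {"f1": 0.85, "brier": 0.18}
regressed = {"f1": 0.78, "brier": 0.20}
healthy = {"f1": 0.83, "brier": 0.19}

print(gate(regressed, baseline))
print(gate(healthy, baseline))
```

A caller would typically turn a non-empty failure list into a non-zero exit status so that CI stops on a regression.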

Example

Terminal window
cents eval run --set premise --limit 20 --output json

Not financial advice. Cents is an educational and research tool for tracking your own investment theses. Outputs are model-generated and may be inaccurate. You are solely responsible for your own investment decisions.