Calibration by subgroup
Each point is one subgroup. Horizontal axis = model's mean predicted retention for that group; vertical axis = weighted observed rate. Dots on the 45° line are perfectly calibrated. Gaps above or below are honest miscalibration — disclosed, not corrected.
| Dimension | Group | n | mean pred | observed | gap |
|---|
Decision curve — net benefit vs threshold
Pick an operating threshold t. The chart shows model net benefit at every threshold between 0.50 and 0.90, plus the treat-all reference (the prevalence ceiling). Wherever the model's curve sits below treat-all, the model adds nothing over selecting everyone; wherever it sits above, selection is beneficial at that threshold.
Net benefit is a weighted combination of true-positive rate and false-positive rate at a given threshold, expressed in the same units as prevalence. Under a synthetic target the absolute scale should be read directionally, not as an operational guide.
Disparate-impact screen — selection rates & four-fifths rule
Pick a protected dimension and a decision threshold. Bars show the weighted selection rate (share of group predicted above threshold) per subgroup. A subgroup passes the four-fifths rule if its selection rate is at least 80% of the highest-rate group on that dimension.
| Group | selection rate | ratio to max | four-fifths |
|---|
How to read this page
The three panels are not independent — they describe different faces of the same small-effect-size model. A few interpretive hooks:
- Calibration without discrimination is real. The AUC is modest (test AUC ~0.55–0.57), but the calibration scatter hugs the 45° line for most subgroups. The model is well-calibrated even where it cannot rank individuals sharply.
- Decision curves translate AUC into action. If the model's net benefit sits below treat-all at your operating threshold, the model adds nothing over selecting everyone in the cohort at that threshold. That is a scope check, not a negative finding.
- The four-fifths screen catches disability-band disparities first. This is expected — the model captures a large retention penalty for higher DRAT ratings. The screen is published so any future reuse of the scores carries that knowledge on the record.
Artifacts behind this page are in reports/phase6_v17/ (stratified) and reports/phase6_v17_temporal/ (temporal). Regeneration scripts: scripts/fit_retention_model_v2.py and scripts/validate_phase6_v17.py.