Unbiased Prevalence Estimation with Multicalibrated LLMs
Fridolin Linder¹
Thomas Leeper¹
Daniel Haimovich¹
Niek Tax¹
Lorenzo Perini¹
Milan Vojnovic¹,²
¹Meta Platforms Inc., ²The London School of Economics and Political Science
Corresponding author: Fridolin Linder (flinder@meta.com)
Classification: Social Sciences / Political Science; Physical Sciences / Statistics
Keywords: multicalibration | large language models | prevalence estimation | covariate shift | quantification
Significance
Large language models are increasingly used as measurement devices to estimate prevalence in populations. A critical but overlooked problem arises when the target population differs from the validation population:
standard methods produce biased prevalence estimates, even when the model achieves high classification
accuracy.
We show that multicalibration, which requires a device to be accurate conditional on input features rather than just on average, is sufficient for unbiased prevalence estimation under covariate shift. Our theoretical and empirical results imply that the rapidly growing body of LLM-based measurement
research is vulnerable to systematic bias that can be mitigated by enforcing multicalibration.
Abstract
Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic
tests, classifiers, or large language models) is fundamental to science, public health, and online trust and
safety. Standard approaches correct for known device error rates but assume these rates remain stable
across populations.
We show this assumption fails under covariate shift and that multicalibration, which
enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased
prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this
guarantee.
Our work connects recent theoretical work on fairness to a longstanding measurement problem
spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias
growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias.
While we focus
the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical
applications—estimating employment prevalence across U.S.
states using the American Community Survey,
and classifying political texts across four countries using an LLM—demonstrate that multicalibration
substantially reduces bias in practice, while highlighting that calibration data should cover the key feature
dimensions along which target populations may differ.
Introduction
Large language models (LLMs) are increasingly used as measurement devices for estimating the prevalence
of a category in a population.
Researchers now routinely deploy LLMs as zero-shot classifiers to estimate
the frequency of phenomena that previously required expensive manual annotation: coding democracy
indicators across countries (Weidmann et al. 2026), classifying protest events in news corpora (Overos
et al.
2024), estimating party policy positions from manifestos across dozens of countries and languages
(Benoit et al. 2026), categorizing open-ended survey responses at near-human accuracy (Mellon et al.
2024; Gilardi et al. 2023), extracting diagnostic attributes from pathology reports (Sushil et al.
2024),
identifying goals-of-care discussions in clinical notes (Lee et al. 2025), annotating art forms in auction
records (Tojima and Yoshida 2025), and converting qualitative text into quantitative variables across
multiple languages (Karjus 2025).
LLMs are also being deployed for content moderation¹ and as LLM
“judges” to assess the quality and safety of AI systems (Zheng et al. 2023; Yuan et al.
2024). These applications are typically validated by reporting discriminative performance: accuracy, AUC, or
agreement with human annotations.
Researchers then use or advocate for using validated LLMs as
measurement devices across new populations, time periods, or subgroups. But strong discriminative
performance does not guarantee accurate prevalence estimates from the classifications provided by these
devices.
When the composition of the target population differs from the validation population, a model
that separates positives from negatives well can still produce prevalence estimates that are substantially
biased, because discrimination metrics are insensitive to the calibration errors that drive prevalence bias. This
problem is often missed in standard validation practice and, as we show, can produce large bias even with
strong classification performance.
¹ https://openai.com/index/using-gpt-4-for-content-moderation/
Neither the confidence elicitation literature nor the quantification² literature solves this problem. Confidence
elicitation methods — verbalized confidence (Tian et al.
2023), token log-probabilities (Kadavath et al. 2022), consistency sampling (Wang et al.
2023) — replace the LLM’s bare binary classification with a
probability score, and post-hoc calibration (Oliveira et al. 2025) can further refine those scores.
Both target
global calibration: calibration on the population where they are evaluated. Empirical evidence confirms
that such calibration does not transfer: it degrades 2–3× under language shift even when accuracy is
preserved (Yang et al.
2023), and varies widely across tasks and domains (Ren et al. 2025).
Quantification
methods such as “Classify & Count”, “Adjusted Count”, and the Saerens-Latinne-Decaestecker (SLD) EM
algorithm (see González et al. (2017) for a detailed survey) attempt to correct for classifier errors directly,
but rely on the assumption that error properties remain static across populations.
Importance-weighted
methods handle covariate shift in principle but require re-estimating density ratios for each new target
population. None of these approaches provides a measurement device that can be validated once and then
reliably applied to new populations.
We show that multicalibration — calibration conditional on the input features, not just on average — fills
this gap. Building on Kim et al.
(2022)’s “Universal Adaptability” result, we show that a multicalibrated
device requires no target-specific estimation: it is calibrated once on source data and produces unbiased
prevalence estimates on any target population whose features lie within the calibration support. This
property is critical for LLM-based measurement, where a single device is typically applied to many
populations without the opportunity to re-calibrate for each one.
Multicalibration can be applied to
any LLM output — from discrete Yes/No classifications (the standard practice in applied research) to
continuous confidence scores — because it operates on the scores’ relationship to outcomes conditional
on features, not on the scores’ intrinsic quality. Our results apply more generally to any model-based
measurement device, but the LLM setting is where the practical need is most urgent and the calibration
problem most overlooked.
We review existing approaches to the quantification problem, connect the quantification task to the broader
theory of domain adaptation via multicalibration, and apply those insights to two empirical applications:
a controlled demonstration using a standard ML classifier to estimate employment prevalence under age
distribution shift (American Community Survey), followed by the main application of classifying political
texts across four countries using an LLM as a zero-shot measurement device.
Results
Standard calibration methods fail under covariate shift
Consider a binary outcome Y ∈{0, 1}, features X, and a device h(X) ∈[0, 1] producing probabilistic
predictions.
The goal is to estimate population prevalence π = P(Y = 1) in a target population using
only unlabeled target data {Xi} and the device h. We focus on covariate shift (Storkey 2009), where
P(X) changes across populations but P(Y | X) remains stable.
This is the natural assumption when
features causally influence outcomes (X →Y ), as in many measurement applications. See Limitations for
discussion of label shift (Y →X) and concept drift.
A device is globally calibrated if E[Y | h(X) = p] = p for all prediction values p. Under global calibration,
E[h(X)] = E[Y ] = π, so the sample mean of predictions is an unbiased prevalence estimate.
However, global calibration is a property of a specific population. A device calibrated on one population need not
be calibrated on another, even under covariate shift with stable P(Y | X).
² The task of estimating prevalence from imperfect classifiers is known as quantification in the machine learning literature; we use this term interchangeably with prevalence estimation.
The failure mechanism is as follows.
Suppose the population consists of subgroups $G$ with weights $w_G$. Within each subgroup, the device may be biased: let $\epsilon_G = E[Y - h(X) \mid X \in G]$ denote the mean prediction error within group $G$. Global calibration requires only that these errors cancel on average: $\sum_G w_G \epsilon_G = 0$. Under covariate shift, the group weights change to $w^*_G$, and the bias in the prevalence estimate becomes $\sum_G w^*_G \epsilon_G$, which is generally nonzero unless $\epsilon_G = 0$ for all $G$. A device can be perfectly globally calibrated, with errors that precisely cancel in the training population, while having nonzero mean prediction error within every individual subgroup.
Standard quantification methods are each vulnerable to this failure mechanism (González et al. 2017). Classify & Count, Rogan-Gladen adjustment (Rogan and Gladen 1978), and Probabilistic Adjusted Classify
& Count (PACC) all rely on error rates or conditional score means estimated from calibration data; these
are weighted averages over subgroups and shift with population composition.
The SLD/EMQ algorithm
(Saerens et al. 2002) and more recent distribution-matching methods such as DyS and HDy (Maletzke et
al.
2018) assume label shift (P(X | Y ) stable) and fail under covariate shift. Even global calibration via
isotonic regression, recommended by Wu and Resnick (2024) for covariate shift settings, produces biased
prevalence estimates because global calibration does not guarantee feature-conditional accuracy.
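
The cancellation mechanism described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the two subgroups, their weights, and the within-group errors are made-up numbers rather than values from our experiments, and `weighted_error` is a hypothetical helper name.

```python
# Hypothetical two-subgroup example; every number here is an illustrative assumption.
source_weights = {"A": 0.5, "B": 0.5}    # w_G in the calibration population
target_weights = {"A": 0.9, "B": 0.1}    # w*_G after covariate shift
group_error = {"A": 0.05, "B": -0.05}    # eps_G = E[Y - h(X) | X in G]


def weighted_error(weights, errors):
    """Return sum_G w_G * eps_G, i.e. true prevalence minus mean prediction."""
    return sum(weights[g] * errors[g] for g in weights)


# The within-group errors cancel exactly under the source weights ...
source_bias = weighted_error(source_weights, group_error)
# ... but no longer cancel once the group weights change.
target_bias = weighted_error(target_weights, group_error)

print(f"source: {source_bias:+.3f}, target: {target_bias:+.3f}")
```

The device in this toy example has zero net error on the source population even though it is wrong in both subgroups, and the same device is off by four percentage points once subgroup A dominates the target.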
Multicalibration guarantees unbiased prevalence estimation
The analysis above shows that all standard methods produce biased prevalence estimates under covariate shift, with bias $\sum_G w^*_G \epsilon_G$ that is generally nonzero unless every $\epsilon_G = 0$. Multicalibration (Hébert-Johnson et al. 2018) formalizes this requirement. A predictor f(X) is multicalibrated with respect to a collection of subgroups $\mathcal{G}$ if E[Y | f(X) = v, X ∈ G] = v for every G ∈ $\mathcal{G}$ and
prediction value v.
When G is rich enough to capture all relevant structure in X, this is equivalent to
requiring calibration conditional on the full feature space:
E[Y | f(X) = v, X = x] = v
for all x and v with sufficient probability mass. This is the framing of Kim et al.
(2022)’s “Universal
Adaptability”: a predictor calibrated conditional on X yields correct expected values under any reweighting
of P(X), without requiring knowledge of the shift. The connection to prevalence estimation is direct.
If f is calibrated conditional on X, then E[f(X) | X = x] = E[Y | X = x] for all x. Under covariate shift, where only P(X) changes while P(Y | X) remains stable, the law of iterated expectations gives (where $E^*$ and $P^*$ denote expectations and probabilities under the target distribution, and $\pi^* = P^*(Y = 1)$ is the target prevalence):
$$E^*[f(X)] = \int E[f(X) \mid X = x] \, dP^*(x) = \int E[Y \mid X = x] \, dP^*(x) = E^*[Y] = \pi^*.$$
Because predictions are correct at each point in the feature space, the average of predictions tracks the
true prevalence under any change in P(X). The condition strictly necessary for this robustness is multi-accuracy: E[f(X)−Y | X ∈G] = 0 for all groups
G that the shift can reweight.
Multicalibration, which additionally requires E[Y | f(X) = v, X ∈G] = v,
is strictly stronger and implies multi-accuracy. Multicalibration is the preferable target for three reasons.
First, practical post-hoc algorithms like MCGrad (Tax et al. 2026)³ naturally produce multicalibrated predictors (Hébert-Johnson et al. 2018), so the stronger guarantee comes at no additional cost. Second, multicalibration covers a broader set of use cases where multi-accuracy no longer suffices and full calibration conditional on both features and score level is required: when the population shift is mediated by the device’s scores (e.g., using the device’s scores to decide which items to label, or threshold-based filtering), when prevalence is estimated within score strata, or when scores are used as inputs to downstream regressions. Third, multicalibration provides robustness against misspecification of which features drive the shift, since a predictor multicalibrated with respect to a rich class of subgroups is automatically multi-accurate for any sub-partition. Both conditions require that the calibration features capture the dimensions along which the population shift occurs (an ignorability assumption).
³ An open-source implementation is available at https://mcgrad.dev.
Simulation: standard methods fail, multicalibration succeeds
We illustrate the theoretical results above with a simulation.
We generate data with a binary covariate
X ∈{0, 1} and binary outcome Y , where P(Y = 1 | X = 1) = 0.85 and P(Y = 1 | X = 0) = 0.15. We
simulate a classifier with hardcoded, systematically biased predictions: 10% underestimation when X = 0
and 10% overestimation when X = 1.
All calibration parameters are learned on a balanced training
distribution (P(X = 0) = 0.5). Prevalence is estimated by averaging the predicted probabilities over the
target sample.
We then evaluate prevalence estimates on test distributions where P(X = 0) ranges from
0.01 to 0.99, repeating 50 times (details in Materials and Methods).
Figure 1: Prevalence estimation bias (% relative) under covariate shift, averaged over 50 simulation runs. The x-axis shows ∆P(X = 0), the change in P(X = 0) from the training value of 0.5. Classify & Count and Rogan-Gladen diverge with increasing shift; isotonic regression (global calibration) shows moderate bias; MCGrad maintains near-zero bias across all shift levels. Classify & Count and Rogan-Gladen curves are cropped at the ±40% axis limits; their bias continues to grow beyond this range, exceeding ±100% at extreme shifts (see SI Appendix Figure S3 for full range). Additional methods (PACC, SLD, uncalibrated averaging) are also shown in SI Appendix Figure S3.
Figure 1 shows the results. At the training distribution (center), all methods produce approximately
unbiased estimates.
As the distribution shifts, the methods diverge. Rogan-Gladen is particularly unstable:
its ratio structure amplifies estimation errors, with bias exceeding ±40% at extreme shifts.
Classify &
Count shows growing bias in the same direction. Isotonic regression (global calibration) shows more
moderate but still substantial bias (up to 15% at extreme shifts).
SLD and PACC exhibit comparably
large failures (SI Appendix Figure S3). The multicalibrated estimator maintains near-zero bias and the
lowest RMSE across the entire range of distribution shifts (SI Appendix Figure S2), confirming that the
bias reduction does not come at the cost of increased variance.
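
The contrast between a global and a feature-conditional correction can be reproduced in a few dozen lines. The sketch below is a simplified stand-in, not the code behind Figure 1. It uses the data-generating values stated in the text (P(Y = 1 | X) of 0.15 and 0.85, fixed scores of 0.135 and 0.935, a balanced calibration population, 50 repetitions of 10,000 observations) and compares only two of the methods: a global multiplicative correction and a stratum-specific additive correction. All function names are hypothetical, the target prevalence is computed analytically, and the shifted scenario P(X = 0) = 0.9 is an arbitrary choice.

```python
import random

P_Y_GIVEN_X = {0: 0.15, 1: 0.85}   # true P(Y=1 | X), as stated in the text
RAW_SCORE = {0: 0.135, 1: 0.935}   # fixed, systematically biased device scores


def draw_x(n, p_x0, rng):
    """Draw n covariate values with P(X=0) = p_x0."""
    return [0 if rng.random() < p_x0 else 1 for _ in range(n)]


def draw_y(xs, rng):
    """Draw Y ~ Bernoulli(P(Y=1 | X)) for each covariate value."""
    return [1 if rng.random() < P_Y_GIVEN_X[x] else 0 for x in xs]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fit_global_factor(xs, ys):
    """Global multiplicative correction: one factor matching mean score to mean label."""
    return mean(ys) / mean(RAW_SCORE[x] for x in xs)


def fit_stratum_offsets(xs, ys):
    """Stratum-specific additive correction: mean residual within each value of X."""
    return {
        g: mean(y - RAW_SCORE[x] for x, y in zip(xs, ys) if x == g)
        for g in (0, 1)
    }


def run(p_x0_target, n=10_000, reps=50, seed=0):
    """Average relative bias (%) of both estimators over repeated draws."""
    rng = random.Random(seed)
    truth = p_x0_target * P_Y_GIVEN_X[0] + (1 - p_x0_target) * P_Y_GIVEN_X[1]
    bias = {"global": 0.0, "stratified": 0.0}
    for _ in range(reps):
        calib_x = draw_x(n, 0.5, rng)            # balanced calibration population
        calib_y = draw_y(calib_x, rng)
        factor = fit_global_factor(calib_x, calib_y)
        offsets = fit_stratum_offsets(calib_x, calib_y)
        target_x = draw_x(n, p_x0_target, rng)   # unlabeled target sample
        est_global = mean(factor * RAW_SCORE[x] for x in target_x)
        est_strat = mean(RAW_SCORE[x] + offsets[x] for x in target_x)
        bias["global"] += 100 * (est_global - truth) / truth / reps
        bias["stratified"] += 100 * (est_strat - truth) / truth / reps
    return bias


no_shift = run(p_x0_target=0.5)
shifted = run(p_x0_target=0.9)
print("no shift:", no_shift)
print("shifted: ", shifted)
```

Both corrections are fitted once on the balanced calibration draw and never see target labels. Only the stratum-specific correction fixes the error within each value of X, so it is the only one whose bias stays near zero when P(X = 0) moves away from 0.5.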
Empirical application: employment prevalence under age distribution shift
Before applying multicalibration to LLM-generated scores, which present additional challenges such as
coarse discretization and non-standard score distributions, we first demonstrate the core mechanism in
a controlled setting with a traditional machine learning classifier. We analyze data from the American
Community Survey (ACS), a large-scale annual survey conducted by the U.S.
Census Bureau. The concept
to be measured is the rate of employment.
The measurement device is a logistic regression model of binary
employment status, with 16 sociodemographic features including age, education, marital status, disability
status, and citizenship.
Setup. We train a logistic regression classifier on data from eight U.S. states (TX, MI, PA, OH, IL,
GA, NC, VA) across 2016–2018, totaling approximately 1.5 million training observations.
The remaining
in-distribution data is split into a calibration set (n ≈644,000) and a test set (n ≈920,000). Six additional
states (CA, NY, FL, WA, AZ, CO) are held out for out-of-distribution (OOD) evaluation.
We fit two
post-hoc calibration methods on the calibration set: isotonic regression (global calibration) and MCGrad, a
multicalibration algorithm that enforces calibration conditional on both categorical and numerical features. We compare five prevalence estimation methods (Figure 2): Classify & Count with a prevalence-matched
threshold, Rogan-Gladen adjustment, importance-weighted estimation (IPW), isotonic regression, and
MCGrad.
Additional methods (PACC, SLD, uncalibrated averaging) are reported in SI Appendix Table
S1.
Age distribution shift.
Employment rates vary dramatically by age: approximately 47% for ages 16–24,
76% for ages 25–54, 61% for ages 55–64, and 17% for ages 65+. This makes age an ideal dimension along
which to construct meaningful distribution shifts.
We create synthetic target populations by resampling
test data with shifted age distributions: young-skewed (oversampling ages 16–30), old-skewed (oversampling
ages 60+), and bimodal (oversampling both tails). The resulting populations have true employment rates
ranging from 12.8% to 46.0%.
All calibration parameters are estimated once on the original calibration set
and held fixed across scenarios.
Figure 2: Prevalence estimation |bias| (percentage points) under synthetic age distribution shift, for in-distribution states (left) and out-of-distribution states (right). Each marker shape represents a different age shift scenario; horizontal lines show the mean across scenarios. Full numerical results in SI Appendix Table S1.
Results. With no age shift, all methods produce approximately unbiased estimates (Figure 2).
Under
shift, the methods diverge sharply. Rogan-Gladen fails severely (12–19pp bias under age shift).
SLD,
designed for label shift rather than covariate shift, shows comparably large bias (SI Appendix Table S1). Classify & Count and isotonic regression show moderate but growing bias (up to 8pp under age shift).
IPW
performs well on simple shifts (≤1.2pp) but fails on the bimodal shift (+4.7pp in-distribution, +6.3pp
OOD) where the density ratio is hard to model. MCGrad produces near-zero bias across all in-distribution scenarios (≤0.27pp), including the bimodal
shift where IPW struggles.
In the OOD setting (held-out states with age shift), MCGrad’s bias increases
modestly (0.88–1.35pp), reflecting geographic shift along an uncalibrated dimension. Even so, MCGrad
maintains the lowest bias and RMSE across all scenarios.
LLM-based topic classification under cross-national shift
The ACS application uses a standard machine learning classifier with well-distributed probability estimates. We now examine the setting that more immediately motivates this paper: using an LLM as a zero-shot
measurement device across multiple countries and languages.
We compare two LLM output modes
that reflect current practice and the confidence elicitation literature, respectively: (1) discrete Yes/No
classifications, which is how LLMs are used in all of the applied studies cited in the introduction, and (2)
probability scores obtained via direct probability elicitation, where the LLM estimates P(Yes) and P(No)
without first committing to an answer. This design lets us test whether confidence elicitation improves
prevalence estimation, and whether multicalibration is needed in either case.
Setup. We use Claude Opus 4.6 to classify 30,000 political texts from the Comparative Agendas Project
(Baumgartner et al.
2006) as related to Law & Crime (CAP major topic code 12). The data comprises six
sub-populations across four countries and four languages (5,000 documents each): Danish parliamentary
questions, Spanish oral questions, U.S.
congressional bills, Belgian newspaper articles, Spanish media
articles, and Belgian TV news. Each document is classified twice in independent campaigns: once for a
binary Yes/No label, and once for direct probability estimates P(Yes) and P(No) (summing to 1.0).
The
LLM achieves strong discriminative performance: AUC 0.960 for binary labels and 0.987 for probability
scores on the in-distribution test set, with per-language AUCs of 0.983–0.994 for the probability scores. The calibration set (n ≈13,400) draws equally from four sub-populations (Denmark questions, Spain
questions, U.S.
bills, Belgium newspaper), ensuring that all four countries and languages are represented. Two sub-populations are held out as out-of-distribution targets: Spanish media and Belgian TV news,
which share country and language with calibration data but introduce an unseen document type.
MCGrad
is calibrated with categorical features (country, document type, party) and two numerical features (decade,
document length). For the binary-label condition, MCGrad receives the LLM’s Yes/No classification as a categorical input
feature, with all initial scores set to the calibration-set base rate.
MCGrad then learns feature-conditional
prevalence estimates from the label and metadata alone, without any probability score from the LLM. For the probability-score condition, MCGrad receives the LLM’s P(Yes) directly as the input score and
calibrates it conditional on the same metadata features.
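
To make the binary-label condition concrete, the following sketch shows how a Yes/No label plus one metadata feature can yield feature-conditional prevalence estimates without any probability score. It is a deliberately crude stand-in for MCGrad, not its algorithm or API: it tabulates the positive rate for each (LLM label, stratum) cell on calibration data, falls back to the label-only rate for unseen cells, and averages those rates over the target. The function names, the fallback rule, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_conditional_rates(calib):
    """Estimate P(Y=1 | llm_label, stratum) from labeled calibration records.

    Each record is (llm_label, stratum, y) with llm_label and y in {0, 1}.
    The key (llm_label, None) stores the pooled, label-only fallback rate.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [positives, total]
    for llm_label, stratum, y in calib:
        for key in ((llm_label, stratum), (llm_label, None)):
            counts[key][0] += y
            counts[key][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}


def estimate_prevalence(target, rates):
    """Average the fitted conditional rates over unlabeled (llm_label, stratum) records."""
    scores = []
    for llm_label, stratum in target:
        key = (llm_label, stratum)
        if key not in rates:
            key = (llm_label, None)  # unseen cell: use the pooled rate for this label
        scores.append(rates[key])
    return sum(scores) / len(scores)


# Toy calibration data (hypothetical): the LLM's "Yes" is less reliable on questions.
calib = (
    [(1, "bills", 1)] * 9 + [(1, "bills", 0)] * 1
    + [(0, "bills", 1)] * 1 + [(0, "bills", 0)] * 9
    + [(1, "questions", 1)] * 6 + [(1, "questions", 0)] * 4
    + [(0, "questions", 0)] * 10
)
rates = fit_conditional_rates(calib)

# Unlabeled target made up entirely of questions, half of them labeled "Yes".
target = [(1, "questions")] * 5 + [(0, "questions")] * 5
naive = sum(label for label, _ in target) / len(target)   # Classify & Count
adjusted = estimate_prevalence(target, rates)
print(naive, adjusted)
```

In this toy target, counting "Yes" labels reports a higher prevalence than the cell-wise estimate, because the calibration data show that "Yes" on questions is correct only part of the time. A table of cell frequencies does not scale to many features or to continuous ones, which is where a boosting-based multicalibration algorithm is needed.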
Results. Figure 3 shows prevalence estimation bias across five scenarios: a baseline with no shift, two
within-calibration shifts (country composition, document type composition), and two OOD scenarios.
Figure 3: Prevalence estimation |bias| (percentage points) across the shift gradient. Each marker shape
represents a different shift scenario; horizontal lines show the mean across scenarios.
MCGrad achieves
near-zero bias within the calibration distribution in both binary-label and probability-score conditions. Full
numerical results in SI Appendix Table S2.
The failure pattern is clearly visible: Classify & Count, the standard practice of counting positive LLM
classifications, produces bias of +1.6 to +4.8pp across all scenarios. The Rogan-Gladen adjustment reduces
bias within calibration but still shows +3.1 to +3.7pp on OOD populations.
Isotonic regression shows
moderate bias within calibration (≤0.9pp) but larger OOD bias (2.5–2.6pp). SLD, designed for label shift,
shows very large bias (+5 to +22pp; SI Appendix Table S2).
MCGrad on binary labels achieves near-zero bias within the calibration distribution (≤0.4pp) and degrades
gracefully on OOD targets: +0.7pp on Belgian TV and -1.9pp on Spanish media. This is achieved using
only the LLM’s Yes/No classification and document metadata, with no probability scores.
MCGrad on
probability scores achieves comparable within-calibration performance (≤0.4pp) but shows larger OOD
bias on Spanish media (-4.5pp) and Belgian TV (+1.6pp). The finding that binary labels with MCGrad
can match or outperform probability scores with MCGrad suggests that the metadata features, not the
input score quality, drive the calibration improvement.
IPW, the standard covariate-shift method, achieves near-zero within-calibration bias but fails severely on
OOD populations (-12.2pp on Spanish media) because the target contains feature values absent from the
source (a positivity violation). Unlike MCGrad, IPW also requires re-estimating density ratios for each
target population.
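
The positivity problem can be seen directly in a minimal importance-weighting sketch. The version below is an assumption-laden simplification: it handles a single discrete feature and computes density ratios as ratios of empirical frequencies, whereas our experiments estimate them with logistic regression on several features. Raising an error on unseen target values is a design choice of this sketch; a model-based ratio estimator would extrapolate silently instead. All names and data are hypothetical.

```python
from collections import Counter


def density_ratios(source_features, target_features):
    """Return w(x) = P_target(x) / P_source(x) for one discrete feature."""
    src = Counter(source_features)
    tgt = Counter(target_features)
    n_src, n_tgt = len(source_features), len(target_features)
    missing = sorted(set(tgt) - set(src))
    if missing:
        # Positivity violation: no source observations can be reweighted
        # to represent these target values.
        raise ValueError(f"target values absent from source: {missing}")
    return {x: (tgt[x] / n_tgt) / (src[x] / n_src) for x in src}


def ipw_prevalence(source, ratios):
    """Self-normalized importance-weighted mean of source labels.

    `source` is a list of (feature_value, y) pairs with y in {0, 1}.
    """
    total_weight = sum(ratios[x] for x, _ in source)
    return sum(ratios[x] * y for x, y in source) / total_weight


# Hypothetical labeled source: 20% positives among newspapers, 60% among bills.
source = (
    [("newspaper", 1)] * 10 + [("newspaper", 0)] * 40
    + [("bills", 1)] * 30 + [("bills", 0)] * 20
)
source_features = [x for x, _ in source]

# Unlabeled target dominated by newspapers; the ratios must be re-estimated
# for every new target population.
target_features = ["newspaper"] * 80 + ["bills"] * 20
ratios = density_ratios(source_features, target_features)
estimate = ipw_prevalence(source, ratios)
print(estimate)
```

Calling `density_ratios(source_features, ["tv"])` raises the `ValueError`, since no reweighting of newspapers and bills can stand in for a document type the source never contained.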
Comparison with the ACS application. Across both applications, the pattern is consistent: MCGrad
achieves near-zero bias when the target population’s features are within the calibration support, and
degrades when the shift is along an uncalibrated dimension.
IPW matches MCGrad within calibration but
fails more severely on OOD targets. MCGrad’s practical advantage is that it requires no target-specific
estimation: a single calibrated device can be applied to any target population.
This is the “universal
adaptability” property of Kim et al. (2022).
We replicate the CAP analysis using Llama 3.3 70B Instruct
(an open-weight model) in SI Appendix S2; the results are consistent, confirming that the findings are not
specific to a particular LLM.
Discussion
We have shown that multicalibration—calibration conditional on the input features rather than just
on average—is sufficient for accurate model-based prevalence estimation under population shift.
The
minimal theoretical requirement is multi-accuracy (correct predictions on average within each subgroup),
which multicalibration implies. Both conditions require that the calibration features capture the relevant
dimensions of population shift, and both require that the target population’s features lie within the support
of the calibration distribution.
Both empirical applications confirm this: when calibrated features overlap
with the target population, MCGrad achieves near-zero bias; when the target introduces a novel feature
value, bias increases with the severity of the shift. Practitioners should ensure their calibration data covers
the key dimensions along which target populations may differ.
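
One simple way to act on this recommendation is to check, before deployment, whether the target contains categorical feature values that never appear in the calibration data. The helper below is a hypothetical sketch of such a check. It covers only categorical features (numerical features would need a range or density comparison), and the feature names and rows are invented for illustration.

```python
def unsupported_values(calib_rows, target_rows, features):
    """Return, per feature, the target values that never occur in calibration data."""
    gaps = {}
    for name in features:
        seen = {row[name] for row in calib_rows}
        novel = {row[name] for row in target_rows} - seen
        if novel:
            gaps[name] = sorted(novel)
    return gaps


# Hypothetical metadata rows.
calib_rows = [
    {"country": "BE", "document_type": "newspaper"},
    {"country": "ES", "document_type": "questions"},
]
target_rows = [
    {"country": "BE", "document_type": "tv_news"},
    {"country": "ES", "document_type": "questions"},
]

gaps = unsupported_values(calib_rows, target_rows, ["country", "document_type"])
print(gaps)
```

An empty result does not guarantee unbiased estimates, since the shift may run along a feature that was never recorded. A non-empty result does flag a target that falls outside the calibration support on a known dimension.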
A central practical advantage of multicalibration is that it produces a target-independent measurement
device. Unlike importance-weighted methods, which require re-estimating density ratios for each new
target and fail under positivity violations (IPW: -12.2pp on CAP Spanish media, vs.
MCGrad: -1.9pp), a
multicalibrated device is calibrated once and deployed without target-specific re-estimation. This is the
“universal adaptability” property of Kim et al.
(2022). Our CAP results further show that MCGrad on discrete binary labels achieves comparable or better
prevalence estimation than MCGrad on continuous probability scores, suggesting that metadata features
matter more than input score quality for prevalence estimation under shift.
This has immediate practical
implications: researchers using the standard workflow of prompting an LLM for Yes/No classifications can
apply multicalibration directly to those labels, without needing confidence elicitation. Multicalibration
does not require access to a model’s internals: observable metadata (document source, language, text
length) can serve as segment features (Detommaso et al.
2024). Researchers who currently validate
by reporting accuracy or AUC (Grimmer et al.
2022) should be aware that these metrics provide no
information about the calibration errors that drive prevalence bias. Several limitations warrant discussion.
First, the guarantee holds only for shifts along calibrated dimensions;
out-of-domain performance degrades when target populations introduce novel feature values. Second,
multicalibration requires labeled calibration data of sufficient size.
Our two applications show MCGrad
performing well with both large (644K, ACS) and moderate (13.4K, CAP) calibration sets, but the
minimum required depends on the complexity of the feature space. In zero-shot LLM settings, calibration
labels may require the manual annotation LLMs were intended to avoid; prediction-powered inference
(Angelopoulos et al.
2023) offers a complementary framework. Third, our framework assumes covariate
shift (P(Y | X) stable); under concept drift, no purely statistical correction substitutes for new labeled
data.
Finally, both applications use settings where ground-truth labels enable direct bias measurement; in
practice, practitioners may lack such labels. These results connect two literatures that have developed in isolation.
The quantification literature
has focused on correcting aggregate error rates but has not engaged with feature-conditional calibration
(González et al. 2017; Wu and Resnick 2024).
The multicalibration literature has focused on individual-level
prediction quality and fairness but has not emphasized the implications for population-level inference
(Hébert-Johnson et al. 2018; Kim et al.
2022). Our contribution is to show that calibration conditional on
the feature space is what makes LLM-based prevalence estimation reliable under the distribution shifts
that motivate its use.
Materials and Methods
Simulation. We generate n = 10,000 observations with a binary covariate X ∈{0, 1} and outcome
Y ∼Bernoulli(P(Y | X)), where P(Y = 1 | X = 1) = 0.85 and P(Y = 1 | X = 0) = 0.15.
The
classifier produces deterministic scores ˆp(X = 0) = 0.135 and ˆp(X = 1) = 0.935, representing 10%
multiplicative bias within each stratum. For each of B = 50 iterations, we generate fresh calibration data
from P(X = 0) = 0.5, estimate all method-specific parameters, then evaluate bias and mean squared
error on 20 test distributions with P(X = 0) ranging from 0.01 to 0.99.
The seven methods compared
(uncalibrated averaging, Classify & Count with a prevalence-matched threshold, Rogan-Gladen adjustment,
PACC, SLD/EMQ, global multiplicative calibration, and multicalibration with stratum-specific additive
corrections) are detailed with full mathematical definitions in the SI Appendix.
Empirical application (ACS). We use the ACS Public Use Microdata Sample via the folktables package,
with the ACSEmployment prediction task (binary: employed vs. not employed) and 16 sociodemographic
features.
Training data comprises eight states (TX, MI, PA, OH, IL, GA, NC, VA) across 2016–2018. The base model is logistic regression with standard scaling.
Post-hoc calibration uses isotonic regression
(global) and MCGrad (multicalibration with categorical and numerical segment features) on a held-out
calibration set (n ≈644,000). In addition to Classify & Count, Rogan-Gladen, isotonic regression,
and MCGrad, we include importance-weighted prevalence estimation (IPW), which estimates density
ratios between source and target via logistic regression on all 16 features, and two methods from the
quantification literature: PACC (Probabilistic Adjusted Classify & Count) (González et al.
2017) and
SLD (Saerens-Latinne-Decaestecker) (Saerens et al. 2002).
Uncalibrated averaging is also reported in SI
Appendix Table S1. Synthetic age-shifted populations are created by importance-weighted resampling
of the test set (n ≈920,000), with exponential weights favoring young ages, old ages, or both extremes.
RMSE is computed via bootstrap resampling (200 iterations per scenario). Six additional states (CA, NY,
FL, WA, AZ, CO) serve as OOD evaluation data.
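
The resampling and bootstrap steps described above can be sketched as follows. This is an illustration under stated assumptions rather than our pipeline: the weight form exp(rate × (age − pivot)), the rate and pivot values, the synthetic records, and all function names are hypothetical, and the example uses the resampled population's own mean as the reference value for the RMSE.

```python
import math
import random


def exponential_weights(ages, rate, pivot):
    """Weights exp(rate * (age - pivot)); rate > 0 favors old ages, rate < 0 young ages."""
    return [math.exp(rate * (age - pivot)) for age in ages]


def resample(records, weights, n, rng):
    """Draw n records with replacement, with probability proportional to weight."""
    return rng.choices(records, weights=weights, k=n)


def bootstrap_rmse(estimator, records, truth, iterations, rng):
    """RMSE of `estimator` around `truth` across bootstrap resamples of `records`."""
    squared_errors = []
    for _ in range(iterations):
        boot = rng.choices(records, k=len(records))
        squared_errors.append((estimator(boot) - truth) ** 2)
    return math.sqrt(sum(squared_errors) / iterations)


def mean_age(rows):
    return sum(age for age, _ in rows) / len(rows)


def employment_rate(rows):
    return sum(employed for _, employed in rows) / len(rows)


rng = random.Random(0)
# Synthetic (age, employed) records with a made-up employment rule.
records = [
    (age, 1 if 25 <= age < 55 else 0)
    for age in range(16, 91)
    for _ in range(20)
]
ages = [age for age, _ in records]

old_skewed = resample(records, exponential_weights(ages, 0.05, 50), 2000, rng)
young_skewed = resample(records, exponential_weights(ages, -0.05, 50), 2000, rng)

truth = employment_rate(old_skewed)
rmse = bootstrap_rmse(employment_rate, old_skewed, truth, 200, rng)
print(mean_age(records), mean_age(old_skewed), mean_age(young_skewed), rmse)
```

A bimodal population could be built the same way by summing the two weight vectors element-wise before resampling.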
Empirical application (CAP). We use data from the Comparative Agendas Project (Baumgartner
et al.
2006), which provides expert-coded policy topic labels for political texts across countries. The
binary outcome is whether a document addresses Law & Crime (CAP major topic code 12).
Chunk 120
We sample
30,000 documents (5,000 from each of six sub-populations: Danish parliamentary questions, Spanish oral
questions, U.S. congressional bills, Belgian newspaper articles, Spanish media articles, and Belgian TV
news).
Chunk 121
The measurement device is Claude Opus 4.6. Each document is classified in two independent
campaigns: (1) binary Yes/No classification using the CAP codebook definition of Law & Crime, and (2)
direct probability elicitation, where the LLM estimates P(Yes) and P(No) without first committing to an
answer.
Chunk 122
The two campaigns were run independently to avoid anchoring contamination. The calibration set
(n ≈13,400) draws equally from four sub-populations (Denmark questions, Spain questions, U.S.
Chunk 123
bills, and
Belgium newspaper). MCGrad is calibrated with categorical features (country, document type, political
party) and two numerical features (decade and document length).
Chunk 124
For the binary-label condition, the
LLM’s Yes/No classification is passed to MCGrad as a categorical input feature, with initial scores set to
the calibration-set base rate; MCGrad learns feature-conditional prevalence estimates from the metadata
and LLM label alone. For the probability-score condition, MCGrad receives P(Yes) as the input score.
Chunk 125
Classify & Count reports the fraction of positive LLM classifications. Rogan-Gladen adjustment corrects
this fraction using binary-label TPR and FPR estimated on the calibration set.
Chunk 126
Isotonic regression is fitted
on probability scores using the same calibration set. IPW estimates density ratios via logistic regression
on country, document type, decade, and document length.
Chunk 127
SLD results are reported in SI Appendix Table
S2. Two sub-populations are held out as out-of-distribution targets: Spanish media and Belgian TV
news, which differ from the calibration data only in document type while sharing the same countries and
languages.
Chunk 128
Software. MCGrad is available at https://github.com/facebookincubator/MCGrad. Simulation and analysis code are available at https://github.com/facebookresearch/multicalibrated_llm_measurement.
Competing Interests
The authors declare no competing interests.
Disclosure of Delegation to Generative AI
The authors declare the use of generative AI in the research and writing process. According to the GAIDeT
taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision:
• Literature search and systematization
• Code generation
• Code optimization
• Data collection
• Data cleaning
• Data analysis
• Visualization
• Reproducibility testing
• Text generation
• Proofreading and editing
• Reformatting
• Identification of limitations
The GAI tools used were: Claude Opus 4.6, Claude Opus 4.7, Gemini 3 Pro.
Responsibility for the final manuscript lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes. Declaration submitted by: Fridolin Linder. AI was involved for the listed tasks but no task was exclusively done by AI. All outputs were manually verified and iterated on by the authors.
Data Availability
The American Community Survey data is publicly available via the folktables package. The Comparative Agendas Project data is publicly available at https://www.comparativeagendas.net. Simulation and analysis code are available at https://github.com/facebookresearch/multicalibrated_llm_measurement.
References
Angelopoulos, Anastasios N, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. 2023. “Prediction-Powered Inference.” Science 382 (6671): 669–74.
Baumgartner, Frank R., Christoffer Green-Pedersen, and Bryan D. Jones. 2006. “Comparative Studies of Policy Agendas.” Journal of European Public Policy 13 (7): 959–74.
Benoit, Kenneth, Scott De Marchi, Conor Laver, Michael Laver, and Jinshuai Ma. 2026. “Using Large Language Models to Analyze Political Texts Through Natural Language Understanding.” American Journal of Political Science, ahead of print. https://doi.org/10.1111/ajps.70050.
Detommaso, Gianluca, Martin Bertran, Riccardo Fogliato, and Aaron Roth. 2024. “Multicalibration for Confidence Scoring in LLMs.” Proceedings of the 41st International Conference on Machine Learning (ICML).
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences 120 (30): e2305016120.
González, Pablo, Alberto Castaño, Nitesh V. Chawla, and Juan José Del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50 (5): 1–40.
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Hébert-Johnson, Ursula, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. 2018. “Multicalibration: Calibration for the (Computationally-Identifiable) Masses.” Proceedings of the 35th International Conference on Machine Learning (ICML), 1939–48.
Kadavath, Saurav, Tom Conerly, Amanda Askell, et al. 2022. “Language Models (Mostly) Know What They Know.” arXiv Preprint arXiv:2207.05221.
Karjus, Andres. 2025. “Machine-Assisted Quantitizing Designs: Augmenting Humanities and Social Sciences with Artificial Intelligence.” Humanities & Social Sciences Communications.
Kim, Michael P., Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. 2022. “Universal Adaptability: Target-Independent Inference That Competes with Propensity Scoring.” Proceedings of the National Academy of Sciences 119 (4): e2108097119.
Lee, Robert Y, Kevin S Li, James Sibley, et al. 2025. “Assessment of a Zero-Shot Large Language Model in Measuring Documented Goals-of-Care Discussions.” Journal of Pain and Symptom Management.
Maletzke, André G., Denis M. dos Reis, Everton A. Cherman, and Gustavo E. A. P. A. Batista. 2018. “On the Need of Class Ratio Insensitive Drift Tests for Data Streams.” Proceedings of the 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications (LIDTA), 110–24.
Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, Marta Miori, and Phillip Schmedeman. 2024. “Do AIs Know What the Most Important Issue Is? Using Language Models to Code Open-Text Social Survey Responses at Scale.” Research & Politics 11 (1): 1–7.
Oliveira, Rodrigo de, Matthew Garber, James M. Gwinnutt, et al. 2025. “A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Natural Language Processing.” JAMIA Open 8 (4).
Overos, Henry David, Roman Hlatky, Ojashwi Pathak, et al. 2024. “Coding with the Machines: Machine-Assisted Coding of Rare Event Data.” PNAS Nexus 3 (5): pgae165.
Ren, Kevin, Santiago Cortes-Gomez, Carlos Miguel Patiño, et al. 2025. “Predicting Language Models’ Success at Zero-Shot Probabilistic Prediction.” Findings of the Association for Computational Linguistics: EMNLP 2025, 18337–63.
Rogan, Walter J., and Beth Gladen. 1978. “Estimating Prevalence from the Results of a Screening Test.” American Journal of Epidemiology 107 (1): 71–76.
Saerens, Marco, Patrice Latinne, and Christine Decaestecker. 2002. “Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure.” Neural Computation 14 (1): 21–41.
Storkey, Amos. 2009. “When Training and Test Sets Are Different: Characterizing Learning Transfer.” In Dataset Shift in Machine Learning. MIT Press.
Sushil, Madhumita, Travis Zack, Divneet Mandair, et al. 2024. “A Comparative Study of Large Language Model-Based Zero-Shot Inference and Task-Specific Supervised Classification of Breast Cancer Pathology Reports.” Journal of the American Medical Informatics Association 31 (10): 2315–27.
Tax, Niek, Lorenzo Perini, Fridolin Linder, et al. 2026. “MCGrad: Multicalibration at Web Scale.” Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Tian, Katherine, Eric Mitchell, Allan Zhou, et al. 2023. “Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Tojima, Tatsuya, and Mitsuo Yoshida. 2025. “Zero-Shot Classification of Art with Large Language Models.” IEEE Access 13: 17426–39.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” Proceedings of the 11th International Conference on Learning Representations (ICLR).
Weidmann, Nils B., Mats Faulborn, and David Garcia. 2026. “Large Language Models Are Democracy Coders with Attitudes.” PS: Political Science & Politics 59 (1): 17–23.
Wu, Siqi, and Paul Resnick. 2024. “Calibrate-Extrapolate: Rethinking Prevalence Estimation with Black Box Classifiers.” Proceedings of the International AAAI Conference on Web and Social Media (ICWSM) 18: 1634–47.
Yang, Yahan, Soham Dan, Dan Roth, and Insup Lee. 2023. “On the Calibration of Multilingual Question Answering LLMs.” arXiv Preprint arXiv:2311.08669.
Yuan, Weizhe, Richard Yuanzhe Pang, Kyunghyun Cho, et al. 2024. “Self-Rewarding Language Models.” Proceedings of the 41st International Conference on Machine Learning (ICML).
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems (NeurIPS).
SI Appendix: Unbiased Prevalence Estimation with Multicalibrated LLMs
S1. Formal Definitions of Prevalence Estimation Methods
This section provides full mathematical definitions of the seven prevalence estimation methods compared in the simulation study.
Setup. Let $h(X) \in [0, 1]$ denote the device’s probabilistic prediction for input $X$, with true label $Y \in \{0, 1\}$. The goal is to estimate the target prevalence $\pi^* = P^*(Y = 1)$ using only unlabeled target data $\{X_i^*\}_{i=1}^{n}$ and calibration parameters estimated from a labeled source dataset.
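S1.1 Uncalibrated Averaging
$$\hat{\pi}_{\mathrm{raw}} = \frac{1}{n} \sum_{i=1}^{n} h(X_i^*)$$
S1.2 Classify & Count
Given a threshold $\tau$ chosen on calibration data:
$$\hat{\pi}_{\mathrm{CC}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i^*) \geq \tau]$$
In the simulation, $\tau$ is chosen so that $\hat{\pi}_{\mathrm{CC}}$ matches the true prevalence on the calibration set.
S1.3 Rogan-Gladen (Adjusted Count)
$$\hat{\pi}_{\mathrm{RG}} = \frac{\hat{\pi}_{\mathrm{CC}} - \widehat{\mathrm{FPR}}}{\widehat{\mathrm{TPR}} - \widehat{\mathrm{FPR}}}$$
where $\widehat{\mathrm{TPR}}$ and $\widehat{\mathrm{FPR}}$ are estimated from calibration data at threshold $\tau$ (Rogan and Gladen 1978).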
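The Classify & Count and Rogan-Gladen estimators defined in S1.2 and S1.3 can be sketched in a few lines of standard-library Python. The function names and the toy numbers in the usage note are illustrative assumptions, not the replication code. The sketch applies no clipping to [0, 1], matching the formula as written, so the estimate can leave the unit interval when the error rates do not transfer to the target.

```python
def classify_and_count(scores, tau):
    """Fraction of target scores at or above the threshold tau."""
    return sum(1 for s in scores if s >= tau) / len(scores)


def error_rates(cal_scores, cal_labels, tau):
    """Estimate (TPR, FPR) at threshold tau from labeled calibration data."""
    true_pos = false_pos = n_pos = n_neg = 0
    for score, label in zip(cal_scores, cal_labels):
        predicted_positive = score >= tau
        if label == 1:
            n_pos += 1
            true_pos += predicted_positive
        else:
            n_neg += 1
            false_pos += predicted_positive
    return true_pos / n_pos, false_pos / n_neg


def rogan_gladen(target_scores, cal_scores, cal_labels, tau):
    """Adjusted count: (CC - FPR) / (TPR - FPR), with rates from calibration data."""
    tpr, fpr = error_rates(cal_scores, cal_labels, tau)
    if tpr == fpr:
        raise ValueError("TPR equals FPR; the adjustment is undefined")
    return (classify_and_count(target_scores, tau) - fpr) / (tpr - fpr)
```

For instance, with calibration scores `[0.9, 0.8, 0.3, 0.6, 0.2, 0.1]`, labels `[1, 1, 1, 0, 0, 0]`, and `tau = 0.5`, the estimated TPR is 2/3 and the FPR is 1/3; a target sample with half of its scores above the threshold is then mapped to a prevalence of (0.5 - 1/3) / (1/3) = 0.5.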
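S1.4 Probabilistic Adjusted Classify & Count (PACC)
$$\hat{\pi}_{\mathrm{PACC}} = \frac{\bar{h} - \hat{\mu}_0}{\hat{\mu}_1 - \hat{\mu}_0}$$
where $\bar{h} = \frac{1}{n} \sum_i h(X_i^*)$, and $\hat{\mu}_1 = \hat{E}[h(X) \mid Y = 1]$ and $\hat{\mu}_0 = \hat{E}[h(X) \mid Y = 0]$ are estimated from calibration data (González et al. 2017).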
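A minimal sketch of the PACC estimator in S1.4, assuming scores and labels are plain Python lists; the function name is an illustrative choice. As with Rogan-Gladen, no clipping is applied, so the output can fall outside [0, 1] when the class-conditional mean scores differ between calibration and target data.

```python
def pacc(target_scores, cal_scores, cal_labels):
    """PACC: (mean target score - mu0) / (mu1 - mu0), with mu0, mu1 from calibration data."""
    pos_scores = [s for s, y in zip(cal_scores, cal_labels) if y == 1]
    neg_scores = [s for s, y in zip(cal_scores, cal_labels) if y == 0]
    mu1 = sum(pos_scores) / len(pos_scores)  # mean score among calibration positives
    mu0 = sum(neg_scores) / len(neg_scores)  # mean score among calibration negatives
    if mu1 == mu0:
        raise ValueError("class-conditional mean scores coincide; PACC is undefined")
    h_bar = sum(target_scores) / len(target_scores)
    return (h_bar - mu0) / (mu1 - mu0)
```

With calibration means of 0.8 for positives and 0.2 for negatives, a target mean score of 0.35 maps to (0.35 - 0.2) / 0.6 = 0.25.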
S1.5 SLD (EMQ)
The Saerens-Latinne-Decaestecker algorithm iterates:
1. Initialize $\hat{\pi}^{(0)}$ from the source prevalence $\pi_s$.
2. E-step: Adjust posteriors for the new prior:
$$\tilde{h}_i^{(t)} = \frac{(\hat{\pi}^{(t)}/\pi_s) \cdot h(X_i^*)}{(\hat{\pi}^{(t)}/\pi_s) \cdot h(X_i^*) + ((1 - \hat{\pi}^{(t)})/(1 - \pi_s)) \cdot (1 - h(X_i^*))}$$
3. M-step: $\hat{\pi}^{(t+1)} = \frac{1}{n} \sum_i \tilde{h}_i^{(t)}$
4. Repeat until convergence (Saerens et al. 2002).
S1.6 Global Calibration
In the simulation, global calibration applies a multiplicative correction:
$$h_{\mathrm{cal}}(X) = c \cdot h(X)$$
where $c = \bar{Y}_{\mathrm{cal}} / \bar{h}_{\mathrm{cal}}$ is estimated on calibration data. In the empirical applications, global calibration uses isotonic regression.
S1.7 Multicalibration
In the simulation, which has a single binary covariate, multicalibration reduces to stratum-specific additive corrections:
$$h_{\mathrm{mc}}(X) = h(X) + \hat{\epsilon}_g \quad \text{for } X \in \text{stratum } g$$
where $\hat{\epsilon}_g = \bar{Y}_g - \bar{h}_g$ is estimated on calibration data within each stratum.
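The two simulation-style corrections in S1.6 and S1.7 can be contrasted with a small sketch. The function names and the toy data in the usage note are illustrative assumptions; the sketch covers only the multiplicative and stratum-additive forms, not isotonic regression or MCGrad.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fit_global_factor(cal_scores, cal_labels):
    """c = mean label / mean score on the calibration set."""
    return mean(cal_labels) / mean(cal_scores)


def fit_stratum_offsets(cal_scores, cal_labels, cal_strata):
    """epsilon_g = mean label - mean score within each calibration stratum g."""
    by_stratum = defaultdict(list)
    for score, label, stratum in zip(cal_scores, cal_labels, cal_strata):
        by_stratum[stratum].append((score, label))
    return {
        g: mean(y for _, y in pairs) - mean(s for s, _ in pairs)
        for g, pairs in by_stratum.items()
    }


def prevalence_global(target_scores, c):
    """Average of globally rescaled scores c * h(X)."""
    return mean(c * s for s in target_scores)


def prevalence_multicalibrated(target_scores, target_strata, offsets):
    """Average of stratum-corrected scores h(X) + epsilon_g."""
    return mean(s + offsets[g] for s, g in zip(target_scores, target_strata))
```

As a toy illustration, suppose the device outputs 0.5 everywhere, the true rate is 0.5 in stratum A and 0.1 in stratum B, and the calibration set is an even mix. The global factor is 0.3 / 0.5 = 0.6, so a target made up only of stratum B is estimated at 0.3, while the stratum offsets (0.0 for A, -0.4 for B) recover 0.1.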
In the empirical applications, we use MCGrad (Tax et al. 2026), a multicalibration algorithm based on gradient boosting. MCGrad operates in logit space: given a base predictor $f_0(X)$ with logit $F_0(X) = \mathrm{logit}(f_0(X))$, it iteratively fits gradient boosted decision trees (GBDTs) on the residuals between labels and current predictions. At each round $t$, a GBDT $g_t$ is trained with the current logit predictions as init_score and with the feature matrix consisting of the segment features (categorical and numerical) augmented by the current logit prediction as an additional input feature. The logit predictor is then updated as $F_{t+1}(X) = \alpha_t \cdot (F_t(X) + g_t(X))$, where $\alpha_t$ is an unshrinkage factor estimated by logistic regression to counteract the GBDT’s learning rate. By including the prediction as a feature, GBDT splits naturally discover miscalibrated regions in the joint space of features and score levels, thereby approximating multicalibration without requiring explicit group specification. Early stopping on a validation set prevents overfitting. MCGrad uses LightGBM as the GBDT implementation. See Tax et al. (2026) for convergence results and deployment details.
S2. Robustness: Replication with Open-Weight LLM (Llama 3.3 70B)
The main text reports results using Claude Opus 4.6 as the LLM measurement device. To verify that the findings are not specific to a particular model, we replicate the CAP analysis using Llama 3.3 70B Instruct (Grattafiori et al. 2024) (4-bit NF4 quantized, run on a single A100 80GB GPU). This section reports results using two score extraction methods: token log-probabilities and verbalized confidence elicitation.
S2.1 Score Extraction Methods
Log-probabilities. For each document, the model is prompted with the CAP codebook definition of Law & Crime and asked to respond Yes or No. The score is extracted from next-token log-probabilities: h(X) = P(Yes)/(P(Yes) + P(No)). This produces highly bimodal scores: 23% of the 105,000 documents score at exactly 0.0 or 1.0, and only 6% fall in the mid-range [0.1, 0.9]. Because MCGrad’s internal logit transform maps values near 0 and 1 to ±∞, a linear squashing transformation h′(X) = ϵ + (1 − 2ϵ) · h(X) with ϵ = 0.05 is applied before fitting MCGrad.
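The log-probability score and the squashing step can be sketched as follows. The function names are illustrative assumptions, and the inputs are taken to be the two next-token log-probabilities for "Yes" and "No", however the serving stack exposes them.

```python
import math


def yes_probability(logprob_yes, logprob_no):
    """h(X) = P(Yes) / (P(Yes) + P(No)) from the two next-token log-probabilities."""
    # Subtract the larger log-probability before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)


def squash(h, eps=0.05):
    """Linear squashing h' = eps + (1 - 2 * eps) * h, mapping [0, 1] into [eps, 1 - eps]."""
    return eps + (1.0 - 2.0 * eps) * h


def logit(p):
    """Log-odds; finite only for p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))
```

With ϵ = 0.05, a raw score of exactly 1.0 becomes 0.95 and its logit becomes log(19) ≈ 2.94 instead of diverging, which is the reason the transformation is applied before a logit-space calibrator.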
Verbalized confidence (2-stage). A two-stage dialogue first asks the model to classify the document (Yes/No), then asks it to estimate the probability that its answer is correct, with an anti-certainty instruction (“Note: very few things are 0% or 100% certain”) to discourage degenerate outputs (Tian et al. 2023). The score is P(correct) if the answer is Yes and 1 − P(correct) if No. This produces scores in [0.01, 0.99] with negligible boundary mass and 11 unique score values. No squashing is required.
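The mapping from the two-stage answer to a P(Yes) score is short enough to state directly in code. The function name and the string encoding of the answer are illustrative assumptions.

```python
def verbalized_score(answer, p_correct):
    """Convert a Yes/No answer plus the stated P(correct) into a score for the Yes class."""
    if answer == "Yes":
        return p_correct
    if answer == "No":
        return 1.0 - p_correct
    raise ValueError("answer must be 'Yes' or 'No'")
```

A "No" answer with a stated 90% confidence therefore yields a score of about 0.1 for the positive class.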
S2.2 Data and Calibration
The Llama analysis uses the full 105,000-document sample (15,000 per sub-population for Denmark questions, Spain questions, U.S. bills, and Belgium newspaper; 30,000 for Spanish media; 15,000 for Belgian TV). The calibration set (n ≈ 40,000) is drawn equally from the four in-distribution sub-populations. MCGrad is calibrated with categorical features (country, document type, party) and one numerical feature (decade).
S2.3 Results: Verbalized Confidence Scores
Table S3 shows prevalence estimation bias using Llama 3.3 70B with verbalized confidence scores.
| Scenario | Shift Type | True Prev. | CC | RG | IPW | Iso. | MCGrad |
|---|---|---|---|---|---|---|---|
| Baseline | None | 8.1% | +14.7 | +0.6 | +0.1 | +0.2 | +0.2 |
| Country shift | Within-cal. | 8.7% | +15.6 | +2.2 | -0.3 | +1.3 | +0.1 |
| Doc-type shift | Within-cal. | 6.5% | +16.0 | +1.7 | +0.0 | +0.7 | +0.1 |
| Spain media | OOD doc type | 19.3% | +15.4 | +6.6 | -9.8 | -7.2 | -4.9 |
| Belgium TV | OOD doc type | 11.1% | +13.3 | -0.0 | -3.5 | -1.6 | -3.4 |
Table S3: Prevalence estimation bias (pp) for Law & Crime topic using Llama 3.3 70B with verbalized confidence scores. CC = Classify & Count, RG = Rogan-Gladen, IPW = importance-weighted estimation, Iso. = isotonic regression.
The pattern is consistent with the main text’s Claude Opus results: MCGrad achieves near-zero bias within the calibration distribution (≤0.2pp) and degrades on OOD populations (-3.4 to -4.9pp). Several differences are notable:
• Higher raw CC bias (+14-16pp vs. +2-5pp with Opus).
• Comparable MCGrad within-calibration performance (≤0.2pp for both models), confirming that multicalibration corrects for model-specific calibration errors.
• Larger OOD bias on Spanish media (-4.9pp vs. -2.5pp with Opus binary labels), reflecting the combination of a weaker base model with the coarser verbalized score distribution (11 unique values vs. Opus’s 43).
S2.4 Score Distribution: Log-Probabilities vs. Verbalized Confidence
The bimodal distribution of Llama’s log-probability scores illustrates a broader challenge for LLM-based measurement. RLHF-tuned instruction-following models tend to produce highly confident outputs, pushing token probabilities toward 0 or 1. This creates two problems for prevalence estimation: (1) the scores carry little information about uncertainty, producing large raw bias even at baseline (+18pp), and (2) post-hoc calibration methods that operate in logit space (including MCGrad) require score preprocessing to avoid numerical instability.
Verbalized confidence elicitation partially addresses both problems by producing scores that are better distributed (75% in [0.1, 0.9]) and better calibrated out of the box (log loss 0.525 vs. 1.707 for log-probabilities). However, the scores remain coarsely discretized (11 unique values), and as shown in both the Llama and Opus analyses, the quality of the input scores matters less than the metadata features for MCGrad’s prevalence estimation performance under shift.
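For reference, binary log loss can be computed as in the sketch below. The clipping constant `eps` is an assumption made so that scores of exactly 0 or 1 give a finite value; the text does not state how boundary scores were handled when the reported log-loss figures were computed.

```python
import math


def log_loss(scores, labels, eps=1e-15):
    """Mean negative log-likelihood of binary labels under the predicted scores."""
    total = 0.0
    for score, label in zip(scores, labels):
        p = min(max(score, eps), 1.0 - eps)  # keep log() finite at the boundaries
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(scores)
```

A single confidently wrong prediction (score 1.0 for a negative document) contributes a loss above 30 under this clipping, which illustrates how a bimodal score distribution with boundary mass can produce a large average log loss.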
S2.5 Additional Baselines: SLD and PACC on Llama Scores
The SLD (EMQ) algorithm, designed for label shift rather than covariate shift, diverges catastrophically on Llama’s verbalized confidence scores, producing prevalence estimates biased by +33 to +60pp. This occurs because the verbalized scores are not calibrated posteriors, violating SLD’s core assumption. PACC shows moderate bias (+0.6 to +5.9pp within calibration, +2.4 to +5.9pp OOD). Full results including SLD and PACC are available in the replication code.
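To make the dependence on calibrated posteriors concrete, the SLD iteration from S1.5 can be sketched as below. The function name, the stopping rule (`tol`, `max_iter`), and the requirement that the source prevalence lies strictly between 0 and 1 are assumptions of this sketch, not details taken from the replication code.

```python
def sld(target_scores, source_prevalence, max_iter=1000, tol=1e-10):
    """EM estimate of target prevalence under label shift (SLD / EMQ).

    Treats each score as a posterior P(Y=1 | X) under the source prior and
    assumes 0 < source_prevalence < 1.
    """
    pi = source_prevalence
    for _ in range(max_iter):
        pos_ratio = pi / source_prevalence
        neg_ratio = (1.0 - pi) / (1.0 - source_prevalence)
        # E-step: re-weight each posterior for the current prevalence estimate.
        adjusted = [
            pos_ratio * h / (pos_ratio * h + neg_ratio * (1.0 - h))
            for h in target_scores
        ]
        # M-step: the new prevalence is the mean adjusted posterior.
        new_pi = sum(adjusted) / len(adjusted)
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    return pi
```

Every update treats `h` as a source posterior, so scores that are not calibrated probabilities enter both steps unchecked; the procedure contains no mechanism for detecting or correcting that mismatch.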
S3. Detailed Results Tables
Table S1: ACS Employment Prevalence Estimation Bias
| Setting | Age Dist. | True Prev. | Raw | CC | RG | PACC | SLD | IPW | Iso. | MCGrad |
|---|---|---|---|---|---|---|---|---|---|---|
| In-Dist | Original | 46.0% | -0.31 | -0.07 | +0.26 | -0.07 | +0.01 | -0.3 | -0.30 | -0.27 |
| In-Dist | Young-skewed | 12.8% | +1.93 | +2.47 | -12.82 | -12.82 | -11.95 | -0.3 | +2.04 | -0.11 |
| In-Dist | Old-skewed | 16.8% | +7.23 | -6.62 | -16.77 | -16.77 | -16.76 | -1.2 | +6.65 | +0.22 |
| In-Dist | Bimodal | 21.1% | +4.57 | -1.50 | -18.62 | -19.97 | -16.14 | +4.7 | +4.33 | +0.12 |
| OOD | Original | 45.1% | +1.15 | +1.40 | +2.12 | +2.13 | +2.25 | +1.7 | +1.17 | +1.35 |
| OOD | Young-skewed | 13.0% | +2.93 | +3.73 | -12.96 | -12.96 | -11.38 | +0.8 | +3.08 | +0.88 |
| OOD | Old-skewed | 16.0% | +8.47 | -5.64 | -15.97 | -15.97 | -15.97 | +0.1 | +7.91 | +1.01 |
| OOD | Bimodal | 20.8% | +5.91 | +0.09 | -16.14 | -17.27 | -15.20 | +6.3 | +5.69 | +1.13 |
Prevalence estimation bias in percentage points (pp) under synthetic age distribution shift. Raw = uncalibrated averaging, CC = Classify & Count, RG = Rogan-Gladen, IPW = importance-weighted prevalence estimation, Iso. = Isotonic regression. Bootstrap RMSE (200 iterations) closely tracks absolute bias in all scenarios.
Table S2: CAP Law & Crime Prevalence Estimation Bias (Claude Opus 4.6)
| Scenario | Shift Type | True Prev. | CC | RG | SLD | IPW | Iso. | MC (binary) | MC (scores) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | None | 7.9% | +2.2 | +0.5 | +7.4 | +0.1 | +0.1 | +0.1 | +0.2 |
| Country shift | Within-cal. | 8.4% | +3.3 | +1.7 | +9.5 | +0.1 | +0.9 | +0.4 | +0.4 |
| Doc-type shift | Within-cal. | 6.3% | +1.6 | -0.3 | +4.9 | +0.1 | +0.0 | -0.0 | +0.1 |
| Spain media | OOD doc type | 19.5% | +3.6 | +3.1 | +21.6 | -12.2 | -2.5 | -1.9 | -4.5 |
| Belgium TV | OOD doc type | 11.1% | +4.8 | +3.7 | +13.7 | -4.5 | +2.6 | +0.7 | +1.6 |
CC = Classify & Count (fraction of Yes labels); RG = Rogan-Gladen adjustment on binary labels; SLD = Saerens-Latinne-Decaestecker (label shift, applied to probability scores); IPW = importance-weighted estimation (target-specific density ratio); Iso. = isotonic regression on probability scores; MC (binary) = MCGrad on binary labels with base-rate initialization; MC (scores) = MCGrad on probability scores.
S4. Simulation: RMSE
Figure S2: Root mean squared error (RMSE) under covariate shift for the same four methods shown in Figure 1, averaged over 50 simulation runs. RMSE closely tracks absolute bias for all methods, confirming that variance is small relative to bias at this sample size. MCGrad maintains the lowest RMSE across all shift levels.
S5. Simulation: All Methods
Figure S3: Simulation bias curves for all seven methods. Rogan-Gladen and PACC exhibit catastrophic failure (bias exceeding -200% at extreme shifts). SLD shows large bias under covariate shift because it assumes label shift. Uncalibrated averaging shows moderate bias. MCGrad maintains near-zero bias throughout.
S6. Claude Opus Score Distribution
Figure S4: Claude Opus 4.6 P(Yes) score distribution by label across six CAP sub-populations. Scores are well-separated (mean 0.75 for positives vs. 0.07 for negatives) with 43 unique values and no boundary mass.
References
González, Pablo, Alberto Castaño, Nitesh V. Chawla, and Juan José Del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50 (5): 1–40.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783.
Rogan, Walter J., and Beth Gladen. 1978. “Estimating Prevalence from the Results of a Screening Test.” American Journal of Epidemiology 107 (1): 71–76.
Saerens, Marco, Patrice Latinne, and Christine Decaestecker. 2002. “Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure.” Neural Computation 14 (1): 21–41.
Tax, Niek, Lorenzo Perini, Fridolin Linder, et al. 2026. “MCGrad: Multicalibration at Web Scale.” Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Tian, Katherine, Eric Mitchell, Allan Zhou, et al. 2023. “Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).