Unbiased Prevalence Estimation with Multicalibrated LLMs

Fridolin Linder (1), Thomas Leeper (1), Daniel Haimovich (1), Niek Tax (1), Lorenzo Perini (1), Milan Vojnovic (1, 2)

(1) Meta Platforms Inc.; (2) The London School of Economics and Political Science

Corresponding author: Fridolin Linder (flinder@meta.com)

Classification: Social Sciences / Political Science; Physical Sciences / Statistics

Keywords: multicalibration | large language models | prevalence estimation | covariate shift | quantification

Significance

Large language models are increasingly used as measurement devices to estimate prevalence in populations. A critical but overlooked problem arises when the target population differs from the validation population: standard methods produce biased prevalence estimates, even when the model achieves high classification accuracy. We show that multicalibration, which requires a device to be accurate conditional on input features rather than just on average, is sufficient for unbiased prevalence estimation under covariate shift. Our theoretical and empirical results imply that the rapidly growing body of LLM-based measurement research is vulnerable to systematic bias that can be mitigated by enforcing multicalibration.

Abstract

Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee.
Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications (estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM) demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.

Introduction

Large language models (LLMs) are increasingly used as measurement devices for estimating the prevalence of a category in a population. Researchers now routinely deploy LLMs as zero-shot classifiers to estimate the frequency of phenomena that previously required expensive manual annotation: coding democracy indicators across countries (Weidmann et al. 2026), classifying protest events in news corpora (Overos et al. 2024), estimating party policy positions from manifestos across dozens of countries and languages (Benoit et al. 2026), categorizing open-ended survey responses at near-human accuracy (Mellon et al. 2024; Gilardi et al. 2023), extracting diagnostic attributes from pathology reports (Sushil et al. 2024), identifying goals-of-care discussions in clinical notes (Lee et al. 2025), annotating art forms in auction records (Tojima and Yoshida 2025), and converting qualitative text into quantitative variables across multiple languages (Karjus 2025). LLMs are also being deployed for content moderation [1] and as LLM “judges” to assess the quality and safety of AI systems (Zheng et al. 2023; Yuan et al. 2024).
[1] https://openai.com/index/using-gpt-4-for-content-moderation/

These applications are typically validated by reporting discriminative performance: accuracy, AUC, or agreement with human annotations. Researchers then use or advocate for using validated LLMs as measurement devices across new populations, time periods, or subgroups. But strong discriminative performance does not guarantee accurate prevalence estimates from the classifications provided by these devices. When the composition of the target population differs from the validation population, a model that separates positives from negatives well can still produce prevalence estimates that are substantially biased, because discrimination is invariant to the calibration errors that drive prevalence bias. This problem is often missed in standard validation practice and, as we show, can produce large bias even with strong classification performance.

Neither the confidence elicitation literature nor the quantification [2] literature solves this problem. Confidence elicitation methods (verbalized confidence (Tian et al. 2023), token log-probabilities (Kadavath et al. 2022), consistency sampling (Wang et al. 2023)) replace the LLM’s bare binary classification with a probability score, and post-hoc calibration (Oliveira et al. 2025) can further refine those scores. Both target global calibration: calibration on the population where they are evaluated. Empirical evidence confirms that such calibration does not transfer: it degrades 2–3× under language shift even when accuracy is preserved (Yang et al. 2023), and varies widely across tasks and domains (Ren et al. 2025). Quantification methods such as “Classify & Count”, “Adjusted Count”, and the Saerens-Latinne-Decaestecker (SLD) EM algorithm (see González et al. (2017) for a detailed survey) attempt to correct for classifier errors directly, but rely on the assumption that error properties remain static across populations.
Importance-weighted methods handle covariate shift in principle but require re-estimating density ratios for each new target population. None of these approaches provides a measurement device that can be validated once and then reliably applied to new populations.

We show that multicalibration (calibration conditional on the input features, not just on average) fills this gap. Building on Kim et al. (2022)'s “Universal Adaptability” result, we show that a multicalibrated device requires no target-specific estimation: it is calibrated once on source data and produces unbiased prevalence estimates on any target population whose features lie within the calibration support. This property is critical for LLM-based measurement, where a single device is typically applied to many populations without the opportunity to re-calibrate for each one. Multicalibration can be applied to any LLM output, from discrete Yes/No classifications (the standard practice in applied research) to continuous confidence scores, because it operates on the scores’ relationship to outcomes conditional on features, not on the scores’ intrinsic quality. Our results apply more generally to any model-based measurement device, but the LLM setting is where the practical need is most urgent and the calibration problem most overlooked.

We review existing approaches to the quantification problem, connect the quantification task to the broader theory of domain adaptation via multicalibration, and apply those insights to two empirical applications: a controlled demonstration using a standard ML classifier to estimate employment prevalence under age distribution shift (American Community Survey), followed by the main application of classifying political texts across four countries using an LLM as a zero-shot measurement device.

Results

Standard calibration methods fail under covariate shift

Consider a binary outcome Y ∈ {0, 1}, features X, and a device h(X) ∈ [0, 1] producing probabilistic predictions.
The goal is to estimate population prevalence π = P(Y = 1) in a target population using only unlabeled target data $\{X_i\}$ and the device h. We focus on covariate shift (Storkey 2009), where P(X) changes across populations but P(Y | X) remains stable. This is the natural assumption when features causally influence outcomes (X → Y), as in many measurement applications. See Limitations for discussion of label shift (Y → X) and concept drift.

A device is globally calibrated if E[Y | h(X) = p] = p for all prediction values p. Under global calibration, E[h(X)] = E[Y] = π, so the sample mean of predictions is an unbiased prevalence estimate. However, global calibration is a property of a specific population. A device calibrated on one population need not be calibrated on another, even under covariate shift with stable P(Y | X).

[2] The task of estimating prevalence from imperfect classifiers is known as quantification in the machine learning literature; we use this term interchangeably with prevalence estimation.

The failure mechanism is as follows. Suppose the population consists of subgroups $G$ with weights $w_G$. Within each subgroup, the device may be biased: let $\epsilon_G = E[Y - h(X) \mid X \in G]$ denote the mean prediction error within group $G$. Global calibration requires only that these errors cancel on average: $\sum_G w_G \epsilon_G = 0$. Under covariate shift, the group weights change to $w^*_G$, and the bias in the prevalence estimate becomes $\sum_G w^*_G \epsilon_G$, which is generally nonzero unless $\epsilon_G = 0$ for all $G$. A device can be perfectly globally calibrated, with errors that precisely cancel in the training population, while having nonzero mean prediction error within every individual subgroup. Standard quantification methods are each vulnerable to this failure mechanism (González et al. 2017).
Classify & Count, Rogan-Gladen adjustment (Rogan and Gladen 1978), and Probabilistic Adjusted Classify & Count (PACC) all rely on error rates or conditional score means estimated from calibration data; these are weighted averages over subgroups and shift with population composition. The SLD/EMQ algorithm (Saerens et al. 2002) and more recent distribution-matching methods such as DyS and HDy (Maletzke et al. 2018) assume label shift (P(X | Y) stable) and fail under covariate shift. Even global calibration via isotonic regression, recommended by Wu and Resnick (2024) for covariate shift settings, produces biased prevalence estimates because global calibration does not guarantee feature-conditional accuracy.

Multicalibration guarantees unbiased prevalence estimation

The analysis above shows that all standard methods produce biased prevalence estimates under covariate shift, with bias $\sum_G w^*_G \epsilon_G$ that is generally nonzero unless every $\epsilon_G = 0$. Multicalibration (Hébert-Johnson et al. 2018) formalizes this requirement. A predictor $f(X)$ is multicalibrated with respect to a collection of subgroups $\mathcal{G}$ if $E[Y \mid f(X) = v, X \in G] = v$ for every $G \in \mathcal{G}$ and prediction value $v$. When $\mathcal{G}$ is rich enough to capture all relevant structure in $X$, this is equivalent to requiring calibration conditional on the full feature space: $E[Y \mid f(X) = v, X = x] = v$ for all $x$ and $v$ with sufficient probability mass. This is the framing of Kim et al. (2022)’s “Universal Adaptability”: a predictor calibrated conditional on X yields correct expected values under any reweighting of P(X), without requiring knowledge of the shift.

The connection to prevalence estimation is direct. If $f$ is calibrated conditional on $X$, then $E[f(X) \mid X = x] = E[Y \mid X = x]$ for all $x$.
Under covariate shift, where only P(X) changes while P(Y | X) remains stable, the law of iterated expectations gives, with $E^*$ and $P^*$ denoting expectations and probabilities under the target distribution and $\pi^* = P^*(Y = 1)$ the target prevalence:

$$E^*[f(X)] = \int E[f(X) \mid X = x] \, dP^*(x) = \int E[Y \mid X = x] \, dP^*(x) = E^*[Y] = \pi^*.$$

Because predictions are correct at each point in the feature space, the average of predictions tracks the true prevalence under any change in P(X).

The condition strictly necessary for this robustness is multi-accuracy: $E[f(X) - Y \mid X \in G] = 0$ for all groups $G$ that the shift can reweight. Multicalibration, which additionally requires $E[Y \mid f(X) = v, X \in G] = v$, is strictly stronger and implies multi-accuracy. Multicalibration is the preferable target for three reasons. First, practical post-hoc algorithms like MCGrad (Tax et al. 2026) [3] naturally produce multicalibrated predictors (Hébert-Johnson et al. 2018), so the stronger guarantee comes at no additional cost. Second, multicalibration covers a broader set of use cases where multi-accuracy no longer suffices and full calibration conditional on both features and score level is required: when the population shift is mediated by the device’s scores (e.g., using the device’s scores to decide which items to label, or threshold-based filtering), when prevalence is estimated within score strata, or when scores are used as inputs to downstream regressions. Third, multicalibration provides robustness against misspecification of which features drive the shift, since a predictor multicalibrated with respect to a rich class of subgroups is automatically multi-accurate for any sub-partition. Both conditions require that the calibration features capture the dimensions along which the population shift occurs (an ignorability assumption).

[3] An open-source implementation is available at https://mcgrad.dev.
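The cancellation argument and the multi-accuracy condition above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the two groups, their outcome rates, the device scores, and the population weights are hypothetical values chosen for this example, not numbers from the paper. It compares a device whose group errors cancel only at the source weights with a device that has zero mean error in every group.

```python
# Illustrative numbers only; none of these values come from the paper.
P_TRUE = {"A": 0.15, "B": 0.85}            # P(Y = 1 | group), stable under shift
H_AVERAGE_ONLY = {"A": 0.10, "B": 0.90}    # group errors cancel only at 50/50 weights
H_MULTI_ACCURATE = {"A": 0.15, "B": 0.85}  # zero mean error within every group


def weighted_error(weights, scores):
    """Return sum_G w_G * eps_G, where eps_G = E[Y - h(X) | X in G]."""
    return sum(w * (P_TRUE[g] - scores[g]) for g, w in weights.items())


SOURCE = {"A": 0.5, "B": 0.5}  # population used for calibration
TARGET = {"A": 0.9, "B": 0.1}  # shifted population with the same P(Y | group)

for name, scores in [("average-only", H_AVERAGE_ONLY),
                     ("multi-accurate", H_MULTI_ACCURATE)]:
    src = weighted_error(SOURCE, scores)
    tgt = weighted_error(TARGET, scores)
    print(f"{name}: source error {src:+.3f}, target error {tgt:+.3f}")
```

Under these assumed values the average-only device has a weighted error of zero on the source weights and of about 0.04 on the target weights, while the multi-accurate device stays at zero for any reweighting of the two groups.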
Simulation: standard methods fail, multicalibration succeeds

We illustrate the theoretical results above with a simulation. We generate data with a binary covariate X ∈ {0, 1} and binary outcome Y, where P(Y = 1 | X = 1) = 0.85 and P(Y = 1 | X = 0) = 0.15. We simulate a classifier with hardcoded, systematically biased predictions: 10% underestimation when X = 0 and 10% overestimation when X = 1. All calibration parameters are learned on a balanced training distribution (P(X = 0) = 0.5). Prevalence is estimated by averaging the predicted probabilities over the target sample. We then evaluate prevalence estimates on test distributions where P(X = 0) ranges from 0.01 to 0.99, repeating 50 times (details in Materials and Methods).

Figure 1: Prevalence estimation bias (% relative) under covariate shift, averaged over 50 simulation runs. The x-axis shows ∆P(X = 0), the change in P(X = 0) from the training value of 0.5. Classify & Count and Rogan-Gladen diverge with increasing shift; isotonic regression (global calibration) shows moderate bias; MCGrad maintains near-zero bias across all shift levels. Classify & Count and Rogan-Gladen curves are cropped at the ±40% axis limits; their bias continues to grow beyond this range, exceeding ±100% at extreme shifts (see SI Appendix Figure S3 for full range). Additional methods (PACC, SLD, uncalibrated averaging) also shown in SI Appendix Figure S3.

Figure 1 shows the results. At the training distribution (center), all methods produce approximately unbiased estimates. As the distribution shifts, the methods diverge. Rogan-Gladen is particularly unstable: its ratio structure amplifies estimation errors, with bias exceeding ±40% at extreme shifts. Classify & Count shows growing bias in the same direction. Isotonic regression (global calibration) shows more moderate but still substantial bias (up to 15% at extreme shifts). SLD and PACC exhibit comparably large failures (SI Appendix Figure S3).
The multicalibrated estimator maintains near-zero bias and the lowest RMSE across the entire range of distribution shifts (SI Appendix Figure S2), confirming that the bias reduction does not come at the cost of increased variance. Empirical application: employment prevalence under age distribution shift Before applying multicalibration to LLM-generated scores, which present additional challenges such as coarse discretization and non-standard score distributions, we first demonstrate the core mechanism in a controlled setting with a traditional machine learning classifier. We analyze data from the American Community Survey (ACS), a large-scale annual survey conducted by the U.S. Census Bureau. The concept to be measured is the rate of employment. The measurement device is a logistic regression model of binary employment status, with 16 sociodemographic features including age, education, marital status, disability status, and citizenship. Setup. We train a logistic regression classifier on data from eight U.S. states (TX, MI, PA, OH, IL, GA, NC, VA) across 2016–2018, totaling approximately 1.5 million training observations. The remaining in-distribution data is split into a calibration set (n ≈644,000) and a test set (n ≈920,000). Six additional states (CA, NY, FL, WA, AZ, CO) are held out for out-of-distribution (OOD) evaluation. We fit two post-hoc calibration methods on the calibration set: isotonic regression (global calibration) and MCGrad, a multicalibration algorithm that enforces calibration conditional on both categorical and numerical features. We compare five prevalence estimation methods (Figure 2): Classify & Count with a prevalence-matched threshold, Rogan-Gladen adjustment, importance-weighted estimation (IPW), isotonic regression, and MCGrad. Additional methods (PACC, SLD, uncalibrated averaging) are reported in SI Appendix Table S1. Age distribution shift. 
Employment rates vary dramatically by age: approximately 47% for ages 16–24, 76% for ages 25–54, 61% for ages 55–64, and 17% for ages 65+. This makes age an ideal dimension along which to construct meaningful distribution shifts. We create synthetic target populations by resampling test data with shifted age distributions: young-skewed (oversampling ages 16–30), old-skewed (oversampling ages 60+), and bimodal (oversampling both tails). The resulting populations have true employment rates ranging from 12.8% to 46.0%. All calibration parameters are estimated once on the original calibration set and held fixed across scenarios.

Figure 2: Prevalence estimation |bias| (percentage points) under synthetic age distribution shift, for in-distribution states (left) and out-of-distribution states (right). Each marker shape represents a different age shift scenario; horizontal lines show the mean across scenarios. Full numerical results in SI Appendix Table S1.

Results. With no age shift, all methods produce approximately unbiased estimates (Figure 2). Under shift, the methods diverge sharply. Rogan-Gladen fails severely (12–19pp bias under age shift). SLD, designed for label shift rather than covariate shift, shows comparably large bias (SI Appendix Table S1). Classify & Count and isotonic regression show moderate but growing bias (up to 8pp under age shift). IPW performs well on simple shifts (≤1.2pp) but fails on the bimodal shift (+4.7pp in-distribution, +6.3pp OOD) where the density ratio is hard to model. MCGrad produces near-zero bias across all in-distribution scenarios (≤0.27pp), including the bimodal shift where IPW struggles. In the OOD setting (held-out states with age shift), MCGrad’s bias increases modestly (0.88–1.35pp), reflecting geographic shift along an uncalibrated dimension. Even so, MCGrad maintains the lowest bias and RMSE across all scenarios.
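The age-shifted target populations described above are built by weighted resampling, and Materials and Methods states that the weights are exponential in age. A minimal sketch of that idea follows. The exact weight function, the `rate` value, the centering on the mean age, and the toy age data are all assumptions made for illustration; the paper's own implementation is in its code repository and may differ.

```python
import math
import random


def age_shift_weights(ages, rate, mode):
    """Exponential resampling weights in age (hypothetical functional form)."""
    center = sum(ages) / len(ages)  # centering keeps the exponents small
    if mode == "young":
        return [math.exp(-rate * (a - center)) for a in ages]
    if mode == "old":
        return [math.exp(rate * (a - center)) for a in ages]
    if mode == "bimodal":
        return [math.exp(rate * abs(a - center)) for a in ages]
    raise ValueError(f"unknown mode: {mode}")


def resample_indices(ages, rate, mode, n_draws, seed=0):
    """Draw row indices with replacement, proportional to the age weights."""
    rng = random.Random(seed)
    weights = age_shift_weights(ages, rate, mode)
    return rng.choices(range(len(ages)), weights=weights, k=n_draws)


# Toy stand-in for the test set: ages drawn uniformly from 16 to 90.
_rng = random.Random(1)
toy_ages = [_rng.randint(16, 90) for _ in range(5000)]


def mean_age(indices):
    return sum(toy_ages[i] for i in indices) / len(indices)


young = resample_indices(toy_ages, rate=0.05, mode="young", n_draws=5000)
old = resample_indices(toy_ages, rate=0.05, mode="old", n_draws=5000)
print(round(mean_age(young), 1), round(mean_age(old), 1))
```

Because only the sampling weights depend on age, each resampled row keeps its original outcome and features, which is what makes the construction a pure covariate shift.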
LLM-based topic classification under cross-national shift The ACS application uses a standard machine learning classifier with well-distributed probability estimates. We now examine the setting that more immediately motivates this paper: using an LLM as a zero-shot measurement device across multiple countries and languages. We compare two LLM output modes that reflect current practice and the confidence elicitation literature, respectively: (1) discrete Yes/No classifications, which is how LLMs are used in all of the applied studies cited in the introduction, and (2) probability scores obtained via direct probability elicitation, where the LLM estimates P(Yes) and P(No) without first committing to an answer. This design lets us test whether confidence elicitation improves prevalence estimation, and whether multicalibration is needed in either case. Setup. We use Claude Opus 4.6 to classify 30,000 political texts from the Comparative Agendas Project (Baumgartner et al. 2006) as related to Law & Crime (CAP major topic code 12). The data comprises six sub-populations across four countries and four languages (5,000 documents each): Danish parliamentary questions, Spanish oral questions, U.S. congressional bills, Belgian newspaper articles, Spanish media articles, and Belgian TV news. Each document is classified twice in independent campaigns: once for a binary Yes/No label, and once for direct probability estimates P(Yes) and P(No) (summing to 1.0). The LLM achieves strong discriminative performance: AUC 0.960 for binary labels and 0.987 for probability scores on the in-distribution test set, with per-language AUCs of 0.983–0.994 for the probability scores. The calibration set (n ≈13,400) draws equally from four sub-populations (Denmark questions, Spain questions, U.S. bills, Belgium newspaper), ensuring that all four countries and languages are represented. 
Two sub-populations are held out as out-of-distribution targets: Spanish media and Belgian TV news, which share country and language with calibration data but introduce an unseen document type. MCGrad is calibrated with categorical features (country, document type, party) and two numerical features (decade, document length). For the binary-label condition, MCGrad receives the LLM’s Yes/No classification as a categorical input feature, with all initial scores set to the calibration-set base rate. MCGrad then learns feature-conditional prevalence estimates from the label and metadata alone, without any probability score from the LLM. For the probability-score condition, MCGrad receives the LLM’s P(Yes) directly as the input score and calibrates it conditional on the same metadata features.

Results. Figure 3 shows prevalence estimation bias across five scenarios: a baseline with no shift, two within-calibration shifts (country composition, document type composition), and two OOD scenarios.

Figure 3: Prevalence estimation |bias| (percentage points) across the shift gradient. Each marker shape represents a different shift scenario; horizontal lines show the mean across scenarios. MCGrad achieves near-zero bias within the calibration distribution in both binary-label and probability-score conditions. Full numerical results in SI Appendix Table S2.

The failure pattern is clearly visible: Classify & Count, the standard practice of counting positive LLM classifications, produces bias of +1.6 to +4.8pp across all scenarios. The Rogan-Gladen adjustment reduces bias within calibration but still shows +3.1 to +3.7pp on OOD populations. Isotonic regression shows moderate bias within calibration (≤0.9pp) but larger OOD bias (2.5–2.6pp). SLD, designed for label shift, shows very large bias (+5 to +22pp; SI Appendix Table S2).
MCGrad on binary labels achieves near-zero bias within the calibration distribution (≤0.4pp) and degrades gracefully on OOD targets: +0.7pp on Belgian TV and -1.9pp on Spanish media. This is achieved using only the LLM’s Yes/No classification and document metadata, with no probability scores. MCGrad on probability scores achieves comparable within-calibration performance (≤0.4pp) but shows larger OOD bias on Spanish media (-4.5pp) and Belgian TV (+1.6pp). The finding that binary labels with MCGrad can match or outperform probability scores with MCGrad suggests that the metadata features, not the input score quality, drive the calibration improvement. IPW, the standard covariate-shift method, achieves near-zero within-calibration bias but fails severely on OOD populations (-12.2pp on Spanish media) because the target contains feature values absent from the source (a positivity violation). Unlike MCGrad, IPW also requires re-estimating density ratios for each target population. Comparison with the ACS application. Across both applications, the pattern is consistent: MCGrad achieves near-zero bias when the target population’s features are within the calibration support, and degrades when the shift is along an uncalibrated dimension. IPW matches MCGrad within calibration but fails more severely on OOD targets. MCGrad’s practical advantage is that it requires no target-specific estimation: a single calibrated device can be applied to any target population. This is the “universal adaptability” property of Kim et al. (2022). We replicate the CAP analysis using Llama 3.3 70B Instruct (an open-weight model) in SI Appendix S2; the results are consistent, confirming that the findings are not specific to a particular LLM. Discussion We have shown that multicalibration—calibration conditional on the input features rather than just on average—is sufficient for accurate model-based prevalence estimation under population shift. 
The minimal theoretical requirement is multi-accuracy (correct predictions on average within each subgroup), which multicalibration implies. Both conditions require that the calibration features capture the relevant dimensions of population shift, and both require that the target population’s features lie within the support of the calibration distribution. Both empirical applications confirm this: when calibrated features overlap with the target population, MCGrad achieves near-zero bias; when the target introduces a novel feature value, bias increases with the severity of the shift. Practitioners should ensure their calibration data covers the key dimensions along which target populations may differ.

A central practical advantage of multicalibration is that it produces a target-independent measurement device. Unlike importance-weighted methods, which require re-estimating density ratios for each new target and fail under positivity violations (IPW: -12.2pp on CAP Spanish media, vs. MCGrad: -1.9pp), a multicalibrated device is calibrated once and deployed without target-specific re-estimation. This is the “universal adaptability” property of Kim et al. (2022). Our CAP results further show that MCGrad on discrete binary labels achieves comparable or better prevalence estimation than MCGrad on continuous probability scores, suggesting that metadata features matter more than input score quality for prevalence estimation under shift. This has immediate practical implications: researchers using the standard workflow of prompting an LLM for Yes/No classifications can apply multicalibration directly to those labels, without needing confidence elicitation. Multicalibration does not require access to a model’s internals: observable metadata (document source, language, text length) can serve as segment features (Detommaso et al. 2024). Researchers who currently validate by reporting accuracy or AUC (Grimmer et al.
2022) should be aware that these metrics provide no information about the calibration errors that drive prevalence bias. Several limitations warrant discussion. First, the guarantee holds only for shifts along calibrated dimensions; out-of-domain performance degrades when target populations introduce novel feature values. Second, multicalibration requires labeled calibration data of sufficient size. Our two applications show MCGrad performing well with both large (644K, ACS) and moderate (13.4K, CAP) calibration sets, but the minimum required depends on the complexity of the feature space. In zero-shot LLM settings, calibration labels may require the manual annotation LLMs were intended to avoid; prediction-powered inference (Angelopoulos et al. 2023) offers a complementary framework. Third, our framework assumes covariate shift (P(Y | X) stable); under concept drift, no purely statistical correction substitutes for new labeled data. Finally, both applications use settings where ground-truth labels enable direct bias measurement; in practice, practitioners may lack such labels. These results connect two literatures that have developed in isolation. The quantification literature has focused on correcting aggregate error rates but has not engaged with feature-conditional calibration (González et al. 2017; Wu and Resnick 2024). The multicalibration literature has focused on individual-level prediction quality and fairness but has not emphasized the implications for population-level inference (Hébert-Johnson et al. 2018; Kim et al. 2022). Our contribution is to show that calibration conditional on the feature space is what makes LLM-based prevalence estimation reliable under the distribution shifts that motivate its use. Materials and Methods Simulation. We generate n = 10,000 observations with a binary covariate X ∈{0, 1} and outcome Y ∼Bernoulli(P(Y | X)), where P(Y = 1 | X = 1) = 0.85 and P(Y = 1 | X = 0) = 0.15. 
The classifier produces deterministic scores $\hat{p}(X = 0) = 0.135$ and $\hat{p}(X = 1) = 0.935$, representing 10% multiplicative bias within each stratum. For each of B = 50 iterations, we generate fresh calibration data from P(X = 0) = 0.5, estimate all method-specific parameters, then evaluate bias and mean squared error on 20 test distributions with P(X = 0) ranging from 0.01 to 0.99. The seven methods compared (uncalibrated averaging, Classify & Count with a prevalence-matched threshold, Rogan-Gladen adjustment, PACC, SLD/EMQ, global multiplicative calibration, and multicalibration with stratum-specific additive corrections) are detailed with full mathematical definitions in the SI Appendix.

Empirical application (ACS). We use the ACS Public Use Microdata Sample via the folktables package, with the ACSEmployment prediction task (binary: employed vs. not employed) and 16 sociodemographic features. Training data comprises eight states (TX, MI, PA, OH, IL, GA, NC, VA) across 2016–2018. The base model is logistic regression with standard scaling. Post-hoc calibration uses isotonic regression (global) and MCGrad (multicalibration with categorical and numerical segment features) on a held-out calibration set (n ≈ 644,000). In addition to Classify & Count, Rogan-Gladen, isotonic regression, and MCGrad, we include importance-weighted prevalence estimation (IPW), which estimates density ratios between source and target via logistic regression on all 16 features, and two methods from the quantification literature: PACC (Probabilistic Adjusted Classify & Count) (González et al. 2017) and SLD (Saerens-Latinne-Decaestecker) (Saerens et al. 2002). Uncalibrated averaging is also reported in SI Appendix Table S1. Synthetic age-shifted populations are created by importance-weighted resampling of the test set (n ≈ 920,000), with exponential weights favoring young ages, old ages, or both extremes. RMSE is computed via bootstrap resampling (200 iterations per scenario).
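The bootstrap RMSE computation mentioned above can be sketched as follows. This is one plausible reading, not the authors' code: it assumes the estimator is the mean of the scores, that each bootstrap estimate is compared against the full-sample label mean, and it uses synthetic toy scores. The function name `bootstrap_rmse` and all data values are hypothetical.

```python
import math
import random


def bootstrap_rmse(scores, labels, n_boot=200, seed=0):
    """RMSE of the mean-score prevalence estimate over bootstrap resamples.

    Assumption: the reference value is the label mean of the full sample.
    """
    rng = random.Random(seed)
    n = len(scores)
    truth = sum(labels) / n
    squared_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        estimate = sum(scores[i] for i in idx) / n
        squared_errors.append((estimate - truth) ** 2)
    return math.sqrt(sum(squared_errors) / n_boot)


# Toy data: about 30% positives; the scores are synthetic stand-ins.
_rng = random.Random(0)
toy_labels = [1 if _rng.random() < 0.3 else 0 for _ in range(2000)]
good_scores = [0.8 if y else 0.1 for y in toy_labels]  # nearly unbiased on average
biased_scores = [s + 0.1 for s in good_scores]         # shifted up by 0.1

rmse_good = bootstrap_rmse(good_scores, toy_labels)
rmse_biased = bootstrap_rmse(biased_scores, toy_labels)
print(round(rmse_good, 4), round(rmse_biased, 4))
```

RMSE combines squared bias and sampling variance, so a method can only achieve a low value if it is both close to unbiased and stable across resamples.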
Six additional states (CA, NY, FL, WA, AZ, CO) serve as OOD evaluation data. Empirical application (CAP). We use data from the Comparative Agendas Project (Baumgartner et al. 2006), which provides expert-coded policy topic labels for political texts across countries. The binary outcome is whether a document addresses Law & Crime (CAP major topic code 12). We sample 30,000 documents (5,000 from each of six sub-populations: Danish parliamentary questions, Spanish oral questions, U.S. congressional bills, Belgian newspaper articles, Spanish media articles, and Belgian TV news). The measurement device is Claude Opus 4.6. Each document is classified in two independent campaigns: (1) binary Yes/No classification using the CAP codebook definition of Law & Crime, and (2) direct probability elicitation, where the LLM estimates P(Yes) and P(No) without first committing to an answer. The two campaigns were run independently to avoid anchoring contamination. The calibration set (n ≈13,400) draws equally from four sub-populations (Denmark questions, Spain questions, U.S. bills, and Belgium newspaper). MCGrad is calibrated with categorical features (country, document type, political party) and two numerical features (decade and document length). For the binary-label condition, the LLM’s Yes/No classification is passed to MCGrad as a categorical input feature, with initial scores set to the calibration-set base rate; MCGrad learns feature-conditional prevalence estimates from the metadata and LLM label alone. For the probability-score condition, MCGrad receives P(Yes) as the input score. Classify & Count reports the fraction of positive LLM classifications. Rogan-Gladen adjustment corrects this fraction using binary-label TPR and FPR estimated on the calibration set. Isotonic regression is fitted on probability scores using the same calibration set. IPW estimates density ratios via logistic regression on country, document type, decade, and document length. 
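The two count-based baselines described above can be written in a few lines. The sketch uses the standard Rogan-Gladen form, (observed positive fraction − FPR) / (TPR − FPR). The function names, the optional clipping to [0, 1], and the toy labels are assumptions for illustration and are not taken from the paper's code.

```python
def classify_and_count(predicted):
    """Fraction of items the device labels positive (labels are 0 or 1)."""
    return sum(predicted) / len(predicted)


def estimate_rates(predicted, actual):
    """True and false positive rates from a labeled calibration set."""
    on_positives = [p for p, y in zip(predicted, actual) if y == 1]
    on_negatives = [p for p, y in zip(predicted, actual) if y == 0]
    tpr = sum(on_positives) / len(on_positives)
    fpr = sum(on_negatives) / len(on_negatives)
    return tpr, fpr


def rogan_gladen(predicted, tpr, fpr, clip=True):
    """Correct the raw positive fraction for known error rates."""
    if tpr <= fpr:
        raise ValueError("adjustment needs TPR > FPR")
    adjusted = (classify_and_count(predicted) - fpr) / (tpr - fpr)
    # Clipping is a convenience chosen here; leave it off to see raw values.
    return min(1.0, max(0.0, adjusted)) if clip else adjusted


# Toy calibration set: 100 positives (90 flagged) and 100 negatives (20 flagged).
calib_actual = [1] * 100 + [0] * 100
calib_pred = [1] * 90 + [0] * 10 + [1] * 20 + [0] * 80
tpr, fpr = estimate_rates(calib_pred, calib_actual)

# Toy target with true prevalence 0.30 and the same error rates:
# 27 of 30 positives and 14 of 70 negatives are flagged, 41 in total.
target_pred = [1] * 41 + [0] * 59
print(classify_and_count(target_pred), rogan_gladen(target_pred, tpr, fpr))
```

In this toy case the adjustment recovers the true prevalence only because the target shares the calibration set's TPR and FPR. When those rates differ across subgroups and the subgroup mix changes, the same correction is applied with the wrong rates, which is the failure the paper documents.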
SLD results are reported in SI Appendix Table S2. Two sub-populations are held out as out-of-distribution targets: Spanish media and Belgian TV news, which differ from the calibration data only in document type while sharing the same countries and languages.

Software. MCGrad is available at https://github.com/facebookincubator/MCGrad. Simulation and analysis code are available at https://github.com/facebookresearch/multicalibrated_llm_measurement.

Competing Interests

The authors declare no competing interests.

Disclosure of Delegation to Generative AI

The authors declare the use of generative AI in the research and writing process. According to the GAIDeT taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision:

• Literature search and systematization
• Code generation
• Code optimization
• Data collection
• Data cleaning
• Data analysis
• Visualization
• Reproducibility testing
• Text generation
• Proofreading and editing
• Reformatting
• Identification of limitations

The GAI tools used were: Claude Opus 4.6, Claude Opus 4.7, Gemini 3 Pro. Responsibility for the final manuscript lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes. Declaration submitted by: Fridolin Linder. AI was involved for the listed tasks but no task was exclusively done by AI. All outputs were manually verified and iterated on by the authors.

Data Availability

The American Community Survey data is publicly available via the folktables package. The Comparative Agendas Project data is publicly available at https://www.comparativeagendas.net. Simulation and analysis code are available at https://github.com/facebookresearch/multicalibrated_llm_measurement.

References

Angelopoulos, Anastasios N, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. 2023. “Prediction-Powered Inference.” Science 382 (6671): 669–74.
Baumgartner, Frank R., Christoffer Green-Pedersen, and Bryan D. Jones. 2006. “Comparative Studies of Policy Agendas.” Journal of European Public Policy 13 (7): 959–74.
Benoit, Kenneth, Scott De Marchi, Conor Laver, Michael Laver, and Jinshuai Ma. 2026. “Using Large Language Models to Analyze Political Texts Through Natural Language Understanding.” American Journal of Political Science, ahead of print. https://doi.org/10.1111/ajps.70050.
Detommaso, Gianluca, Martin Bertran, Riccardo Fogliato, and Aaron Roth. 2024. “Multicalibration for Confidence Scoring in LLMs.” Proceedings of the 41st International Conference on Machine Learning (ICML).
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences 120 (30): e2305016120.
González, Pablo, Alberto Castaño, Nitesh V. Chawla, and Juan José Del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50 (5): 1–40.
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Hébert-Johnson, Ursula, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. 2018. “Multicalibration: Calibration for the (Computationally-Identifiable) Masses.” Proceedings of the 35th International Conference on Machine Learning (ICML), 1939–48.
Kadavath, Saurav, Tom Conerly, Amanda Askell, et al. 2022. “Language Models (Mostly) Know What They Know.” arXiv Preprint arXiv:2207.05221.
Karjus, Andres. 2025. “Machine-Assisted Quantitizing Designs: Augmenting Humanities and Social Sciences with Artificial Intelligence.” Humanities & Social Sciences Communications.
Kim, Michael P., Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. 2022. “Universal Adaptability: Target-Independent Inference That Competes with Propensity Scoring.” Proceedings of the National Academy of Sciences 119 (4): e2108097119.
Lee, Robert Y, Kevin S Li, James Sibley, et al. 2025. “Assessment of a Zero-Shot Large Language Model in Measuring Documented Goals-of-Care Discussions.” Journal of Pain and Symptom Management.
Maletzke, André G., Denis M. dos Reis, Everton A. Cherman, and Gustavo E. A. P. A. Batista. 2018. “On the Need of Class Ratio Insensitive Drift Tests for Data Streams.” Proceedings of the 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications (LIDTA), 110–24.
Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, Marta Miori, and Phillip Schmedeman. 2024. “Do AIs Know What the Most Important Issue Is? Using Language Models to Code Open-Text Social Survey Responses at Scale.” Research & Politics 11 (1): 1–7.
Oliveira, Rodrigo de, Matthew Garber, James M. Gwinnutt, et al. 2025. “A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Natural Language Processing.” JAMIA Open 8 (4).
Overos, Henry David, Roman Hlatky, Ojashwi Pathak, et al. 2024. “Coding with the Machines: Machine-Assisted Coding of Rare Event Data.” PNAS Nexus 3 (5): pgae165.
Ren, Kevin, Santiago Cortes-Gomez, Carlos Miguel Patiño, et al. 2025. “Predicting Language Models’ Success at Zero-Shot Probabilistic Prediction.” Findings of the Association for Computational Linguistics: EMNLP 2025, 18337–63.
Rogan, Walter J., and Beth Gladen. 1978. “Estimating Prevalence from the Results of a Screening Test.” American Journal of Epidemiology 107 (1): 71–76.
Saerens, Marco, Patrice Latinne, and Christine Decaestecker. 2002. “Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure.” Neural Computation 14 (1): 21–41.
Storkey, Amos. 2009. “When Training and Test Sets Are Different: Characterizing Learning Transfer.” In Dataset Shift in Machine Learning. MIT Press.
Sushil, Madhumita, Travis Zack, Divneet Mandair, et al. 2024. “A Comparative Study of Large Language Model-Based Zero-Shot Inference and Task-Specific Supervised Classification of Breast Cancer Pathology Reports.” Journal of the American Medical Informatics Association 31 (10): 2315–27.
Tax, Niek, Lorenzo Perini, Fridolin Linder, et al. 2026. “MCGrad: Multicalibration at Web Scale.” Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Tian, Katherine, Eric Mitchell, Allan Zhou, et al. 2023. “Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Tojima, Tatsuya, and Mitsuo Yoshida. 2025. “Zero-Shot Classification of Art with Large Language Models.” IEEE Access 13: 17426–39.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” Proceedings of the 11th International Conference on Learning Representations (ICLR).
Weidmann, Nils B., Mats Faulborn, and David Garcia. 2026. “Large Language Models Are Democracy Coders with Attitudes.” PS: Political Science & Politics 59 (1): 17–23.
Wu, Siqi, and Paul Resnick. 2024. “Calibrate-Extrapolate: Rethinking Prevalence Estimation with Black Box Classifiers.” Proceedings of the International AAAI Conference on Web and Social Media (ICWSM) 18: 1634–47.
Yang, Yahan, Soham Dan, Dan Roth, and Insup Lee. 2023. “On the Calibration of Multilingual Question Answering LLMs.” arXiv Preprint arXiv:2311.08669.
Yuan, Weizhe, Richard Yuanzhe Pang, Kyunghyun Cho, et al. 2024. “Self-Rewarding Language Models.” Proceedings of the 41st International Conference on Machine Learning (ICML).
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems (NeurIPS).

SI Appendix: Unbiased Prevalence Estimation with Multicalibrated LLMs

SI Appendix S1. Formal Definitions of Prevalence Estimation Methods

This section provides full mathematical definitions of the seven prevalence estimation methods compared in the simulation study.

Setup. Let $h(X) \in [0, 1]$ denote the device’s probabilistic prediction for input $X$, with true label $Y \in \{0, 1\}$. The goal is to estimate the target prevalence $\pi^* = P^*(Y = 1)$ using only unlabeled target data $\{X_i^*\}_{i=1}^{n}$ and calibration parameters estimated from a labeled source dataset.

S1.1 Uncalibrated Averaging

$$\hat{\pi}_{\text{raw}} = \frac{1}{n} \sum_{i=1}^{n} h(X_i^*)$$

S1.2 Classify & Count

Given a threshold $\tau$ chosen on calibration data:

$$\hat{\pi}_{\text{CC}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i^*) \geq \tau]$$

In the simulation, $\tau$ is chosen so that $\hat{\pi}_{\text{CC}}$ matches the true prevalence on the calibration set.

S1.3 Rogan-Gladen (Adjusted Count)

$$\hat{\pi}_{\text{RG}} = \frac{\hat{\pi}_{\text{CC}} - \widehat{\text{FPR}}}{\widehat{\text{TPR}} - \widehat{\text{FPR}}}$$

where $\widehat{\text{TPR}}$ and $\widehat{\text{FPR}}$ are estimated from calibration data at threshold $\tau$ (Rogan and Gladen 1978).

S1.4 Probabilistic Adjusted Classify & Count (PACC)

$$\hat{\pi}_{\text{PACC}} = \frac{\bar{h} - \hat{\mu}_0}{\hat{\mu}_1 - \hat{\mu}_0}$$

where $\bar{h} = \frac{1}{n} \sum_i h(X_i^*)$, $\hat{\mu}_1 = \hat{E}[h(X) \mid Y = 1]$, and $\hat{\mu}_0 = \hat{E}[h(X) \mid Y = 0]$ are estimated from calibration data (González et al. 2017).

S1.5 SLD (EMQ)

The Saerens-Latinne-Decaestecker algorithm iterates:

1. Initialize $\hat{\pi}^{(0)}$ from the source prevalence.
2. E-step: Adjust posteriors for the new prior:

$$\tilde{h}_i^{(t)} = \frac{(\hat{\pi}^{(t)} / \pi_s) \cdot h(X_i^*)}{(\hat{\pi}^{(t)} / \pi_s) \cdot h(X_i^*) + ((1 - \hat{\pi}^{(t)}) / (1 - \pi_s)) \cdot (1 - h(X_i^*))}$$

3. M-step: $\hat{\pi}^{(t+1)} = \frac{1}{n} \sum_i \tilde{h}_i^{(t)}$
4. Repeat until convergence (Saerens et al. 2002).

S1.6 Global Calibration

In the simulation, global calibration applies a multiplicative correction:

$$h_{\text{cal}}(X) = c \cdot h(X)$$

where $c = \bar{Y}_{\text{cal}} / \bar{h}_{\text{cal}}$ is estimated on calibration data. In the empirical applications, global calibration uses isotonic regression.
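The definitions in S1.1 through S1.6 translate fairly directly into code. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the convergence tolerance, and the iteration cap are illustrative assumptions, and the SLD routine assumes scores strictly between 0 and 1 and a source prevalence strictly between 0 and 1.

```python
def raw_average(scores):
    """S1.1: mean of the device scores on the target sample."""
    return sum(scores) / len(scores)


def classify_and_count(scores, tau):
    """S1.2: fraction of target scores at or above the threshold tau."""
    return sum(s >= tau for s in scores) / len(scores)


def rogan_gladen(pi_cc, tpr, fpr):
    """S1.3: adjusted count; no clipping, so the result can leave [0, 1]."""
    return (pi_cc - fpr) / (tpr - fpr)


def pacc(scores, mu0, mu1):
    """S1.4: mu0 and mu1 are class-conditional mean scores from calibration data."""
    return (raw_average(scores) - mu0) / (mu1 - mu0)


def sld(scores, source_prev, max_iter=1000, tol=1e-10):
    """S1.5: EM iteration; max_iter and tol are assumed stopping rules."""
    pi = source_prev
    for _ in range(max_iter):
        r1 = pi / source_prev                # prior ratio for the positive class
        r0 = (1 - pi) / (1 - source_prev)    # prior ratio for the negative class
        adjusted = [r1 * s / (r1 * s + r0 * (1 - s)) for s in scores]  # E-step
        new_pi = sum(adjusted) / len(adjusted)                          # M-step
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    return pi


def global_factor(cal_scores, cal_labels):
    """S1.6: multiplicative factor c = mean(Y_cal) / mean(h_cal)."""
    return raw_average(cal_labels) / raw_average(cal_scores)


# Toy inputs, chosen only to exercise the functions.
toy_scores = [0.1, 0.6, 0.9, 0.4]
pi_cc = classify_and_count(toy_scores, tau=0.5)
pi_rg = rogan_gladen(0.4, tpr=0.8, fpr=0.1)
pi_pacc = pacc(toy_scores, mu0=0.2, mu1=0.8)
pi_sld = sld([0.5, 0.5], source_prev=0.5)
```

Each function consumes only unlabeled target scores plus parameters fitted on calibration data, which is why none of them can react to a change in within-stratum error once the target composition shifts.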
S1.7 Multicalibration

In the simulation, which has a single binary covariate, multicalibration reduces to stratum-specific additive corrections:

$$h_{\text{mc}}(X) = h(X) + \hat{\epsilon}_g \quad \text{for } X \in \text{stratum } g$$

where $\hat{\epsilon}_g = \bar{Y}_g - \bar{h}_g$ is estimated on calibration data within each stratum. In the empirical applications, we use MCGrad (Tax et al. 2026), a multicalibration algorithm based on gradient boosting. MCGrad operates in logit space: given a base predictor $f_0(X)$ with logit $F_0(X) = \text{logit}(f_0(X))$, it iteratively fits gradient boosted decision trees (GBDTs) on the residuals between labels and current predictions. At each round $t$, a GBDT $g_t$ is trained with the current logit predictions as init_score and with the feature matrix consisting of the segment features (categorical and numerical) augmented by the current logit prediction as an additional input feature. The logit predictor is then updated as $F_{t+1}(X) = \alpha_t \cdot (F_t(X) + g_t(X))$, where $\alpha_t$ is an unshrinkage factor estimated by logistic regression to counteract the GBDT’s learning rate. By including the prediction as a feature, GBDT splits naturally discover miscalibrated regions in the joint space of features and score levels, thereby approximating multicalibration without requiring explicit group specification. Early stopping on a validation set prevents overfitting. MCGrad uses LightGBM as the GBDT implementation. See Tax et al. (2026) for convergence results and deployment details.

S2. Robustness: Replication with Open-Weight LLM (Llama 3.3 70B)

The main text reports results using Claude Opus 4.6 as the LLM measurement device. To verify that the findings are not specific to a particular model, we replicate the CAP analysis using Llama 3.3 70B Instruct (Grattafiori et al. 2024) (4-bit NF4 quantized, run on a single A100 80GB GPU). This section reports results using two score extraction methods: token log-probabilities and verbalized confidence elicitation.

S2.1 Score Extraction Methods

Log-probabilities.
For each document, the model is prompted with the CAP codebook definition of Law & Crime and asked to respond Yes or No. The score is extracted from next-token log-probabilities: $h(X) = P(\text{Yes}) / (P(\text{Yes}) + P(\text{No}))$. This produces highly bimodal scores: 23% of the 105,000 documents score at exactly 0.0 or 1.0, and only 6% fall in the mid-range [0.1, 0.9]. Because MCGrad’s internal logit transform maps values near 0 and 1 to ±∞, a linear squashing transformation $h'(X) = \epsilon + (1 - 2\epsilon) \cdot h(X)$ with $\epsilon = 0.05$ is applied before fitting MCGrad.

Verbalized confidence (2-stage). A two-stage dialogue first asks the model to classify the document (Yes/No), then asks it to estimate the probability that its answer is correct, with an anti-certainty instruction (“Note: very few things are 0% or 100% certain”) to discourage degenerate outputs (Tian et al. 2023). The score is P(correct) if the answer is Yes and 1 − P(correct) if No. This produces scores in [0.01, 0.99] with negligible boundary mass and 11 unique score values. No squashing is required.

S2.2 Data and Calibration

The Llama analysis uses the full 105,000-document sample (15,000 per sub-population for Denmark questions, Spain questions, U.S. bills, and Belgium newspaper; 30,000 for Spanish media; 15,000 for Belgian TV). The calibration set (n ≈ 40,000) is drawn equally from the four in-distribution sub-populations. MCGrad is calibrated with categorical features (country, document type, party) and one numerical feature (decade).

S2.3 Results: Verbalized Confidence Scores

Table S3 shows prevalence estimation bias using Llama 3.3 70B with verbalized confidence scores.

| Scenario | Shift Type | True Prev. | CC | RG | IPW | Iso. | MCGrad |
|---|---|---|---|---|---|---|---|
| Baseline | None | 8.1% | +14.7 | +0.6 | +0.1 | +0.2 | +0.2 |
| Country shift | Within-cal. | 8.7% | +15.6 | +2.2 | -0.3 | +1.3 | +0.1 |
| Doc-type shift | Within-cal. | 6.5% | +16.0 | +1.7 | +0.0 | +0.7 | +0.1 |
| Spain media | OOD doc type | 19.3% | +15.4 | +6.6 | -9.8 | -7.2 | -4.9 |
| Belgium TV | OOD doc type | 11.1% | +13.3 | -0.0 | -3.5 | -1.6 | -3.4 |

Table S3: Prevalence estimation bias (pp) for Law & Crime topic using Llama 3.3 70B with verbalized confidence scores. CC = Classify & Count, RG = Rogan-Gladen, IPW = importance-weighted estimation, Iso. = isotonic regression.

The pattern is consistent with the main text’s Claude Opus results: MCGrad achieves near-zero bias within the calibration distribution (≤0.2pp) and degrades on OOD populations (-3.4 to -4.9pp). Several differences are notable:

• Higher raw CC bias (+14-16pp vs. +2-5pp with Opus).
• Comparable MCGrad within-calibration performance (≤0.2pp for both models), confirming that multicalibration corrects for model-specific calibration errors.
• Larger OOD bias on Spanish media (-4.9pp vs. -2.5pp with Opus binary labels), reflecting the combination of a weaker base model with the coarser verbalized score distribution (11 unique values vs. Opus’s 43).

S2.4 Score Distribution: Log-Probabilities vs. Verbalized Confidence

The bimodal distribution of Llama’s log-probability scores illustrates a broader challenge for LLM-based measurement. RLHF-tuned instruction-following models tend to produce highly confident outputs, pushing token probabilities toward 0 or 1. This creates two problems for prevalence estimation: (1) the scores carry little information about uncertainty, producing large raw bias even at baseline (+18pp), and (2) post-hoc calibration methods that operate in logit space (including MCGrad) require score preprocessing to avoid numerical instability. Verbalized confidence elicitation partially addresses both problems by producing scores that are better distributed (75% in [0.1, 0.9]) and better calibrated out of the box (log loss 0.525 vs. 1.707 for log-probabilities).
However, the scores remain coarsely discretized (11 unique values), and as shown in both the Llama and Opus analyses, the quality of the input scores matters less than the metadata features for MCGrad’s prevalence estimation performance under shift.

S2.5 Additional Baselines: SLD and PACC on Llama Scores

The SLD (EMQ) algorithm, designed for label shift rather than covariate shift, diverges catastrophically on Llama’s verbalized confidence scores, producing prevalence estimates biased by +33 to +60pp. This occurs because the verbalized scores are not calibrated posteriors, violating SLD’s core assumption. PACC shows moderate bias (+0.6 to +5.9pp within calibration, +2.4 to +5.9pp OOD). Full results including SLD and PACC are available in the replication code.

S3. Detailed Results Tables

Table S1: ACS Employment Prevalence Estimation Bias

| Setting | Age Dist. | True Prev. | Raw | CC | RG | PACC | SLD | IPW | Iso. | MCGrad |
|---|---|---|---|---|---|---|---|---|---|---|
| In-Dist | Original | 46.0% | -0.31 | -0.07 | +0.26 | -0.07 | +0.01 | -0.3 | -0.30 | -0.27 |
| In-Dist | Young-skewed | 12.8% | +1.93 | +2.47 | -12.82 | -12.82 | -11.95 | -0.3 | +2.04 | -0.11 |
| In-Dist | Old-skewed | 16.8% | +7.23 | -6.62 | -16.77 | -16.77 | -16.76 | -1.2 | +6.65 | +0.22 |
| In-Dist | Bimodal | 21.1% | +4.57 | -1.50 | -18.62 | -19.97 | -16.14 | +4.7 | +4.33 | +0.12 |
| OOD | Original | 45.1% | +1.15 | +1.40 | +2.12 | +2.13 | +2.25 | +1.7 | +1.17 | +1.35 |
| OOD | Young-skewed | 13.0% | +2.93 | +3.73 | -12.96 | -12.96 | -11.38 | +0.8 | +3.08 | +0.88 |
| OOD | Old-skewed | 16.0% | +8.47 | -5.64 | -15.97 | -15.97 | -15.97 | +0.1 | +7.91 | +1.01 |
| OOD | Bimodal | 20.8% | +5.91 | +0.09 | -16.14 | -17.27 | -15.20 | +6.3 | +5.69 | +1.13 |

Prevalence estimation bias in percentage points (pp) under synthetic age distribution shift. Raw = uncalibrated averaging, CC = Classify & Count, RG = Rogan-Gladen, IPW = importance-weighted prevalence estimation, Iso. = Isotonic regression. Bootstrap RMSE (200 iterations) closely tracks absolute bias in all scenarios.

Table S2: CAP Law & Crime Prevalence Estimation Bias (Claude Opus 4.6)

| Scenario | Shift Type | True Prev. | CC | RG | SLD | IPW | Iso. | MC (binary) | MC (scores) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | None | 7.9% | +2.2 | +0.5 | +7.4 | +0.1 | +0.1 | +0.1 | +0.2 |
| Country shift | Within-cal. | 8.4% | +3.3 | +1.7 | +9.5 | +0.1 | +0.9 | +0.4 | +0.4 |
| Doc-type shift | Within-cal. | 6.3% | +1.6 | -0.3 | +4.9 | +0.1 | +0.0 | -0.0 | +0.1 |
| Spain media | OOD doc type | 19.5% | +3.6 | +3.1 | +21.6 | -12.2 | -2.5 | -1.9 | -4.5 |
| Belgium TV | OOD doc type | 11.1% | +4.8 | +3.7 | +13.7 | -4.5 | +2.6 | +0.7 | +1.6 |

CC = Classify & Count (fraction of Yes labels); RG = Rogan-Gladen adjustment on binary labels; SLD = Saerens-Latinne-Decaestecker (label shift, applied to probability scores); IPW = importance-weighted estimation (target-specific density ratio); Iso. = isotonic regression on probability scores; MC (binary) = MCGrad on binary labels with base-rate initialization; MC (scores) = MCGrad on probability scores.

S4. Simulation: RMSE

Figure S2: Root mean squared error (RMSE) under covariate shift for the same four methods shown in Figure 1, averaged over 50 simulation runs. RMSE closely tracks absolute bias for all methods, confirming that variance is small relative to bias at this sample size. MCGrad maintains the lowest RMSE across all shift levels.

S5. Simulation: All Methods

Figure S3: Simulation bias curves for all seven methods. Rogan-Gladen and PACC exhibit catastrophic failure (bias exceeding -200% at extreme shifts). SLD shows large bias under covariate shift because it assumes label shift. Uncalibrated averaging shows moderate bias. MCGrad maintains near-zero bias throughout.

S6. Claude Opus Score Distribution

Figure S4: Claude Opus 4.6 P(Yes) score distribution by label across six CAP sub-populations. Scores are well-separated (mean 0.75 for positives vs. 0.07 for negatives) with 43 unique values and no boundary mass.

References

González, Pablo, Alberto Castaño, Nitesh V. Chawla, and Juan José Del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50 (5): 1–40.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783.
Rogan, Walter J., and Beth Gladen. 1978. “Estimating Prevalence from the Results of a Screening Test.” American Journal of Epidemiology 107 (1): 71–76.
Saerens, Marco, Patrice Latinne, and Christine Decaestecker. 2002. “Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure.” Neural Computation 14 (1): 21–41.
Tax, Niek, Lorenzo Perini, Fridolin Linder, et al. 2026. “MCGrad: Multicalibration at Web Scale.” Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Tian, Katherine, Eric Mitchell, Allan Zhou, et al. 2023. “Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chunks

Chunk 0

Unbiased Prevalence Estimation with Multicalibrated LLMs

Fridolin Linder (1), Thomas Leeper (1), Daniel Haimovich (1), Niek Tax (1), Lorenzo Perini (1), Milan Vojnovic (1, 2)

(1) Meta Platforms Inc.; (2) The London School of Economics and Political Science

Corresponding author: Fridolin Linder (flinder@meta.com)

Classification: Social Sciences / Political Science; Physical Sciences / Statistics

Keywords: multicalibration | large language models | prevalence estimation | covariate shift | quantification

Significance

Large language models are increasingly used as measurement devices to estimate prevalence in populations. A critical but overlooked problem arises when the target population differs from the validation population: standard methods produce biased prevalence estimates, even when the model achieves high classification accuracy.

Chunk 1

We show that multicalibration, which requires a device to be accurate conditional on input features rather than just on average, is sufficient for unbiased prevalence estimation under covariate shift. Our theoretical and empirical results imply that the rapidly growing body of LLM-based measurement research is vulnerable to systematic bias that can be mitigated by enforcing multicalibration.

Chunk 2

Abstract

Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations.

Chunk 3

We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee.

Chunk 4

Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias.

Chunk 5

While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications (estimating employment prevalence across U.S.

Chunk 6

states using the American Community Survey, and classifying political texts across four countries using an LLM) demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.

Introduction

Large language models (LLMs) are increasingly used as measurement devices for estimating the prevalence of a category in a population.

Chunk 7

Researchers now routinely deploy LLMs as zero-shot classifiers to estimate the frequency of phenomena that previously required expensive manual annotation: coding democracy indicators across countries (Weidmann et al. 2026), classifying protest events in news corpora (Overos et al.

Chunk 8

2024), estimating party policy positions from manifestos across dozens of countries and languages (Benoit et al. 2026), categorizing open-ended survey responses at near-human accuracy (Mellon et al.

Chunk 9

2024; Gilardi et al. 2023), extracting diagnostic attributes from pathology reports (Sushil et al.

Chunk 10

2024), identifying goals-of-care discussions in clinical notes (Lee et al. 2025), annotating art forms in auction records (Tojima and Yoshida 2025), and converting qualitative text into quantitative variables across multiple languages (Karjus 2025).

Chunk 11

LLMs are also being deployed for content moderation1 and as LLM “judges” to assess the quality and safety of AI systems (Zheng et al. 2023; Yuan et al.

Chunk 12

2024). These applications are typically validated by reporting discriminative performance: accuracy, AUC, or agreement with human annotations.

Chunk 13

Researchers then use or advocate for using validated LLMs as measurement devices across new populations, time periods, or subgroups. But strong discriminative performance does not guarantee accurate prevalence estimates from the classifications provided by these devices.

Chunk 14

When the composition of the target population differs from the validation population, a model that separates positives from negatives well can still produce prevalence estimates that are substantially biased, because discrimination is invariant to the calibration errors that drive prevalence bias. This problem is often missed in standard validation practice and, as we show, can produce large bias even with strong classification performance.

1 https://openai.com/index/using-gpt-4-for-content-moderation/

Chunk 15

Neither the confidence elicitation literature nor the quantification2 literature solves this problem. Confidence elicitation methods — verbalized confidence (Tian et al.

Chunk 16

2023), token log-probabilities (Kadavath et al. 2022), consistency sampling (Wang et al.

Chunk 17

2023) — replace the LLM’s bare binary classification with a probability score, and post-hoc calibration (Oliveira et al. 2025) can further refine those scores.

Chunk 18

Both target global calibration: calibration on the population where they are evaluated. Empirical evidence confirms that such calibration does not transfer: it degrades 2–3× under language shift even when accuracy is preserved (Yang et al.

Chunk 19

2023), and varies widely across tasks and domains (Ren et al. 2025).

Chunk 20

Quantification methods such as “Classify & Count”, “Adjusted Count”, and the Saerens-Latinne-Decaestecker (SLD) EM algorithm (see González et al. (2017) for a detailed survey) attempt to correct for classifier errors directly, but rely on the assumption that error properties remain static across populations.

Chunk 21

Importance-weighted methods handle covariate shift in principle but require re-estimating density ratios for each new target population. None of these approaches provides a measurement device that can be validated once and then reliably applied to new populations.

Chunk 22

We show that multicalibration — calibration conditional on the input features, not just on average — fills this gap. Building on Kim et al.

Chunk 23

(2022)’s “Universal Adaptability” result, we show that a multicalibrated device requires no target-specific estimation: it is calibrated once on source data and produces unbiased prevalence estimates on any target population whose features lie within the calibration support. This property is critical for LLM-based measurement, where a single device is typically applied to many populations without the opportunity to re-calibrate for each one.

Chunk 24

Multicalibration can be applied to any LLM output — from discrete Yes/No classifications (the standard practice in applied research) to continuous confidence scores — because it operates on the scores’ relationship to outcomes conditional on features, not on the scores’ intrinsic quality. Our results apply more generally to any model-based measurement device, but the LLM setting is where the practical need is most urgent and the calibration problem most overlooked.

Chunk 25

We review existing approaches to the quantification problem, connect the quantification task to the broader theory of domain adaptation via multicalibration, and apply those insights to two empirical applications: a controlled demonstration using a standard ML classifier to estimate employment prevalence under age distribution shift (American Community Survey), followed by the main application of classifying political texts across four countries using an LLM as a zero-shot measurement device.

Results

Standard calibration methods fail under covariate shift

Consider a binary outcome Y ∈ {0, 1}, features X, and a device h(X) ∈ [0, 1] producing probabilistic predictions.

Chunk 26

The goal is to estimate population prevalence π = P(Y = 1) in a target population using only unlabeled target data {Xi} and the device h. We focus on covariate shift (Storkey 2009), where P(X) changes across populations but P(Y | X) remains stable.

Chunk 27

This is the natural assumption when features causally influence outcomes (X →Y ), as in many measurement applications. See Limitations for discussion of label shift (Y →X) and concept drift.

Chunk 28

A device is globally calibrated if E[Y | h(X) = p] = p for all prediction values p. Under global calibration, E[h(X)] = E[Y ] = π, so the sample mean of predictions is an unbiased prevalence estimate.

Chunk 29

However, global calibration is a property of a specific population.

2 The task of estimating prevalence from imperfect classifiers is known as quantification in the machine learning literature; we use this term interchangeably with prevalence estimation.

Chunk 30

A device calibrated on one population need not be calibrated on another, even under covariate shift with stable P(Y | X). The failure mechanism is as follows.

Chunk 31

Suppose the population consists of subgroups G with weights wG. Within each subgroup, the device may be biased: let ϵG = E[Y −h(X) | X ∈G] denote the mean prediction error within group G.

Chunk 32

Global calibration requires only that these errors cancel on average: $\sum_G w_G \epsilon_G = 0$. Under covariate shift, the group weights change to $w_G^*$, and the bias in the prevalence estimate becomes $\sum_G w_G^* \epsilon_G$, which is generally nonzero unless $\epsilon_G = 0$ for all $G$.
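As a rough numeric illustration of this cancellation argument, the sketch below evaluates the weighted error sum under source and shifted weights. The two groups, their weights, and their errors are hypothetical values chosen for illustration; none of them come from the paper.

```python
def weighted_error(weights, errors):
    """Return sum over groups G of w_G * eps_G."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("group weights must sum to 1")
    return sum(weights[g] * errors[g] for g in weights)


# Hypothetical within-group mean errors eps_G = E[Y - h(X) | X in G].
errors = {"A": 0.05, "B": -0.05}

source_weights = {"A": 0.5, "B": 0.5}  # errors cancel on average at the source
target_weights = {"A": 0.9, "B": 0.1}  # covariate shift reweights the groups

source_gap = weighted_error(source_weights, errors)
target_gap = weighted_error(target_weights, errors)
print(f"source: {source_gap:+.3f}, target: {target_gap:+.3f}")
```

With these made-up numbers the weighted error is zero under the source weights and 0.04 under the target weights, even though neither within-group error has changed.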

Chunk 33

A device can be perfectly globally calibrated, with errors that precisely cancel in the training population, while having nonzero mean prediction error within every individual subgroup. Standard quantification methods are each vulnerable to this failure mechanism (González et al.

Chunk 34

2017). Classify & Count, Rogan-Gladen adjustment (Rogan and Gladen 1978), and Probabilistic Adjusted Classify & Count (PACC) all rely on error rates or conditional score means estimated from calibration data; these are weighted averages over subgroups and shift with population composition.

Chunk 35

The SLD/EMQ algorithm (Saerens et al. 2002) and more recent distribution-matching methods such as DyS and HDy (Maletzke et al.

Chunk 36

2018) assume label shift (P(X | Y ) stable) and fail under covariate shift. Even global calibration via isotonic regression, recommended by Wu and Resnick (2024) for covariate shift settings, produces biased prevalence estimates because global calibration does not guarantee feature-conditional accuracy.

Chunk 37

Multicalibration guarantees unbiased prevalence estimation

The analysis above shows that all standard methods produce biased prevalence estimates under covariate shift, with bias $\sum_G w_G^* \epsilon_G$ that is generally nonzero unless every $\epsilon_G = 0$. Multicalibration (Hébert-Johnson et al.

Chunk 38

2018) formalizes this requirement. A predictor $f(X)$ is multicalibrated with respect to a collection of subgroups $\mathcal{G}$ if $E[Y \mid f(X) = v, X \in G] = v$ for every $G \in \mathcal{G}$ and prediction value $v$.

Chunk 39

When the collection $\mathcal{G}$ is rich enough to capture all relevant structure in X, this is equivalent to requiring calibration conditional on the full feature space: E[Y | f(X) = v, X = x] = v for all x and v with sufficient probability mass. This is the framing of Kim et al.

Chunk 40

(2022)’s “Universal Adaptability”: a predictor calibrated conditional on X yields correct expected values under any reweighting of P(X), without requiring knowledge of the shift. The connection to prevalence estimation is direct.

Chunk 41

If f is calibrated conditional on X, then E[f(X) | X = x] = E[Y | X = x] for all x. Under covariate shift, where only P(X) changes while P(Y | X) remains stable, the law of iterated expectations gives

$$E^*[f(X)] = \int E[f(X) \mid X = x] \, dP^*(x) = \int E[Y \mid X = x] \, dP^*(x) = E^*[Y] = \pi^*,$$

where $E^*$ and $P^*$ denote expectations and probabilities under the target distribution, and $\pi^* = P^*(Y = 1)$ is the target prevalence.
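The identity can be checked by simulation. The sketch below is illustrative only: the conditional probabilities, the two predictors, and the sample sizes are hypothetical values not taken from the paper. It compares a predictor `f` that matches P(Y = 1 | X = x) in every stratum with a predictor `h` whose errors merely cancel when P(X = 1) = 0.5.

```python
import random

# Hypothetical values for illustration only.
P_Y_GIVEN_X = {0: 0.2, 1: 0.7}   # stable P(Y = 1 | X = x)
f = dict(P_Y_GIVEN_X)            # calibrated conditional on X
h = {0: 0.3, 1: 0.6}             # unbiased on average only when P(X = 1) = 0.5


def simulate(p_x1, n, rng):
    """Draw n units with P(X = 1) = p_x1; return mean Y, mean f(X), mean h(X)."""
    sum_y = sum_f = sum_h = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p_x1 else 0
        y = 1 if rng.random() < P_Y_GIVEN_X[x] else 0
        sum_y += y
        sum_f += f[x]
        sum_h += h[x]
    return sum_y / n, sum_f / n, sum_h / n


rng = random.Random(0)
source = simulate(p_x1=0.5, n=200_000, rng=rng)  # both predictors track mean Y
target = simulate(p_x1=0.1, n=200_000, rng=rng)  # only f keeps tracking mean Y
print("source (Y, f, h):", [round(v, 3) for v in source])
print("target (Y, f, h):", [round(v, 3) for v in target])
```

In expectation the target prevalence here is 0.25, which the mean of `f` matches, while the mean of `h` sits near 0.33.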

Chunk 42

Because predictions are correct at each point in the feature space, the average of predictions tracks the true prevalence under any change in P(X). The condition strictly necessary for this robustness is multi-accuracy: E[f(X)−Y | X ∈G] = 0 for all groups G that the shift can reweight.

Chunk 43

Multicalibration, which additionally requires E[Y | f(X) = v, X ∈G] = v, is strictly stronger and implies multi-accuracy. Multicalibration is the preferable target for three reasons.

Chunk 44

First, practical post-hoc algorithms like MCGrad (Tax et al. 2026; an open-source implementation is available at https://mcgrad.dev) naturally produce multicalibrated

Chunk 45

predictors (Hébert-Johnson et al. 2018), so the stronger guarantee comes at no additional cost.

Chunk 46

Second, multicalibration covers a broader set of use cases where multi-accuracy no longer suffices and full calibration conditional on both features and score level is required: when the population shift is mediated by the device’s scores (e.g., using the device’s scores to decide which items to label, or threshold-based filtering), when prevalence is estimated within score strata, or when scores are used as inputs to downstream regressions. Third, multicalibration provides robustness against misspecification of which features drive the shift, since a predictor multicalibrated with respect to a rich class of subgroups is automatically multi-accurate for any sub-partition.

Chunk 47

Both conditions require that the calibration features capture the dimensions along which the population shift occurs (an ignorability assumption).

Simulation: standard methods fail, multicalibration succeeds

We illustrate the theoretical results above with a simulation.

Chunk 48

We generate data with a binary covariate X ∈ {0, 1} and binary outcome Y, where P(Y = 1 | X = 1) = 0.85 and P(Y = 1 | X = 0) = 0.15. We simulate a classifier with hardcoded, systematically biased predictions: 10% underestimation when X = 0 (a score of 0.135 against a true rate of 0.15) and 10% overestimation when X = 1 (a score of 0.935 against a true rate of 0.85).

Chunk 49

All calibration parameters are learned on a balanced training distribution (P(X = 0) = 0.5). Prevalence is estimated by averaging the predicted probabilities over the target sample.

Chunk 50

We then evaluate prevalence estimates on test distributions where P(X = 0) ranges from 0.01 to 0.99, repeating 50 times (details in Materials and Methods).

Figure 1: Prevalence estimation bias (% relative) under covariate shift, averaged over 50 simulation runs.

Chunk 51

The x-axis shows ∆P(X =0), the change in P(X =0) from the training value of 0.5. Classify & Count and Rogan-Gladen diverge with increasing shift; isotonic regression (global calibration) shows moderate bias; MCGrad maintains near-zero bias across all shift levels.

Chunk 52

Classify & Count and Rogan-Gladen curves are cropped at the ±40% axis limits; their bias continues to grow beyond this range, exceeding ±100% at extreme shifts (see SI Appendix Figure S3 for full range). Additional methods (PACC, SLD, uncalibrated averaging) also shown in SI Appendix Figure S3.

Chunk 53

Figure 1 shows the results. At the training distribution (center), all methods produce approximately unbiased estimates.

Chunk 54

As the distribution shifts, the methods diverge. Rogan-Gladen is particularly unstable: its ratio structure amplifies estimation errors, with bias exceeding ±40% at extreme shifts.

Chunk 55

Classify & Count shows growing bias in the same direction. Isotonic regression (global calibration) shows more moderate but still substantial bias (up to 15% at extreme shifts).

Chunk 56

SLD and PACC exhibit comparably large failures (SI Appendix Figure S3). The multicalibrated estimator maintains near-zero bias and the lowest RMSE across the entire range of distribution shifts (SI Appendix Figure S2), confirming that the bias reduction does not come at the cost of increased variance.

Chunk 57

Empirical application: employment prevalence under age distribution shift

Before applying multicalibration to LLM-generated scores, which present additional challenges such as coarse discretization and non-standard score distributions, we first demonstrate the core mechanism in a controlled setting with a traditional machine learning classifier. We analyze data from the American Community Survey (ACS), a large-scale annual survey conducted by the U.S.

Chunk 58

Census Bureau. The concept to be measured is the rate of employment.

Chunk 59

The measurement device is a logistic regression model of binary employment status, with 16 sociodemographic features including age, education, marital status, disability status, and citizenship. Setup.

Chunk 60

We train a logistic regression classifier on data from eight U.S. states (TX, MI, PA, OH, IL, GA, NC, VA) across 2016–2018, totaling approximately 1.5 million training observations.

Chunk 61

The remaining in-distribution data is split into a calibration set (n ≈644,000) and a test set (n ≈920,000). Six additional states (CA, NY, FL, WA, AZ, CO) are held out for out-of-distribution (OOD) evaluation.

Chunk 62

We fit two post-hoc calibration methods on the calibration set: isotonic regression (global calibration) and MCGrad, a multicalibration algorithm that enforces calibration conditional on both categorical and numerical features. We compare five prevalence estimation methods (Figure 2): Classify & Count with a prevalence-matched threshold, Rogan-Gladen adjustment, importance-weighted estimation (IPW), isotonic regression, and MCGrad.

Chunk 63

Additional methods (PACC, SLD, uncalibrated averaging) are reported in SI Appendix Table S1. Age distribution shift.

Chunk 64

Employment rates vary dramatically by age: approximately 47% for ages 16–24, 76% for ages 25–54, 61% for ages 55–64, and 17% for ages 65+. This makes age an ideal dimension along which to construct meaningful distribution shifts.

Chunk 65

We create synthetic target populations by resampling test data with shifted age distributions: young-skewed (oversampling ages 16–30), old-skewed (oversampling ages 60+), and bimodal (oversampling both tails). The resulting populations have true employment rates ranging from 12.8% to 46.0%.

Chunk 66

All calibration parameters are estimated once on the original calibration set and held fixed across scenarios.

Figure 2: Prevalence estimation |bias| (percentage points) under synthetic age distribution shift, for in-distribution states (left) and out-of-distribution states (right).

Chunk 67

Each marker shape represents a different age shift scenario; horizontal lines show the mean across scenarios. Full numerical results in SI Appendix Table S1.

Chunk 68

Results. With no age shift, all methods produce approximately unbiased estimates (Figure 2).

Chunk 69

Under shift, the methods diverge sharply. Rogan-Gladen fails severely (12–19pp bias under age shift).

Chunk 70

SLD, designed for label shift rather than covariate shift, shows comparably large bias (SI Appendix Table S1). Classify & Count and isotonic regression show moderate but growing bias (up to 8pp under age shift).

Chunk 71

IPW performs well on simple shifts (≤1.2pp) but fails on the bimodal shift (+4.7pp in-distribution, +6.3pp OOD) where the density ratio is hard to model. MCGrad produces near-zero bias across all in-distribution scenarios (≤0.27pp), including the bimodal shift where IPW struggles.

Chunk 72

In the OOD setting (held-out states with age shift), MCGrad’s bias increases modestly (0.88–1.35pp), reflecting geographic shift along an uncalibrated dimension. Even so, MCGrad maintains the lowest bias and RMSE across all scenarios.

Chunk 73

LLM-based topic classification under cross-national shift

The ACS application uses a standard machine learning classifier with well-distributed probability estimates. We now examine the setting that more immediately motivates this paper: using an LLM as a zero-shot measurement device across multiple countries and languages.

Chunk 74

We compare two LLM output modes that reflect current practice and the confidence elicitation literature, respectively: (1) discrete Yes/No classifications, which is how LLMs are used in all of the applied studies cited in the introduction, and (2) probability scores obtained via direct probability elicitation, where the LLM estimates P(Yes) and P(No) without first committing to an answer. This design lets us test whether confidence elicitation improves prevalence estimation, and whether multicalibration is needed in either case.

Chunk 75

Setup. We use Claude Opus 4.6 to classify 30,000 political texts from the Comparative Agendas Project (Baumgartner et al.

Chunk 76

2006) as related to Law & Crime (CAP major topic code 12). The data comprises six sub-populations across four countries and four languages (5,000 documents each): Danish parliamentary questions, Spanish oral questions, U.S.

Chunk 77

congressional bills, Belgian newspaper articles, Spanish media articles, and Belgian TV news. Each document is classified twice in independent campaigns: once for a binary Yes/No label, and once for direct probability estimates P(Yes) and P(No) (summing to 1.0).

Chunk 78

The LLM achieves strong discriminative performance: AUC 0.960 for binary labels and 0.987 for probability scores on the in-distribution test set, with per-language AUCs of 0.983–0.994 for the probability scores. The calibration set (n ≈13,400) draws equally from four sub-populations (Denmark questions, Spain questions, U.S.

Chunk 79

bills, Belgium newspaper), ensuring that all four countries and languages are represented. Two sub-populations are held out as out-of-distribution targets: Spanish media and Belgian TV news, which share country and language with calibration data but introduce an unseen document type.

Chunk 80

MCGrad is calibrated with categorical features (country, document type, party) and two numerical features (decade, document length). For the binary-label condition, MCGrad receives the LLM’s Yes/No classification as a categorical input feature, with all initial scores set to the calibration-set base rate.

Chunk 81

MCGrad then learns feature-conditional prevalence estimates from the label and metadata alone, without any probability score from the LLM. For the probability-score condition, MCGrad receives the LLM’s P(Yes) directly as the input score and calibrates it conditional on the same metadata features.

Chunk 82

Results. Figure 3 shows prevalence estimation bias across five scenarios: a baseline with no shift, two within-calibration shifts (country composition, document type composition), and two OOD scenarios.

Chunk 83

Figure 3: Prevalence estimation |bias| (percentage points) across the shift gradient. Each marker shape represents a different shift scenario; horizontal lines show the mean across scenarios.

Chunk 84

MCGrad achieves near-zero bias within the calibration distribution in both binary-label and probability-score conditions. Full numerical results in SI Appendix Table S2.

Chunk 85

The failure pattern is clearly visible: Classify & Count, the standard practice of counting positive LLM classifications, produces bias of +1.6 to +4.8pp across all scenarios. The Rogan-Gladen adjustment reduces bias within calibration but still shows +3.1 to +3.7pp on OOD populations.

Chunk 86

Isotonic regression shows moderate bias within calibration (≤0.9pp) but larger OOD bias (2.5–2.6pp). SLD, designed for label shift, shows very large bias (+5 to +22pp; SI Appendix Table S2).

Chunk 87

MCGrad on binary labels achieves near-zero bias within the calibration distribution (≤0.4pp) and degrades gracefully on OOD targets: +0.7pp on Belgian TV and -1.9pp on Spanish media. This is achieved using only the LLM’s Yes/No classification and document metadata, with no probability scores.

Chunk 88

MCGrad on probability scores achieves comparable within-calibration performance (≤0.4pp) but shows larger OOD bias on Spanish media (-4.5pp) and Belgian TV (+1.6pp). The finding that binary labels with MCGrad can match or outperform probability scores with MCGrad suggests that the metadata features, not the input score quality, drive the calibration improvement.

Chunk 89

IPW, the standard covariate-shift method, achieves near-zero within-calibration bias but fails severely on OOD populations (-12.2pp on Spanish media) because the target contains feature values absent from the source (a positivity violation). Unlike MCGrad, IPW also requires re-estimating density ratios for each target population.
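The positivity problem can be made concrete with a small sketch. It is not the paper's IPW implementation: the paper estimates density ratios by logistic regression, whereas this illustration swaps in empirical frequency ratios over a single categorical feature, and the function name `ipw_prevalence` and the toy document types are hypothetical. The estimator reweights labeled source items by P*(x)/P(x) and also reports the share of the target that falls in strata never seen in the source.

```python
from collections import Counter


def ipw_prevalence(source, target_x):
    """Self-normalized importance-weighted prevalence estimate.

    source: list of (x, y) labeled pairs; target_x: list of unlabeled x.
    Returns (estimate, uncovered), where `uncovered` is the fraction of the
    target in strata absent from the source (a positivity violation).
    """
    src_freq = Counter(x for x, _ in source)
    tgt_freq = Counter(target_x)
    n_src, n_tgt = len(source), len(target_x)
    # Density ratio P*(x) / P(x), defined only for strata seen in the source.
    weights = {
        x: (tgt_freq[x] / n_tgt) / (count / n_src)
        for x, count in src_freq.items()
    }
    numerator = sum(weights[x] * y for x, y in source)
    denominator = sum(weights[x] for x, _ in source)
    uncovered = sum(c for x, c in tgt_freq.items() if x not in src_freq) / n_tgt
    return numerator / denominator, uncovered


if __name__ == "__main__":
    source = (
        [("news", 1)] * 10 + [("news", 0)] * 40
        + [("bills", 1)] * 30 + [("bills", 0)] * 20
    )
    print(ipw_prevalence(source, ["news"] * 80 + ["bills"] * 20))
    print(ipw_prevalence(source, ["news"] * 50 + ["tv"] * 50))
```

In the second call half of the target consists of a document type with no source counterpart, so no reweighting of source labels can represent it; the estimate silently reflects only the covered half.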

Chunk 90

Comparison with the ACS application. Across both applications, the pattern is consistent: MCGrad achieves near-zero bias when the target population’s features are within the calibration support, and degrades when the shift is along an uncalibrated dimension.

Chunk 91

IPW matches MCGrad within calibration but fails more severely on OOD targets. MCGrad’s practical advantage is that it requires no target-specific estimation: a single calibrated device can be applied to any target population.

Chunk 92

This is the “universal adaptability” property of Kim et al. (2022).

Chunk 93

We replicate the CAP analysis using Llama 3.3 70B Instruct (an open-weight model) in SI Appendix S2; the results are consistent, confirming that the findings are not specific to a particular LLM.

Discussion

We have shown that multicalibration (calibration conditional on the input features rather than just on average) is sufficient for accurate model-based prevalence estimation under population shift.

Chunk 94

The minimal theoretical requirement is multi-accuracy (correct predictions on average within each subgroup), which multicalibration implies. Both conditions require that the calibration features capture the relevant dimensions of population shift, and both require that the target population’s features lie within the support of the calibration distribution.

Chunk 95

Both empirical applications confirm this: when calibrated features overlap with the target population, MCGrad achieves near-zero bias; when the target introduces a novel feature value, bias increases with the severity of the shift. Practitioners should ensure their calibration data covers the key dimensions along which target populations may differ.

Chunk 96

A central practical advantage of multicalibration is that it produces a target-independent measurement device. Unlike importance-weighted methods, which require re-estimating density ratios for each new target and fail under positivity violations (IPW: -12.2pp on CAP Spanish media, vs.

Chunk 97

MCGrad: -1.9pp), a multicalibrated device is calibrated once and deployed without target-specific re-estimation. This is the “universal adaptability” property of Kim et al.

Chunk 98

(2022). Our CAP results further show that MCGrad on discrete binary labels achieves comparable or better prevalence estimation than MCGrad on continuous probability scores, suggesting that metadata features matter more than input score quality for prevalence estimation under shift.

Chunk 99

This has immediate practical implications: researchers using the standard workflow of prompting an LLM for Yes/No classifications can apply multicalibration directly to those labels, without needing confidence elicitation. Multicalibration does not require access to a model’s internals: observable metadata (document source, language, text length) can serve as segment features (Detommaso et al.

Chunk 100

2024). Researchers who currently validate by reporting accuracy or AUC (Grimmer et al.

Chunk 101

2022) should be aware that these metrics provide no information about the calibration errors that drive prevalence bias. Several limitations warrant discussion.

Chunk 102

First, the guarantee holds only for shifts along calibrated dimensions; out-of-domain performance degrades when target populations introduce novel feature values. Second, multicalibration requires labeled calibration data of sufficient size.

Chunk 103

Our two applications show MCGrad performing well with both large (644K, ACS) and moderate (13.4K, CAP) calibration sets, but the minimum required depends on the complexity of the feature space. In zero-shot LLM settings, calibration labels may require the manual annotation LLMs were intended to avoid; prediction-powered inference (Angelopoulos et al.

Chunk 104

2023) offers a complementary framework. Third, our framework assumes covariate shift (P(Y | X) stable); under concept drift, no purely statistical correction substitutes for new labeled data.

Chunk 105

Finally, both applications use settings where ground-truth labels enable direct bias measurement; in practice, practitioners may lack such labels. These results connect two literatures that have developed in isolation.

Chunk 106

The quantification literature has focused on correcting aggregate error rates but has not engaged with feature-conditional calibration (González et al. 2017; Wu and Resnick 2024).

Chunk 107

The multicalibration literature has focused on individual-level prediction quality and fairness but has not emphasized the implications for population-level inference (Hébert-Johnson et al. 2018; Kim et al.

Chunk 108

2022). Our contribution is to show that calibration conditional on the feature space is what makes LLM-based prevalence estimation reliable under the distribution shifts that motivate its use.

Chunk 109

Materials and Methods

Simulation. We generate n = 10,000 observations with a binary covariate X ∈ {0, 1} and outcome Y ∼ Bernoulli(P(Y | X)), where P(Y = 1 | X = 1) = 0.85 and P(Y = 1 | X = 0) = 0.15.

Chunk 110

The classifier produces deterministic scores ˆp(X = 0) = 0.135 and ˆp(X = 1) = 0.935, representing 10% multiplicative bias within each stratum. For each of B = 50 iterations, we generate fresh calibration data from P(X = 0) = 0.5, estimate all method-specific parameters, then evaluate bias and mean squared error on 20 test distributions with P(X = 0) ranging from 0.01 to 0.99.
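A single iteration of this design can be sketched as follows. This is an illustration under stated assumptions, not the authors' released code: the function names are hypothetical, only three of the seven methods are included (uncalibrated averaging, the global multiplicative correction, and the stratum-specific additive correction defined in the SI Appendix), and bias is measured against the analytic target prevalence rather than the sampled one.

```python
import random

P_Y = {0: 0.15, 1: 0.85}      # P(Y=1 | X), stable across populations
SCORE = {0: 0.135, 1: 0.935}  # deterministic, miscalibrated device scores


def draw(n, p_x0, rng):
    """Sample (x, y) pairs with P(X=0) = p_x0 and Y ~ Bernoulli(P_Y[x])."""
    xs = [0 if rng.random() < p_x0 else 1 for _ in range(n)]
    return [(x, 1 if rng.random() < P_Y[x] else 0) for x in xs]


def fit_corrections(calib):
    """Estimate the global factor c and the per-stratum offsets eps_g."""
    mean_y = sum(y for _, y in calib) / len(calib)
    mean_h = sum(SCORE[x] for x, _ in calib) / len(calib)
    eps = {}
    for g in (0, 1):
        ys = [y for x, y in calib if x == g]
        eps[g] = sum(ys) / len(ys) - SCORE[g]
    return mean_y / mean_h, eps


def one_iteration(p_x0_target, n=10_000, seed=0):
    """Calibrate on a balanced sample, then estimate prevalence on a target."""
    rng = random.Random(seed)
    c, eps = fit_corrections(draw(n, 0.5, rng))
    target_x = [x for x, _ in draw(n, p_x0_target, rng)]  # labels unused
    truth = p_x0_target * P_Y[0] + (1 - p_x0_target) * P_Y[1]
    raw = sum(SCORE[x] for x in target_x) / n
    multicalibrated = sum(SCORE[x] + eps[x] for x in target_x) / n
    return {
        "uncalibrated": raw - truth,
        "global": c * raw - truth,
        "multicalibrated": multicalibrated - truth,
    }


if __name__ == "__main__":
    for p_x0 in (0.01, 0.5, 0.99):
        print(p_x0, one_iteration(p_x0))
```

Because the global factor is fitted to the balanced calibration mix, it only removes the average error at that mix; the stratum-specific offsets remove the error within each stratum and therefore carry over to any target composition.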

Chunk 111

The seven methods compared (uncalibrated averaging, Classify & Count with a prevalence-matched threshold, Rogan-Gladen adjustment, PACC, SLD/EMQ, global multiplicative calibration, and multicalibration with stratum-specific additive corrections) are detailed with full mathematical definitions in the SI Appendix. Empirical application (ACS).

Chunk 112

We use the ACS Public Use Microdata Sample via the folktables package, with the ACSEmployment prediction task (binary: employed vs. not employed) and 16 sociodemographic features.

Chunk 113

Training data comprises eight states (TX, MI, PA, OH, IL, GA, NC, VA) across 2016–2018. The base model is logistic regression with standard scaling.

Chunk 114

Post-hoc calibration uses isotonic regression (global) and MCGrad (multicalibration with categorical and numerical segment features) on a held-out calibration set (n ≈644,000). In addition to Classify & Count, Rogan-Gladen, isotonic regression, and MCGrad, we include importance-weighted prevalence estimation (IPW), which estimates density ratios between source and target via logistic regression on all 16 features, and two methods from the quantification literature: PACC (Probabilistic Adjusted Classify & Count) (González et al.

Chunk 115

2017) and SLD (Saerens-Latinne-Decaestecker) (Saerens et al. 2002).

Chunk 116

Uncalibrated averaging is also reported in SI Appendix Table S1. Synthetic age-shifted populations are created by importance-weighted resampling of the test set (n ≈920,000), with exponential weights favoring young ages, old ages, or both extremes.
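The resampling step can be sketched as below. The paper does not report the exact weight functions, so everything specific here is an assumption: the function names, the `rate` value, and the choice of the mean age as the center of the bimodal scenario are all hypothetical placeholders for the authors' settings.

```python
import math
import random


def age_shift_weights(ages, mode, rate=0.05):
    """Exponential resampling weights; mode is 'young', 'old', or 'bimodal'."""
    center = sum(ages) / len(ages)
    if mode == "young":
        exponents = [-rate * a for a in ages]
    elif mode == "old":
        exponents = [rate * a for a in ages]
    elif mode == "bimodal":
        exponents = [rate * abs(a - center) for a in ages]
    else:
        raise ValueError(f"unknown mode: {mode}")
    top = max(exponents)  # subtract the maximum for numerical stability
    return [math.exp(e - top) for e in exponents]


def resample(records, ages, mode, n_out, rate=0.05, seed=0):
    """Draw n_out records with replacement, weighted by the age scenario."""
    rng = random.Random(seed)
    weights = age_shift_weights(ages, mode, rate)
    return rng.choices(records, weights=weights, k=n_out)


if __name__ == "__main__":
    ages = list(range(16, 96))
    for mode in ("young", "old", "bimodal"):
        sample = resample(ages, ages, mode, n_out=5000)
        print(mode, sum(sample) / len(sample))
```

Only the composition of the test set changes under this scheme; the labels attached to each record are untouched, which is what makes the scenarios pure covariate shifts along age.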

Chunk 117

RMSE is computed via bootstrap resampling (200 iterations per scenario). Six additional states (CA, NY, FL, WA, AZ, CO) serve as OOD evaluation data.

Chunk 118

Empirical application (CAP). We use data from the Comparative Agendas Project (Baumgartner et al.

Chunk 119

2006), which provides expert-coded policy topic labels for political texts across countries. The binary outcome is whether a document addresses Law & Crime (CAP major topic code 12).

Chunk 120

We sample 30,000 documents (5,000 from each of six sub-populations: Danish parliamentary questions, Spanish oral questions, U.S. congressional bills, Belgian newspaper articles, Spanish media articles, and Belgian TV news).

Chunk 121

The measurement device is Claude Opus 4.6. Each document is classified in two independent campaigns: (1) binary Yes/No classification using the CAP codebook definition of Law & Crime, and (2) direct probability elicitation, where the LLM estimates P(Yes) and P(No) without first committing to an answer.

Chunk 122

The two campaigns were run independently to avoid anchoring contamination. The calibration set (n ≈13,400) draws equally from four sub-populations (Denmark questions, Spain questions, U.S.

Chunk 123

bills, and Belgium newspaper). MCGrad is calibrated with categorical features (country, document type, political party) and two numerical features (decade and document length).

Chunk 124

For the binary-label condition, the LLM’s Yes/No classification is passed to MCGrad as a categorical input feature, with initial scores set to the calibration-set base rate; MCGrad learns feature-conditional prevalence estimates from the metadata and LLM label alone. For the probability-score condition, MCGrad receives P(Yes) as the input score.

Chunk 125

Classify & Count reports the fraction of positive LLM classifications. Rogan-Gladen adjustment corrects this fraction using binary-label TPR and FPR estimated on the calibration set.

Chunk 126

Isotonic regression is fitted on probability scores using the same calibration set. IPW estimates density ratios via logistic regression on country, document type, decade, and document length.

Chunk 127

SLD results are reported in SI Appendix Table S2. Two sub-populations are held out as out-of-distribution targets: Spanish media and Belgian TV news, which differ from the calibration data only in document type while sharing the same countries and languages.

Chunk 128

Software. MCGrad is available at https://github.com/facebookincubator/MCGrad.

Chunk 129

Simulation and analysis code are available at https://github.com/facebookresearch/multicalibrated_llm_measurement.

Competing Interests

The authors declare no competing interests.

Chunk 130

Disclosure of Delegation to Generative AI

The authors declare the use of generative AI in the research and writing process. According to the GAIDeT taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision:
• Literature search and systematization
• Code generation
• Code optimization
• Data collection
• Data cleaning
• Data analysis
• Visualization
• Reproducibility testing
• Text generation
• Proofreading and editing
• Reformatting
• Identification of limitations

The GAI tools used were: Claude Opus 4.6, Claude Opus 4.7, Gemini 3 Pro.

Chunk 131

Responsibility for the final manuscript lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes.

Chunk 132

Declaration submitted by: Fridolin Linder. AI was involved for the listed tasks but no task was exclusively done by AI.

Chunk 133

All outputs were manually verified and iterated on by the authors.

Data Availability

The American Community Survey data is publicly available via the folktables package.

Chunk 134

The Comparative Agendas Project data is publicly available at https://www.comparativeagendas.net. Simulation and analysis code are available at https://github.com/facebookresearch/multicalibrated_llm_measurement.

Chunk 135

References

Angelopoulos, Anastasios N, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. 2023. “Prediction-Powered Inference.” Science 382 (6671): 669–74.

Baumgartner, Frank R., Christoffer Green-Pedersen, and Bryan D. Jones. 2006. “Comparative Studies of Policy Agendas.” Journal of European Public Policy 13 (7): 959–74.

Benoit, Kenneth, Scott De Marchi, Conor Laver, Michael Laver, and Jinshuai Ma. 2026. “Using Large Language Models to Analyze Political Texts Through Natural Language Understanding.” American Journal of Political Science, ahead of print. https://doi.org/10.1111/ajps.70050.

Detommaso, Gianluca, Martin Bertran, Riccardo Fogliato, and Aaron Roth. 2024. “Multicalibration for Confidence Scoring in LLMs.” Proceedings of the 41st International Conference on Machine Learning (ICML).

Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences 120 (30): e2305016120.

González, Pablo, Alberto Castaño, Nitesh V. Chawla, and Juan José Del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50 (5): 1–40.

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Hébert-Johnson, Ursula, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. 2018. “Multicalibration: Calibration for the (Computationally-Identifiable) Masses.” Proceedings of the 35th International Conference on Machine Learning (ICML), 1939–48.

Kadavath, Saurav, Tom Conerly, Amanda Askell, et al. 2022. “Language Models (Mostly) Know What They Know.” arXiv Preprint arXiv:2207.05221.

Karjus, Andres. 2025. “Machine-Assisted Quantitizing Designs: Augmenting Humanities and Social Sciences with Artificial Intelligence.” Humanities & Social Sciences Communications.

Kim, Michael P., Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. 2022. “Universal Adaptability: Target-Independent Inference That Competes with Propensity Scoring.” Proceedings of the National Academy of Sciences 119 (4): e2108097119.

Lee, Robert Y, Kevin S Li, James Sibley, et al. 2025. “Assessment of a Zero-Shot Large Language Model in Measuring Documented Goals-of-Care Discussions.” Journal of Pain and Symptom Management.

Maletzke, André G., Denis M. dos Reis, Everton A. Cherman, and Gustavo E. A. P. A. Batista. 2018. “On the Need of Class Ratio Insensitive Drift Tests for Data Streams.” Proceedings of the 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications (LIDTA), 110–24.

Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, Marta Miori, and Phillip Schmedeman. 2024. “Do AIs Know What the Most Important Issue Is? Using Language Models to Code Open-Text Social Survey Responses at Scale.” Research & Politics 11 (1): 1–7.

Oliveira, Rodrigo de, Matthew Garber, James M. Gwinnutt, et al. 2025. “A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Natural Language Processing.” JAMIA Open 8 (4).

Overos, Henry David, Roman Hlatky, Ojashwi Pathak, et al. 2024. “Coding with the Machines: Machine-Assisted Coding of Rare Event Data.” PNAS Nexus 3 (5): pgae165.

Ren, Kevin, Santiago Cortes-Gomez, Carlos Miguel Patiño, et al. 2025. “Predicting Language Models’ Success at Zero-Shot Probabilistic Prediction.” Findings of the Association for Computational Linguistics: EMNLP 2025, 18337–63.

Rogan, Walter J., and Beth Gladen. 1978. “Estimating Prevalence from the Results of a Screening Test.” American Journal of Epidemiology 107 (1): 71–76.

Saerens, Marco, Patrice Latinne, and Christine Decaestecker. 2002. “Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure.” Neural Computation 14 (1): 21–41.

Storkey, Amos. 2009. “When Training and Test Sets Are Different: Characterizing Learning Transfer.” In Dataset Shift in Machine Learning. MIT Press.

Sushil, Madhumita, Travis Zack, Divneet Mandair, et al. 2024. “A Comparative Study of Large Language Model-Based Zero-Shot Inference and Task-Specific Supervised Classification of Breast Cancer Pathology Reports.” Journal of the American Medical Informatics Association 31 (10): 2315–27.

Tax, Niek, Lorenzo Perini, Fridolin Linder, et al. 2026. “MCGrad: Multicalibration at Web Scale.” Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

Tian, Katherine, Eric Mitchell, Allan Zhou, et al. 2023. “Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Tojima, Tatsuya, and Mitsuo Yoshida. 2025. “Zero-Shot Classification of Art with Large Language Models.” IEEE Access 13: 17426–39.

Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” Proceedings of the 11th International Conference on Learning Representations (ICLR).

Weidmann, Nils B., Mats Faulborn, and David Garcia. 2026. “Large Language Models Are Democracy Coders with Attitudes.” PS: Political Science & Politics 59 (1): 17–23.

Wu, Siqi, and Paul Resnick. 2024. “Calibrate-Extrapolate: Rethinking Prevalence Estimation with Black Box Classifiers.” Proceedings of the International AAAI Conference on Web and Social Media (ICWSM) 18: 1634–47.

Yang, Yahan, Soham Dan, Dan Roth, and Insup Lee. 2023. “On the Calibration of Multilingual Question Answering LLMs.” arXiv Preprint arXiv:2311.08669.

Yuan, Weizhe, Richard Yuanzhe Pang, Kyunghyun Cho, et al. 2024. “Self-Rewarding Language Models.” Proceedings of the 41st International Conference on Machine Learning (ICML).

Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems (NeurIPS).

SI Appendix: Unbiased Prevalence Estimation with Multicalibrated LLMs

SI Appendix S1.

Chunk 189

Formal Definitions of Prevalence Estimation Methods

This section provides full mathematical definitions of the seven prevalence estimation methods compared in the simulation study. Setup.

Chunk 190

Let h(X) ∈ [0, 1] denote the device’s probabilistic prediction for input X, with true label Y ∈ {0, 1}. The goal is to estimate the target prevalence $\pi^* = P^*(Y = 1)$ using only unlabeled target data $\{X^*_i\}_{i=1}^{n}$ and calibration parameters estimated from a labeled source dataset.

Chunk 191

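S1.1 Uncalibrated Averaging

$$\hat{\pi}_{\mathrm{raw}} = \frac{1}{n} \sum_{i=1}^{n} h(X^*_i)$$

S1.2 Classify & Count

Given a threshold τ chosen on calibration data:

$$\hat{\pi}_{\mathrm{CC}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X^*_i) \ge \tau]$$

In the simulation, τ is chosen so that $\hat{\pi}_{\mathrm{CC}}$ matches the true prevalence on the calibration set.

S1.3 Rogan-Gladen (Adjusted Count)

$$\hat{\pi}_{\mathrm{RG}} = \frac{\hat{\pi}_{\mathrm{CC}} - \widehat{\mathrm{FPR}}}{\widehat{\mathrm{TPR}} - \widehat{\mathrm{FPR}}}$$

where $\widehat{\mathrm{TPR}}$ and $\widehat{\mathrm{FPR}}$ are estimated from calibration data at threshold τ (Rogan and Gladen 1978).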

Chunk 192

S1.4 Probabilistic Adjusted Classify & Count (PACC)

$$\hat{\pi}_{\mathrm{PACC}} = \frac{\bar{h} - \hat{\mu}_0}{\hat{\mu}_1 - \hat{\mu}_0}$$

where $\bar{h} = \frac{1}{n} \sum_i h(X^*_i)$, $\hat{\mu}_1 = \hat{E}[h(X) \mid Y = 1]$, and $\hat{\mu}_0 = \hat{E}[h(X) \mid Y = 0]$ are estimated from calibration data (González et al. 2017).
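The closed-form estimators in S1.1 to S1.4 can be sketched directly from their definitions. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the estimates are left unclipped because the definitions above do not specify truncation to [0, 1].

```python
def mean(values):
    """Arithmetic mean of any iterable of numbers (or booleans)."""
    values = list(values)
    return sum(values) / len(values)


def uncalibrated_average(target_scores):
    """S1.1: average the raw scores over the target sample."""
    return mean(target_scores)


def classify_and_count(target_scores, tau):
    """S1.2: fraction of target items with score at or above tau."""
    return mean(h >= tau for h in target_scores)


def rogan_gladen(target_scores, calib_scores, calib_labels, tau):
    """S1.3: adjust the Classify & Count rate by calibration-set TPR and FPR."""
    pairs = list(zip(calib_scores, calib_labels))
    tpr = mean(h >= tau for h, y in pairs if y == 1)
    fpr = mean(h >= tau for h, y in pairs if y == 0)
    return (classify_and_count(target_scores, tau) - fpr) / (tpr - fpr)


def pacc(target_scores, calib_scores, calib_labels):
    """S1.4: adjust the mean score by the class-conditional score means."""
    pairs = list(zip(calib_scores, calib_labels))
    mu1 = mean(h for h, y in pairs if y == 1)
    mu0 = mean(h for h, y in pairs if y == 0)
    return (mean(target_scores) - mu0) / (mu1 - mu0)


if __name__ == "__main__":
    calib_scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]
    calib_labels = [1, 1, 1, 0, 0, 0]
    target_scores = [0.9, 0.1, 0.2, 0.6]
    print(uncalibrated_average(target_scores))
    print(classify_and_count(target_scores, 0.5))
    print(rogan_gladen(target_scores, calib_scores, calib_labels, 0.5))
    print(pacc(target_scores, calib_scores, calib_labels))
```

All three adjusted estimators depend on quantities (TPR, FPR, class-conditional means) that are averages over the calibration population, which is why they inherit its subgroup composition.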

Chunk 193

S1.5 SLD (EMQ)

The Saerens-Latinne-Decaestecker algorithm iterates:

1. Initialize $\hat{\pi}^{(0)}$ from the source prevalence.
2. E-step: Adjust posteriors for the new prior:

$$\tilde{h}^{(t)}_i = \frac{(\hat{\pi}^{(t)}/\pi_s) \cdot h(X^*_i)}{(\hat{\pi}^{(t)}/\pi_s) \cdot h(X^*_i) + ((1 - \hat{\pi}^{(t)})/(1 - \pi_s)) \cdot (1 - h(X^*_i))}$$

3. M-step: $\hat{\pi}^{(t+1)} = \frac{1}{n} \sum_i \tilde{h}^{(t)}_i$
4. Repeat until convergence (Saerens et al. 2002).

S1.6 Global Calibration

In the simulation, global calibration applies a multiplicative correction:

$$h_{\mathrm{cal}}(X) = c \cdot h(X)$$

where $c = \bar{Y}_{\mathrm{cal}} / \bar{h}_{\mathrm{cal}}$ is estimated on calibration data.

Chunk 197

In the empirical applications, global calibration uses isotonic regression. S1.7 Multicalibration In the simulation, which has a single binary covariate, multicalibration reduces to stratum-specific additive corrections: hmc(X) = h(X) + ˆϵg for X ∈stratum g where ˆϵg = ¯Yg −¯hg is estimated on calibration data within each stratum.
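
The simulation versions of S1.5 to S1.7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the convergence tolerance, and the toy data are introduced here; the SLD loop assumes scores and the source prevalence lie strictly between 0 and 1; and the corrected scores are not clipped to [0, 1] because the definitions above do not mention clipping.

```python
from typing import Dict, Hashable, List, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def sld_prevalence(target_scores: Sequence[float], source_prev: float,
                   max_iter: int = 1000, tol: float = 1e-10) -> float:
    """S1.5: EM iteration that re-weights posteriors for a new class prior."""
    pi = source_prev  # step 1: initialize from the source prevalence
    for _ in range(max_iter):
        w1 = pi / source_prev                  # prior ratio for the positive class
        w0 = (1.0 - pi) / (1.0 - source_prev)  # prior ratio for the negative class
        # E-step: adjusted posterior for every target item
        adjusted = [w1 * h / (w1 * h + w0 * (1.0 - h)) for h in target_scores]
        new_pi = mean(adjusted)                # M-step
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    return pi


def global_scale(cal_scores: Sequence[float], cal_labels: Sequence[int]) -> float:
    """S1.6: multiplicative factor c = mean(Y_cal) / mean(h_cal)."""
    return mean(cal_labels) / mean(cal_scores)


def stratum_offsets(cal_scores: Sequence[float], cal_labels: Sequence[int],
                    cal_strata: Sequence[Hashable]) -> Dict[Hashable, float]:
    """S1.7: additive correction mean(Y_g) - mean(h_g) for each stratum g."""
    offsets = {}
    for g in set(cal_strata):
        ys = [y for y, s in zip(cal_labels, cal_strata) if s == g]
        hs = [h for h, s in zip(cal_scores, cal_strata) if s == g]
        offsets[g] = mean(ys) - mean(hs)
    return offsets


def apply_offsets(scores: Sequence[float], strata: Sequence[Hashable],
                  offsets: Dict[Hashable, float]) -> List[float]:
    """Add each item's stratum offset to its score."""
    return [h + offsets[g] for h, g in zip(scores, strata)]


# Toy calibration data (hypothetical values, for illustration only).
cal_scores = [0.2, 0.4, 0.6, 0.8]
cal_labels = [0, 1, 1, 1]
cal_strata = ["a", "a", "b", "b"]

c = global_scale(cal_scores, cal_labels)
offsets = stratum_offsets(cal_scores, cal_labels, cal_strata)
corrected = apply_offsets([0.3, 0.5], ["a", "b"], offsets)
prevalence_mc = mean(corrected)  # prevalence estimate = mean of corrected scores
prevalence_sld = sld_prevalence([0.9, 0.9, 0.9, 0.1], source_prev=0.5)
```

The global factor is a single number shared by all inputs, whereas the stratum offsets can differ by covariate value, which is what lets the corrected mean follow a change in the stratum mix of the target population.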

In the empirical applications, we use MCGrad (Tax et al. 2026), a multicalibration algorithm based on gradient boosting. MCGrad operates in logit space: given a base predictor $f_0(X)$ with logit $F_0(X) = \operatorname{logit}(f_0(X))$, it iteratively fits gradient boosted decision trees (GBDTs) on the residuals between labels and current predictions. At each round $t$, a GBDT $g_t$ is trained with the current logit predictions as init_score and with the feature matrix consisting of the segment features (categorical and numerical) augmented by the current logit prediction as an additional input feature. The logit predictor is then updated as

\[ F_{t+1}(X) = \alpha_t \cdot (F_t(X) + g_t(X)), \]

where $\alpha_t$ is an unshrinkage factor estimated by logistic regression to counteract the GBDT’s learning rate. By including the prediction as a feature, GBDT splits naturally discover miscalibrated regions in the joint space of features and score levels, thereby approximating multicalibration without requiring explicit group specification. Early stopping on a validation set prevents overfitting. MCGrad uses LightGBM as the GBDT implementation. See Tax et al. (2026) for convergence results and deployment details.
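
To make the update rule concrete, the sketch below isolates the unshrinkage step. It is not MCGrad's implementation: the tree outputs $g_t(X)$ are taken as given rather than fitted with LightGBM, and the factor $\alpha_t$ is estimated here by a one-parameter logistic regression without intercept, solved with Newton's method, which is an assumption about the form of the regression. All function names and the simulated data are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_unshrinkage(logits: Sequence[float], labels: Sequence[int],
                    n_iter: int = 50) -> float:
    """Fit alpha in P(Y=1) = sigmoid(alpha * logit) by Newton's method."""
    alpha = 1.0
    for _ in range(n_iter):
        grad = 0.0  # derivative of the log-likelihood with respect to alpha
        curv = 0.0  # negative second derivative (always non-negative)
        for z, y in zip(logits, labels):
            p = sigmoid(alpha * z)
            grad += (y - p) * z
            curv += p * (1.0 - p) * z * z
        if curv == 0.0:
            break
        step = grad / curv
        alpha += step
        if abs(step) < 1e-10:
            break
    return alpha


def unshrink_update(current_logits: Sequence[float], tree_outputs: Sequence[float],
                    labels: Sequence[int]) -> Tuple[float, List[float]]:
    """One update F_{t+1} = alpha_t * (F_t + g_t), with g_t supplied by the caller."""
    combined = [f + g for f, g in zip(current_logits, tree_outputs)]
    alpha = fit_unshrinkage(combined, labels)
    return alpha, [alpha * c for c in combined]


# Simulated check: labels follow sigmoid(T), but F_t + g_t only recovers 0.5 * T,
# mimicking a learner whose contribution was shrunk by a learning rate.
rng = random.Random(0)
true_logits = [rng.uniform(-3.0, 3.0) for _ in range(5000)]
labels = [1 if rng.random() < sigmoid(t) else 0 for t in true_logits]
current = [0.3 * t for t in true_logits]
trees = [0.2 * t for t in true_logits]
alpha_hat, updated = unshrink_update(current, trees, labels)
print(alpha_hat)  # expected to land near 2 in this simulation
```

In this toy setting the fitted factor rescales the shrunken logits back toward the scale that generated the labels, which is the role the text assigns to $\alpha_t$.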

S2. Robustness: Replication with Open-Weight LLM (Llama 3.3 70B)

The main text reports results using Claude Opus 4.6 as the LLM measurement device. To verify that the findings are not specific to a particular model, we replicate the CAP analysis using Llama 3.3 70B Instruct (Grattafiori et al. 2024), 4-bit NF4 quantized and run on a single A100 80GB GPU. This section reports results using two score extraction methods: token log-probabilities and verbalized confidence elicitation.

S2.1 Score Extraction Methods

Log-probabilities. For each document, the model is prompted with the CAP codebook definition of Law & Crime and asked to respond Yes or No. The score is extracted from next-token log-probabilities: $h(X) = P(\text{Yes}) / (P(\text{Yes}) + P(\text{No}))$. This produces highly bimodal scores: 23% of the 105,000 documents score at exactly 0.0 or 1.0, and only 6% fall in the mid-range [0.1, 0.9]. Because MCGrad’s internal logit transform maps scores of exactly 0 and 1 to $-\infty$ and $+\infty$ (and scores close to the boundaries to extreme values), a linear squashing transformation $h'(X) = \epsilon + (1 - 2\epsilon) \cdot h(X)$ with $\epsilon = 0.05$ is applied before fitting MCGrad, which confines the scores to [0.05, 0.95].
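
The score extraction and squashing steps for log-probabilities can be sketched as below. The function names are hypothetical and the log-probability inputs stand in for whatever the inference API returns for the Yes and No tokens; only the two formulas come from the text.

```python
import math


def yes_probability(logp_yes: float, logp_no: float) -> float:
    """h(X) = P(Yes) / (P(Yes) + P(No)), computed from token log-probabilities."""
    m = max(logp_yes, logp_no)  # subtract the max to avoid underflow in exp
    p_yes = math.exp(logp_yes - m)
    p_no = math.exp(logp_no - m)
    return p_yes / (p_yes + p_no)


def squash(h: float, eps: float = 0.05) -> float:
    """Linear squashing h' = eps + (1 - 2 * eps) * h, mapping [0, 1] to [eps, 1 - eps]."""
    return eps + (1.0 - 2.0 * eps) * h


def logit(p: float) -> float:
    """Log-odds; undefined at p = 0 and p = 1, hence the squashing step."""
    return math.log(p / (1.0 - p))


# A saturated score of exactly 1.0 has no finite logit, but its squashed value does.
saturated = yes_probability(0.0, -1000.0)
finite_logit = logit(squash(saturated))
```

With $\epsilon = 0.05$ the logits passed on are bounded in absolute value by $\log(0.95/0.05) = \log 19 \approx 2.94$.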

Verbalized confidence (2-stage). A two-stage dialogue first asks the model to classify the document (Yes/No), then asks it to estimate the probability that its answer is correct, with an anti-certainty instruction (“Note: very few things are 0% or 100% certain”) to discourage degenerate outputs (Tian et al. 2023). The score is $P(\text{correct})$ if the answer is Yes and $1 - P(\text{correct})$ if No. This produces scores in [0.01, 0.99] with negligible boundary mass and 11 unique score values. No squashing is required.
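
The mapping from the two-stage answer to a score for the positive class is simple enough to state in code. The function name and the string handling are assumptions made for illustration; parsing of the model's raw reply is not shown.

```python
def verbalized_score(answer: str, p_correct: float) -> float:
    """Convert (Yes/No answer, stated probability of being correct) into P(Yes)."""
    normalized = answer.strip().lower()
    if normalized == "yes":
        return p_correct
    if normalized == "no":
        return 1.0 - p_correct
    raise ValueError(f"expected 'Yes' or 'No', got {answer!r}")


# A confident "No" becomes a low score for the positive class.
low_score = verbalized_score("No", 0.9)
high_score = verbalized_score("Yes", 0.9)
```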

S2.2 Data and Calibration

The Llama analysis uses the full 105,000-document sample (15,000 per sub-population for Denmark questions, Spain questions, U.S. bills, and Belgium newspaper; 30,000 for Spanish media; 15,000 for Belgian TV). The calibration set (n ≈ 40,000) is drawn equally from the four in-distribution sub-populations. MCGrad is calibrated with categorical features (country, document type, party) and one numerical feature (decade).

S2.3 Results: Verbalized Confidence Scores

Table S3 shows prevalence estimation bias using Llama 3.3 70B with verbalized confidence scores.

| Scenario | Shift Type | True Prev. | CC | RG | IPW | Iso. | MCGrad |
|---|---|---|---|---|---|---|---|
| Baseline | None | 8.1% | +14.7 | +0.6 | +0.1 | +0.2 | +0.2 |
| Country shift | Within-cal. | 8.7% | +15.6 | +2.2 | -0.3 | +1.3 | +0.1 |
| Doc-type shift | Within-cal. | 6.5% | +16.0 | +1.7 | +0.0 | +0.7 | +0.1 |
| Spain media | OOD doc type | 19.3% | +15.4 | +6.6 | -9.8 | -7.2 | -4.9 |
| Belgium TV | OOD doc type | 11.1% | +13.3 | -0.0 | -3.5 | -1.6 | -3.4 |

Table S3: Prevalence estimation bias (pp) for Law & Crime topic using Llama 3.3 70B with verbalized confidence scores. CC = Classify & Count, RG = Rogan-Gladen, IPW = importance-weighted estimation, Iso. = isotonic regression.

The pattern is consistent with the main text’s Claude Opus results: MCGrad achieves near-zero bias within the calibration distribution (≤0.2pp) and degrades on OOD populations (-3.4 to -4.9pp). Several differences are notable:

• Higher raw CC bias (+14-16pp vs. +2-5pp with Opus).
• Comparable MCGrad within-calibration performance (≤0.2pp for both models), confirming that multicalibration corrects for model-specific calibration errors.
• Larger OOD bias on Spanish media (-4.9pp vs. -2.5pp with Opus binary labels), reflecting the combination of a weaker base model with the coarser verbalized score distribution (11 unique values vs. Opus’s 43).

S2.4 Score Distribution: Log-Probabilities vs. Verbalized Confidence

The bimodal distribution of Llama’s log-probability scores illustrates a broader challenge for LLM-based measurement. RLHF-tuned instruction-following models tend to produce highly confident outputs, pushing token probabilities toward 0 or 1. This creates two problems for prevalence estimation: (1) the scores carry little information about uncertainty, producing large raw bias even at baseline (+18pp), and (2) post-hoc calibration methods that operate in logit space (including MCGrad) require score preprocessing to avoid numerical instability. Verbalized confidence elicitation partially addresses both problems by producing scores that are better distributed (75% in [0.1, 0.9]) and better calibrated out of the box (log loss 0.525 vs. 1.707 for log-probabilities). However, the scores remain coarsely discretized (11 unique values), and as shown in both the Llama and Opus analyses, the quality of the input scores matters less than the metadata features for MCGrad’s prevalence estimation performance under shift.

S2.5 Additional Baselines: SLD and PACC on Llama Scores

The SLD (EMQ) algorithm, designed for label shift rather than covariate shift, diverges catastrophically on Llama’s verbalized confidence scores, producing prevalence estimates biased by +33 to +60pp. This occurs because the verbalized scores are not calibrated posteriors, violating SLD’s core assumption. PACC shows moderate bias (+0.6 to +5.9pp within calibration, +2.4 to +5.9pp OOD). Full results including SLD and PACC are available in the replication code.

S3. Detailed Results Tables

Table S1: ACS Employment Prevalence Estimation Bias

| Setting | Age Dist. | True Prev. | Raw | CC | RG | PACC | SLD | IPW | Iso. | MCGrad |
|---|---|---|---|---|---|---|---|---|---|---|
| In-Dist | Original | 46.0% | -0.31 | -0.07 | +0.26 | -0.07 | +0.01 | -0.3 | -0.30 | -0.27 |
| In-Dist | Young-skewed | 12.8% | +1.93 | +2.47 | -12.82 | -12.82 | -11.95 | -0.3 | +2.04 | -0.11 |
| In-Dist | Old-skewed | 16.8% | +7.23 | -6.62 | -16.77 | -16.77 | -16.76 | -1.2 | +6.65 | +0.22 |
| In-Dist | Bimodal | 21.1% | +4.57 | -1.50 | -18.62 | -19.97 | -16.14 | +4.7 | +4.33 | +0.12 |
| OOD | Original | 45.1% | +1.15 | +1.40 | +2.12 | +2.13 | +2.25 | +1.7 | +1.17 | +1.35 |
| OOD | Young-skewed | 13.0% | +2.93 | +3.73 | -12.96 | -12.96 | -11.38 | +0.8 | +3.08 | +0.88 |
| OOD | Old-skewed | 16.0% | +8.47 | -5.64 | -15.97 | -15.97 | -15.97 | +0.1 | +7.91 | +1.01 |
| OOD | Bimodal | 20.8% | +5.91 | +0.09 | -16.14 | -17.27 | -15.20 | +6.3 | +5.69 | +1.13 |

Prevalence estimation bias in percentage points (pp) under synthetic age distribution shift. Raw = uncalibrated averaging, CC = Classify & Count, RG = Rogan-Gladen, IPW = importance-weighted prevalence estimation, Iso. = isotonic regression. Bootstrap RMSE (200 iterations) closely tracks absolute bias in all scenarios.
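
A bootstrap RMSE of the kind mentioned in the note can be sketched as follows. The text gives only the number of iterations (200), so the choice to resample the target sample with replacement, the function names, and the seed are assumptions made for illustration.

```python
import math
import random
from typing import Callable, Sequence


def mean_score(scores: Sequence[float]) -> float:
    """Prevalence estimate as the mean of (possibly corrected) scores."""
    return sum(scores) / len(scores)


def bootstrap_rmse(scores: Sequence[float], true_prev: float,
                   estimator: Callable[[Sequence[float]], float],
                   n_boot: int = 200, seed: int = 0) -> float:
    """RMSE of an estimator around the true prevalence over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(scores)
    squared_errors = []
    for _ in range(n_boot):
        # Resample the target items with replacement, keeping the sample size.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        squared_errors.append((estimator(resample) - true_prev) ** 2)
    return math.sqrt(sum(squared_errors) / n_boot)


# With constant scores every resample gives the same estimate,
# so the RMSE reduces to the absolute bias.
rmse_constant = bootstrap_rmse([0.3] * 50, true_prev=0.2, estimator=mean_score)
rmse_varied = bootstrap_rmse([0.1, 0.2, 0.6, 0.9] * 25, true_prev=0.45,
                             estimator=mean_score)
```

Any of the estimators from S1 can be passed as `estimator`, provided it takes the resampled target scores as its only argument.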

Table S2: CAP Law & Crime Prevalence Estimation Bias (Claude Opus 4.6)

| Scenario | Shift Type | True Prev. | CC | RG | SLD | IPW | Iso. | MC (binary) | MC (scores) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | None | 7.9% | +2.2 | +0.5 | +7.4 | +0.1 | +0.1 | +0.1 | +0.2 |
| Country shift | Within-cal. | 8.4% | +3.3 | +1.7 | +9.5 | +0.1 | +0.9 | +0.4 | +0.4 |
| Doc-type shift | Within-cal. | 6.3% | +1.6 | -0.3 | +4.9 | +0.1 | +0.0 | -0.0 | +0.1 |
| Spain media | OOD doc type | 19.5% | +3.6 | +3.1 | +21.6 | -12.2 | -2.5 | -1.9 | -4.5 |
| Belgium TV | OOD doc type | 11.1% | +4.8 | +3.7 | +13.7 | -4.5 | +2.6 | +0.7 | +1.6 |

CC = Classify & Count (fraction of Yes labels); RG = Rogan-Gladen adjustment on binary labels; SLD = Saerens-Latinne-Decaestecker (label shift, applied to probability scores); IPW = importance-weighted estimation (target-specific density ratio); Iso. = isotonic regression on probability scores; MC (binary) = MCGrad on binary labels with base-rate initialization; MC (scores) = MCGrad on probability scores.

S4. Simulation: RMSE

Figure S2: Root mean squared error (RMSE) under covariate shift for the same four methods shown in Figure 1, averaged over 50 simulation runs. RMSE closely tracks absolute bias for all methods, confirming that variance is small relative to bias at this sample size. MCGrad maintains the lowest RMSE across all shift levels.
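
The link between RMSE and bias follows from the standard decomposition of mean squared error, written here for a generic prevalence estimator $\hat{\pi}$ of $\pi^*$ (the notation is added for illustration and is not taken from the simulation code):

```latex
\[
\mathrm{RMSE}(\hat{\pi})^2
  = E\!\left[(\hat{\pi} - \pi^*)^2\right]
  = \underbrace{\left(E[\hat{\pi}] - \pi^*\right)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}(\hat{\pi})}_{\text{variance}}
\]
```

When the variance term is much smaller than the squared bias, $\mathrm{RMSE}(\hat{\pi}) \approx |E[\hat{\pi}] - \pi^*|$, which is the pattern the figure reports.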

S5. Simulation: All Methods

Figure S3: Simulation bias curves for all seven methods. Rogan-Gladen and PACC exhibit catastrophic failure (bias exceeding -200% at extreme shifts). SLD shows large bias under covariate shift because it assumes label shift. Uncalibrated averaging shows moderate bias. MCGrad maintains near-zero bias throughout.

S6. Claude Opus Score Distribution

Figure S4: Claude Opus 4.6 P(Yes) score distribution by label across six CAP sub-populations. Scores are well-separated (mean 0.75 for positives vs. 0.07 for negatives) with 43 unique values and no boundary mass.

References

González, Pablo, Alberto Castaño, Nitesh V. Chawla, and Juan José Del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50 (5): 1–40.

Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783.

Rogan, Walter J., and Beth Gladen. 1978. “Estimating Prevalence from the Results of a Screening Test.” American Journal of Epidemiology 107 (1): 71–76.

Saerens, Marco, Patrice Latinne, and Christine Decaestecker. 2002. “Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure.” Neural Computation 14 (1): 21–41.

Tax, Niek, Lorenzo Perini, Fridolin Linder, et al. 2026. “MCGrad: Multicalibration at Web Scale.” Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

Tian, Katherine, Eric Mitchell, Allan Zhou, et al. 2023. “Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).