CoFEE: Reasoning Control for LLM-Based
Feature Discovery
Maximilian Westermann∗, Ben Griffin∗, Aaron Ontoyin Yin†, Zakari Salifu†, Yagiz Ihlamur‡, Kelvin Amoaba†,
Joseph Ternasky†, Fuat Alican†, Yigit Ihlamur†
∗University of Oxford †Vela Research ‡Amazon
Abstract—Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. Building on the ever-improving capabilities of Large Language Models (LLMs), our method provides a structured approach to this challenge. LLMs are well suited to this task because they can process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. CoFEE does not modify model architecture, training, or inference; reasoning control is implemented entirely through structured prompts. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.
Index Terms—Reasoning Control, Feature Engineering, Inductive Bias
I. INTRODUCTION
FEATURE discovery is a critical step in analytical pipelines, particularly in domains that rely heavily on unstructured or complex data such as venture capital (VC). In these settings, relevant signals are often implicit, non-linear, and distributed across unstructured sources, making manual feature construction difficult.

While interpretable, rule-based models such as linear models and decision trees offer transparent, auditable reasoning, they often struggle to capture the complexity of available data [1]. This limits the predictive effectiveness of the features they use. Recent work such as GPTree partially addresses this limitation by integrating large language models (LLMs) into a decision tree, allowing transparent and interpretable decision tree reasoning to provide valuable information for VC decision-making [1].

Fig. 1. Overview of the CoFEE pipeline.

A complementary challenge, feature engineering, arises upstream of model construction. Previous work has shown that well-structured features can improve predictability across learning algorithms [2], yet feature discovery remains a major bottleneck, especially when features must be extracted from unstructured data.

LLMs offer automated feature discovery as a potential solution, with their ability to interpret large amounts of data using reasoning [3]. However, prompting LLMs to propose features without sufficient guidance can produce features with limited predictability [4]. This raises a key question: How can reasoning behaviors induced by cognitive prompting act as inductive biases to produce highly predictive and interpretable features using LLMs?

This work was done outside of the current role that Yagiz Ihlamur holds at Amazon. Correspondence: Maximilian Westermann <maximilian.westermann@wadham.ox.ac.uk>.
Recent work has identified a set of reasoning behaviors, such as backward chaining, subgoal decomposition, verification, and backtracking, that correlate with improved learning and self-improvement in language models [5], [6]. Complementary research further supports this perspective, arguing that AI evaluation should mirror human testing practices by explicitly evaluating reasoning processes rather than evaluating only final outputs [7]. In LLMs, such behaviors can be instantiated through prompt-level structure, motivating cognitive prompting as a practical control interface over model reasoning.

arXiv:2604.21584v1 [cs.AI] 23 Apr 2026

Fig. 2. CoFEE pipeline. Agent 1 performs cognitive feature selection to construct an initial master list, which is refined via semantic similarity by Agent 2. Agent 3 then scores each feature by counting the number of successful and unsuccessful founders exhibiting it.
Motivated by these findings, we introduce CoFEE (Cognitive Feature Engineering Engine), an agent-based pipeline for producing features (Fig. 1). Analogous to how GPTree structures LLM reasoning with decision trees, CoFEE structures LLM reasoning upstream during feature discovery. Using CoFEE, we conduct a controlled empirical study comparing two versions of the same pipeline: one in which feature discovery is carried out by an agent using cognitive prompting, and a baseline that uses vanilla GPT-5.2 for feature discovery.

In this work, we find that cognitive prompting produces top-ranked features with higher predictability scores on average. These findings suggest that prompting cognitive behaviors can serve as an effective design strategy for LLM-based feature engineering.
II. REASONING CONTROL VIA COGNITIVE CONSTRAINTS
We use the term reasoning control to describe how we constrain LLMs’ reasoning behavior, implemented in our framework through cognitive prompting. In machine learning terms, this corresponds to imposing structured inductive biases on the feature generation process. These constraints are implemented using prompts, enforcing discipline during generation.

In CoFEE, reasoning control is implemented through four cognitive behavior constraints as described in Gandhi et al. [5]:
• Backward chaining: Start from the desired outcome and reason backward.
• Subgoal decomposition: Break down complex tasks into smaller, manageable steps.
• Verification: Systematically check the results of each intermediate step.
• Backtracking: Revise steps explicitly when they fail.

III. DATASET
To validate the effectiveness of inducing cognitive behaviors in LLMs, we use a dataset consisting of 1,000 founder profiles collected from publicly available sources. This dataset contains information about the founder and company (e.g., background, roles, sector, and funding) and the success outcome. It consists of 400 (40%) successful and 600 (60%) unsuccessful founders. We classify founders as ‘successful’ if they achieved an M&A or IPO valuation exceeding $500M, or raised more than $500M in total funding [8].
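The labeling rule above can be written as a short predicate. The following is a minimal sketch, not the authors' code: the field names `exit_valuation_usd` and `total_funding_usd` are hypothetical, since the paper does not publish its data schema, and missing values are assumed to count as not meeting the criterion.

```python
from dataclasses import dataclass
from typing import Optional

SUCCESS_THRESHOLD_USD = 500_000_000  # the $500M threshold stated in the text


@dataclass
class FounderOutcome:
    # Hypothetical field names; the paper does not specify its schema.
    exit_valuation_usd: Optional[float]  # M&A or IPO valuation, if any
    total_funding_usd: Optional[float]   # total funding raised, if known


def is_successful(outcome: FounderOutcome) -> bool:
    """Return True if either success criterion from the text is met."""
    # Assumption: a missing value is treated as zero, i.e. criterion not met.
    exit_ok = (outcome.exit_valuation_usd or 0.0) > SUCCESS_THRESHOLD_USD
    funding_ok = (outcome.total_funding_usd or 0.0) > SUCCESS_THRESHOLD_USD
    return exit_ok or funding_ok
```

Both comparisons are strict, matching the wording "exceeding" and "more than" in the definition.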
A. Held-Out Feature Evaluation
To assess whether the features discovered in this pipeline generalize beyond the data used for discovery, we carry out a held-out feature evaluation following standard ML practice. Rather than evaluating on the same data, we explicitly assess these features on a held-out dataset. 1,000 founders are used for discovery, and another 1,000 founders (again with 40% successful) are used to assess feature performance. The discovery set is passed through the pipeline for discovery and refinement. The held-out evaluation set is never exposed to the feature discovery pipeline.

Once discovery is completed, the features are frozen. These frozen features are then applied to the held-out evaluation set, where feature predictive quality is assessed. This ensures that no information in the held-out set influences feature discovery, allowing this evaluation to assess feature generalization.

IV. PIPELINE OVERVIEW
In this work, we present CoFEE, an agent-based pipeline that explicitly enforces structured cognitive behavior during feature discovery. While models such as GPT-5.2 already exhibit certain reasoning behaviors, previous studies suggest that these behaviors can be further strengthened and systematically induced through explicit prompting and structural constraints. With CoFEE, we explore the feasibility of leveraging these enforced cognitive qualities to enhance feature discovery compared to vanilla GPT-5.2. The pipeline (see Fig. 2) breaks down this process into three specialized agents (for discovery, refining, and scoring), using GPT-5.2 for each agent. Viewed as an ML system, CoFEE implements a generator–evaluator loop in which cognitively constrained reasoning guides feature hypothesis generation prior to evaluation.
A. Agent 1: Feature Discovery
Agent 1 is the primary agent of this pipeline and is responsible for proposing candidate features based on the provided dataset. The agent receives structured prompts that explicitly induce cognitive behaviors, including backward chaining, subgoal decomposition, verification, and backtracking. Backward chaining is used to reason from the target outcome to features; subgoal decomposition structures feature discovery around high-level causal categories; verification enforces observability and non-proxy constraints; and backtracking records and rejects invalid reasoning paths. In the vanilla GPT-5.2 baseline, Agent 1 proposes features without these cognitive constraints.

This agent processes the dataset 50 founders at a time, extracting features and adding them to a master list until the entire dataset is analyzed.
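The batching loop described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' implementation: `propose_features` is a hypothetical stand-in for the LLM call made by Agent 1, and tagging each feature with a `batch_id` is an assumed way to keep provenance.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 50  # batch size stated in the text


def iter_batches(records: Sequence[dict], size: int = BATCH_SIZE) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def build_master_list(
    records: Sequence[dict],
    propose_features: Callable[[Sequence[dict]], List[dict]],
) -> List[dict]:
    """Run the (hypothetical) discovery call on every batch and pool the results."""
    master: List[dict] = []
    for batch_index, batch in enumerate(iter_batches(records)):
        # Each call sees only its own batch, mirroring the stateless agent.
        for feature in propose_features(batch):
            # Assumption: tag the originating batch so provenance survives merging.
            master.append({**feature, "batch_id": str(batch_index)})
    return master
```

Since the prompts state that the agent does not know which founders are successful, the records passed to `propose_features` would need to have the outcome label withheld.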
The initial prompts for Agent 1 under CoFEE and vanilla prompting appear below. Full prompts are provided in the Supplement.
a) CoFEE Prompt (Agent 1):
"You are Agent 1: a stateless Feature Discovery agent. You are given a batch of 50 founder records.
You do NOT know which founders are successful or unsuccessful. You have NO memory of previous batches.
Your task is to propose candidate FEATURES that could plausibly distinguish successful from unsuccessful founders.
You are performing Cognitive Feature Reasoning. You must explicitly apply the following cognitive behaviors. You must produce structured outputs and make explicit decisions.
--------------------------------
1. BACKWARD CHAINING
--------------------------------
Start from system-level success or failure.
For each proposed mechanism:
- State the causal hypothesis explicitly.
- Explain why this mechanism would operate *before* success.
- Map the mechanism to at least one measurable or inferable quantity available in the dataset.
You may reason about hidden variables, but the final feature MUST be:
- observable pre-success
- expressible in deterministic logic
If a mechanism cannot be mapped to an observable feature, abandon it.
--------------------------------
2. SUBGOAL SETTING
--------------------------------
Organize exploration into NO MORE THAN 4 subgoals chosen from:
- founder capability formation
- team coordination and complementarity
- market structure and constraints
- early execution dynamics
For each subgoal:
- List candidate mechanisms.
- Maintain the hierarchy: system behavior → mechanism → feature.
If a subgoal:
- collapses into a proxy
- fails observability
- has ambiguous causal direction
then explicitly ABANDON or REVISE the subgoal and explain why.
--------------------------------
3. VERIFICATION
--------------------------------
For each proposed feature, verify:
- it is observable before the success outcome
- it encodes a plausible causal mechanism
- it is not a prestige-based, descriptive, or post-outcome proxy
For each feature, list:
- potential bias sources
- uncertainty or ambiguity
If verification fails, reject the feature.
--------------------------------
4. BACKTRACKING
--------------------------------
Explicitly record every abandoned reasoning path.
For each abandoned path, record:
- why it initially seemed promising
- which constraint caused rejection (proxy risk, leakage, observability failure, causal ambiguity)
Use these abandoned paths to bias future exploration away from similar dead ends.
--------------------------------
b) Vanilla Prompt (Agent 1):
"You are Agent 1: a stateless Feature Discovery agent.
You are given a batch of 50 founder records.
You do NOT know which founders are successful or unsuccessful. You have NO memory of previous batches.
Your task is to propose candidate FEATURES that could plausibly distinguish successful from unsuccessful founders.
Features must be observable pre-success
You must produce structured outputs and make explicit decisions."
B. Agent 2: Feature Consolidation
Agent 2 identifies semantically overlapping features and merges those that represent the same underlying mechanism, as illustrated in Fig. 3. Merging decisions are conservative and preserve feature provenance to maintain interpretability. This step reduces redundancy while ensuring that distinct causal mechanisms remain separate.

Fig. 3. Diagram illustrating the Agent 2 process, in which semantically similar features are compared, their similarity is justified, and the features are combined.
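One way to picture the consolidation step is a greedy grouping loop. The sketch below is an assumption-laden illustration, not the CoFEE implementation: in the paper the similarity judgment is made by an LLM agent, whereas here `same_mechanism` is a hypothetical callable supplied by the caller, and the greedy first-match strategy is chosen only for brevity.

```python
from typing import Callable, Dict, List


def consolidate(
    features: List[Dict[str, str]],
    same_mechanism: Callable[[Dict[str, str], Dict[str, str]], bool],
) -> List[dict]:
    """Greedily merge features that the judge says share one mechanism."""
    merged: List[dict] = []
    for feature in features:
        for group in merged:
            # Conservative rule: merge only when the judge explicitly agrees
            # with the group's representative; otherwise keep features apart.
            if same_mechanism(group["representative"], feature):
                group["merged_from"].append(feature["feature_name"])
                break
        else:
            # No existing group matched, so the feature starts its own group.
            merged.append({
                "representative": feature,
                "merged_from": [feature["feature_name"]],  # provenance trail
            })
    return merged
```

Keeping the `merged_from` list on every group is one simple way to preserve provenance, so that a consolidated feature can still be traced back to the candidates it absorbed.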
C. Agent 3: Scoring
Agent 3 evaluates features by comparing feature names and definitions against founder records, in batches of up to 100 features and 1,000 founders at a time. When a feature matches a founder, the founder is tagged with that feature. After all founders have been tagged, Agent 3 outputs a JSON file recording feature assignments.
This output is then used to deterministically compute feature statistics. Let $n_1$ denote the number of successful founders exhibiting a given feature, and $n_0$ the number of unsuccessful founders exhibiting the feature. We define the success-rate delta (∆SR) as the difference between the success probability conditioned on feature presence and the success probability conditioned on feature absence:

$$\Delta SR = \frac{n_1}{n_1 + n_0} - \frac{N_1 - n_1}{(N_1 - n_1) + (N_0 - n_0)} \tag{1}$$

where $N_1$ and $N_0$ denote the total numbers of successful and unsuccessful founders in the dataset, respectively.

V. EXPERIMENT
We evaluate the impact of cognitively structured prompting on feature discovery using a controlled experimental setup. We compare two feature discovery conditions:
• Cognitive Prompting: where Agent 1 is constrained to explicitly apply the cognitive behaviors: backward chaining, subgoal decomposition, verification, and backtracking.
• Vanilla Prompting: where Agent 1 is prompted to propose features without explicit cognitive constraints.
Between the two conditions, the remaining pipeline components (Agents 2–3) are held constant. This ensures that any observed differences can be attributed to differences in Agent 1 prompting.

A. Evaluation Metric
For each feature, we compute the success-rate delta (∆SR) of Eq. (1): the success rate among founders who exhibit the feature minus the success rate among founders who do not. Larger differences indicate stronger discriminative power. This metric allows us to quantify how effectively a given feature separates successful founders from unsuccessful ones.

To evaluate whether cognitive prompting improves feature quality, we compare the top ten features ranked by ∆SR for each experimental condition. By examining differences in these top-ranked features, we assess whether cognitive prompting leads to systematically stronger or more discriminative features.

VI. RESULTS
A. Feature Discovery Results
To ensure statistical stability and modeling relevance, we restrict analysis to features with marginal support ≥10% of the sample (n ≥ 100). For a binomial proportion with n = 100, the maximum standard error (attained at p = 0.5) is 0.05. This threshold filters out high-variance rare features that are unlikely to generalize out-of-sample and concentrates evaluation on predictors with meaningful population coverage.
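The scoring and filtering steps can be reproduced from the per-feature counts alone. The sketch below is a minimal illustration, assuming the class totals N1 = 400 and N0 = 600 implied by the dataset description (1,000 founders, 40% successful); the function names and the dictionary layout of `counts` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

N1, N0 = 400, 600    # successful / unsuccessful totals implied by the text
MIN_SUPPORT = 0.10   # marginal support threshold from the text


def delta_sr(n1: int, n0: int, total1: int = N1, total0: int = N0) -> float:
    """Eq. (1): P(success | feature present) - P(success | feature absent)."""
    absent = (total1 - n1) + (total0 - n0)
    if n1 + n0 == 0 or absent == 0:
        # The delta is undefined when the feature covers nobody or everybody.
        raise ValueError("feature must be present for some founders and absent for others")
    return n1 / (n1 + n0) - (total1 - n1) / absent


def support(n1: int, n0: int, total1: int = N1, total0: int = N0) -> float:
    """Fraction of all founders that exhibit the feature."""
    return (n1 + n0) / (total1 + total0)


def max_standard_error(n: int) -> float:
    """Largest binomial standard error for sample size n, attained at p = 0.5."""
    return math.sqrt(0.25 / n)


def rank_features(counts: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Drop low-support features, then sort the rest by descending delta."""
    kept = [
        (name, delta_sr(n1, n0))
        for name, (n1, n0) in counts.items()
        if support(n1, n0) >= MIN_SUPPORT
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Applied to the first row of Table I (n1 = 82, n0 = 37), `delta_sr` returns approximately 0.328, in agreement with the reported value.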
The top 10 features discovered using CoFEE (cognitive prompting) are shown in Table I, and the top 10 features discovered using vanilla prompting are shown in Table II. Finally, a side-by-side comparison of CoFEE and vanilla prompting is shown in Table III. Further metric comparisons are presented in the Appendix.
TABLE I
TOP COFEE FEATURES RANKED BY SUCCESS-RATE DELTA (∆SR)

Feature ID                       n1     n0     ∆SR
top_university_education_flag    82     37     0.328
education_top10_qs_flag          67     33     0.300
highest_degree_level             304    292    0.272
education_top50_qs_flag          70     41     0.259
technical_background_flag        190    155    0.230
functional_role_diversity        241    222    0.224
job_count_total                  254    241    0.224
job_tenure_longest_bucket        284    289    0.224
cross_industry_breadth_count     252    239    0.222
functional_breadth_score         245    229    0.222
TABLE II
TOP VANILLA FEATURES RANKED BY SUCCESS-RATE DELTA (∆SR)

Feature ID                               n1     n0     ∆SR
has_top10_qs_education                   81     34     0.344
data_completeness_score                  281    277    0.234
max_role_seniority_level                 269    269    0.216
education_qs_rank_best_numeric           217    201    0.205
has_senior_executive_role                209    192    0.202
functional_background_primary            282    302    0.199
role_seniority_score_max                 262    273    0.193
founder_tenure_years_estimate            144    122    0.193
max_seniority_level                      263    275    0.192
has_any_industry_match_prior_job_flag    116    97     0.184
TABLE III
COMPARISON OF FEATURE QUALITY AND EFFICIENCY BETWEEN COFEE AND VANILLA PROMPTING. FEATURE QUALITY IS EVALUATED USING SUCCESS-RATE DELTA (∆SR) ON HELD-OUT DATA.

Metric                      CoFEE     Vanilla
Mean ∆SR (Top-10)           0.250     0.217
Median ∆SR                  0.227     0.204
Total Features Generated    157       222
Total Cost (USD)            $8.54     $18.29
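The relative differences quoted in the abstract can be recovered from the Table III entries. The following is a worked check of that arithmetic, not an additional result:

```latex
\begin{align*}
\text{Relative gain in mean } \Delta SR &= \frac{0.250 - 0.217}{0.217} \approx 15.2\% \\
\text{Reduction in features generated} &= \frac{222 - 157}{222} \approx 29.3\% \\
\text{Reduction in total cost} &= \frac{18.29 - 8.54}{18.29} \approx 53.3\%
\end{align*}
```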
B. Feature Discovery Comparison and Interpretation
Overall, CoFEE produces a broader distribution of highly predictive features: its ∆SR values exceed those of vanilla prompting at every rank from second to tenth, although the single top-ranked vanilla feature scores higher (0.344 vs. 0.328).
VII. DISCUSSION
The results of this study suggest that explicitly structuring cognitive prompts in an LLM, specifically GPT-5.2, for feature discovery has a measurable impact on the quality of the resulting features. When cognitive behaviors such as backward chaining, subgoal decomposition, verification, and backtracking are enforced, the discovered features achieve higher predictability scores than those generated via vanilla prompting. Because all other pipeline components are held constant, the observed improvements are consistent with an effect of cognitive structure in feature discovery. This indicates that cognitive prompting provides a structured constraint on LLM feature generation behavior.

Comparing costs across the two approaches, the CoFEE evaluation incurred a total cost of $8.54, whereas the vanilla evaluation cost $18.29. The higher cost of the vanilla approach is attributable to the larger number of features it produced (222 features, compared to CoFEE’s 157). The increased feature count in the vanilla setting resulted in a greater number of API calls during the feature merging and scoring stages. These results suggest that inducing cognitive behaviors via prompting functions as an effective inductive bias in LLM-based feature discovery systems.

Despite the empirical improvements observed, important limitations remain. First, the results are evaluated on a single domain and do not establish generalization beyond this setting. Second, ∆SR captures empirical predictability differences but does not directly assess downstream task performance. Third, the induced cognitive behavior prompts themselves, and the robustness of these effects across alternative models, prompting formulations, and scales, have not yet been evaluated.

Future work will evaluate CoFEE across multiple domains, measure downstream model performance when incorporating discovered features, and examine robustness across different model architectures and prompt structures.
VIII. CONCLUSION
CoFEE provides empirical evidence that explicit reasoning control can improve LLM-based feature discovery while reducing costs by 53.3% in our evaluated setting. By inducing structured cognitive behaviors, the pipeline produces features with higher ∆SR values at lower computational cost relative to the vanilla prompt baseline. These findings suggest that reasoning control may serve as a practical pipeline design strategy for LLM-based analytical systems.
REFERENCES
[1] S. Xiong, Y. Ihlamur, F. Alican, and A. O. Yin, “GPTree: Towards explainable decision-making via LLM-powered decision trees,” 2024.
[2] J. Heaton, “An empirical analysis of feature engineering for predictive modeling,” 2016.
[3] Y. Wang, S. Wu, Y. Zhang, W. Wang, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,” 2025.
[4] J. Wu, M. Feng, S. Zhang, F. Lv, R. Jin, F. Che, Z. Wen, and J. Tao, “Boosting multimodal reasoning with automated structured thinking,” 2025.
[5] K. Gandhi, A. Chakravarthy, A. Singh, N. Lile, and N. D. Goodman, “Cognitive behaviors that enable self-improving reasoners: Four habits of highly effective STaRs,” 2025.
[6] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” 2022.
[7] Y. Zhuang, Q. Liu, Z. A. Pardos, P. C. Kyllonen, J. Zu, Z. Huang, S. Wang, and E. Chen, “Position: AI evaluation should learn from how we test humans,” 2024.
[8] R. Chen, J. Ternasky, A. S. Kwesi, B. Griffin, A. O. Yin, Z. Salifu, K. Amoaba, X. Mu, F. Alican, and Y. Ihlamur, “VCBench: Benchmarking LLMs in venture capital,” 2025.
IX. APPENDIX
Two tables are included comparing feature quality metrics for the two discovery approaches. Precision denotes the conditional success probability among founders exhibiting the feature, P(Y = 1 | f = 1). Support represents the proportion of the dataset exhibiting the feature, (n1 + n0)/(N1 + N0), reflecting population coverage. Lift measures relative improvement over the baseline success rate and is defined as Precision/0.4, where 0.4 is the global base rate of success in the dataset.

TABLE IV
TOP COFEE FEATURES WITH EXTENDED METRICS
(Rows F1–F10 follow the row order of Table I.)

Feature    n1     n0     Precision    ∆SR      Lift     Support
F1         82     37     0.689        0.328    1.723    0.119
F2         67     33     0.670        0.300    1.675    0.100
F3         304    292    0.510        0.272    1.275    0.596
F4         70     41     0.631        0.259    1.577    0.111
F5         190    155    0.551        0.230    1.377    0.345
F6         241    222    0.521        0.224    1.301    0.463
F7         254    241    0.513        0.224    1.283    0.495
F8         284    289    0.496        0.224    1.239    0.573
F9         252    239    0.513        0.222    1.283    0.491
F10        245    229    0.517        0.222    1.292    0.474
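The extended metrics follow directly from the counts n1 and n0. As a minimal sketch of the definitions above, assuming a dataset of 1,000 founders with a 0.4 base rate as stated in the text (the function names are illustrative):

```python
BASE_RATE = 0.4        # global success base rate stated in the text
TOTAL_FOUNDERS = 1000  # N1 + N0, from the dataset description


def precision(n1: int, n0: int) -> float:
    """P(Y = 1 | f = 1): success rate among founders exhibiting the feature."""
    return n1 / (n1 + n0)


def lift(n1: int, n0: int, base_rate: float = BASE_RATE) -> float:
    """Precision relative to the global base rate of success."""
    return precision(n1, n0) / base_rate


def coverage(n1: int, n0: int, total: int = TOTAL_FOUNDERS) -> float:
    """Support: share of the dataset exhibiting the feature."""
    return (n1 + n0) / total
```

For row F1 of Table IV (n1 = 82, n0 = 37), these give a precision of about 0.689, a lift of about 1.723, and a support of 0.119, matching the tabulated values.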
TABLE V
TOP VANILLA FEATURES WITH EXTENDED METRICS
(Rows F1–F10 follow the row order of Table II.)

Feature    n1     n0     Precision    ∆SR      Lift     Support
F1         81     34     0.704        0.344    1.761    0.115
F2         281    277    0.504        0.234    1.259    0.558
F3         269    269    0.500        0.216    1.250    0.538
F4         217    201    0.519        0.205    1.298    0.418
F5         209    192    0.521        0.202    1.303    0.401
F6         282    302    0.483        0.199    1.207    0.584
F7         262    273    0.490        0.193    1.224    0.535
F8         144    122    0.541        0.193    1.353    0.266
F9         263    275    0.489        0.192    1.222    0.538
F10        116    97     0.545        0.184    1.362    0.213

X. SUPPLEMENTARY
To ensure reproducibility and structured evaluation of feature discovery outputs, we provide the exact JSON schema used by Agent 1. This template captures not only the final feature specification, but also the intermediate reasoning structure (subgoals, causal mechanisms, and abandoned ideas), enabling inspection of reasoning control effects.
{
  "batch_id": "string",
  "features": [
    {
      "feature_name": "string",
      "subgoal": "string",
      "causal_mechanism": "string",
      "definition": "string",
      "computation_logic": "string",
      "abandoned_ideas": [
        {
          "idea": "string",
          "reason": "string"
        }
      ]
    }
  ]
}
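A pipeline consuming Agent 1 output would typically check each response against this template before passing it on. The sketch below shows one possible check using only the Python standard library. It is an illustration rather than part of CoFEE: the function name `validate_agent1_output` is hypothetical, and treating every key in the template as required is an assumption.

```python
import json
from typing import List

# Keys taken from the template above; treating all of them as required is an assumption.
FEATURE_KEYS = {
    "feature_name", "subgoal", "causal_mechanism",
    "definition", "computation_logic", "abandoned_ideas",
}
IDEA_KEYS = {"idea", "reason"}


def validate_agent1_output(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the payload fits the template."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["top level must be an object"]

    problems: List[str] = []
    if not isinstance(payload.get("batch_id"), str):
        problems.append("batch_id must be a string")
    features = payload.get("features")
    if not isinstance(features, list):
        return problems + ["features must be a list"]

    for i, feature in enumerate(features):
        if not isinstance(feature, dict):
            problems.append(f"features[{i}] must be an object")
            continue
        missing = FEATURE_KEYS - feature.keys()
        if missing:
            problems.append(f"features[{i}] missing {sorted(missing)}")
        ideas = feature.get("abandoned_ideas", [])
        if not isinstance(ideas, list):
            problems.append(f"features[{i}].abandoned_ideas must be a list")
            continue
        for j, idea in enumerate(ideas):
            # Each abandoned idea needs both the idea text and the rejection reason.
            if not isinstance(idea, dict) or IDEA_KEYS - idea.keys():
                problems.append(f"features[{i}].abandoned_ideas[{j}] needs idea and reason")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every deviation in a batch, which is useful when inspecting how often the model departs from the template.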