Research Papers

CoFEE: Reasoning Control for LLM-Based Feature Discovery

Document ID: research-papers-cofee-reasoning-control-for-llm-based-feature-discovery

Full content

CoFEE: Reasoning Control for LLM-Based Feature Discovery

Maximilian Westermann∗, Ben Griffin∗, Aaron Ontoyin Yin†, Zakari Salifu†, Yagiz Ihlamur‡, Kelvin Amoaba†, Joseph Ternasky†, Fuat Alican†, Yigit Ihlamur†
∗University of Oxford †Vela Research ‡Amazon

Abstract—Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our approach provides a structured method for addressing this challenge. LLMs are well suited to this task because they can process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors that improve feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors, which have been exploited with success in ML models, include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. CoFEE does not modify model architecture, training, or inference; reasoning control is implemented entirely through structured prompts. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those produced under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%.
Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in the quality and efficiency of LLM-based feature discovery.

Index Terms—Reasoning Control, Feature Engineering, Inductive Bias

I. INTRODUCTION

Feature discovery is a critical step in analytical pipelines, particularly in domains that rely heavily on unstructured or complex data such as venture capital (VC). In these settings, relevant signals are often implicit, non-linear, and distributed across unstructured sources, making manual feature construction difficult.

While interpretable, rule-based models such as linear models and decision trees offer transparent, auditable reasoning, they often struggle to capture the complexity of available data [1]. This limits the predictive effectiveness of the resulting features. Recent work such as GPTree partially addresses this limitation by integrating large language models (LLMs) into a decision tree, allowing transparent and interpretable decision tree reasoning to provide valuable information for VC decision-making [1].

Fig. 1. Overview of the CoFEE pipeline.

This work was done outside of the current role that Yagiz Ihlamur holds at Amazon. Correspondence: Maximilian Westermann <maximilian.westermann@wadham.ox.ac.uk>.

A complementary challenge, feature engineering, arises upstream of model construction. Previous work has shown that well-structured features can improve predictability across learning algorithms [2], yet feature discovery remains a major bottleneck, especially when extracting features from unstructured data. LLMs offer automated feature discovery as a potential solution, given their ability to interpret large amounts of data using reasoning [3]. However, insufficient prompting of LLMs to propose features can produce features with limited predictability [4].
This raises a key question: How can reasoning behaviors induced by cognitive prompting act as inductive biases to produce highly predictive and interpretable features using LLMs?

Recent work has identified a set of reasoning behaviors, such as backward chaining, subgoal decomposition, verification, and backtracking, that correlate with improved learning and self-improvement in language models [5], [6]. Complementary research further supports this perspective, arguing that AI evaluation should mirror human testing practices by explicitly evaluating reasoning processes rather than evaluating only final outputs [7]. In LLMs, such behaviors can be instantiated through prompt-level structure, motivating cognitive prompting as a practical control interface over model reasoning.

arXiv:2604.21584v1 [cs.AI] 23 Apr 2026

Fig. 2. CoFEE pipeline. Agent 1 performs cognitive feature selection to construct an initial master list, which is refined via semantic similarity by Agent 2. Agent 3 then scores each feature by counting the number of successful and unsuccessful founders exhibiting it.

Motivated by these findings, we introduce CoFEE (Cognitive Feature Engineering Engine), an agent-based pipeline that is used to produce features (Fig. 1). Analogous to how GPTree structures LLM reasoning with decision trees, CoFEE structures LLM reasoning upstream during feature discovery. Using CoFEE, we conduct a controlled empirical study comparing two versions of the same pipeline: one in which feature discovery is carried out by an agent using cognitive prompting, and a baseline that uses vanilla GPT-5.2 for feature discovery.

In this work, we find that cognitive prompting consistently produces features with higher predictability scores. These findings suggest that prompting cognitive behaviors can serve as an effective design strategy for LLM-based feature engineering.

II. REASONING CONTROL VIA COGNITIVE CONSTRAINTS

We use the term reasoning control to describe how we constrain LLMs’ reasoning behavior, implemented in our framework through cognitive prompting. In machine learning terms, this corresponds to imposing structured inductive biases on the feature generation process. These constraints are implemented using prompts, enforcing discipline during generation.

In CoFEE, reasoning control is implemented through four cognitive behavior constraints as described in Gandhi et al. [5]:
• Backward chaining: Start from the desired outcome and reason backward.
• Subgoal decomposition: Break down complex tasks into smaller, manageable steps.
• Verification: Systematically check the results of each intermediate step.
• Backtracking: Revise steps explicitly when they fail.

III. DATASET

To validate the effectiveness of inducing cognitive behaviors in LLMs, we use a dataset consisting of 1,000 founder profiles collected from publicly available sources. This dataset contains information about the founder and company (e.g., background, roles, sector, funding) and the success outcome. It consists of 400 (40%) successful and 600 (60%) unsuccessful founders. We classify founders as ’successful’ if they achieved an M&A or IPO valuation exceeding $500M, or raised more than $500M in total funding [8].

A. Held-Out Feature Evaluation

To assess whether the features discovered in this pipeline generalize beyond the data used for discovery, we carry out a held-out feature evaluation following standard ML practice. Rather than evaluating on the same data, we explicitly assess these features on a held-out dataset. 1,000 founders are used for discovery, and we use another 1,000 founders (again with 40% successful) to assess feature performance. The discovery set is passed through the pipeline for discovery and refinement. The held-out evaluation set is never exposed to the feature discovery pipeline.
Once discovery is completed, the features are frozen. These frozen features are then applied to the held-out evaluation set, where feature predictive quality is assessed. This ensures that no information in the held-out set influences feature discovery, allowing this evaluation to assess feature generalization.

IV. PIPELINE OVERVIEW

In this work, we present CoFEE, an agent-based pipeline that explicitly enforces structured cognitive behavior during feature discovery. While models such as GPT-5.2 already exhibit certain reasoning behaviors, previous studies suggest that these behaviors can be further strengthened and systematically induced through explicit prompting and structural constraints. With CoFEE, we explore the feasibility of leveraging these enforced cognitive qualities to enhance feature discovery compared to vanilla GPT-5.2. The pipeline (see Fig. 2) breaks down this process into three specialized agents (for discovery, consolidation, and scoring), each using GPT-5.2. Viewed as an ML system, CoFEE implements a generator–evaluator loop in which cognitively constrained reasoning guides feature hypothesis generation prior to evaluation.

A. Agent 1: Feature Discovery

Agent 1 is the primary agent for this pipeline and is responsible for proposing candidate features based on the provided dataset. The agent receives structured prompts that explicitly induce cognitive behaviors, including backward chaining, subgoal decomposition, verification, and backtracking. Backward chaining is used to reason from the target outcome to features; subgoal decomposition structures feature discovery around high-level causal categories; verification enforces observability and non-proxy constraints; and backtracking records and rejects invalid reasoning paths. In the vanilla condition, Agent 1 proposes features without these cognitive constraints.
This agent processes the dataset 50 founders at a time, extracting features and adding them to a master list until the entire dataset is analyzed. The initial prompts for Agent 1 under CoFEE and vanilla prompting appear below. Full prompts are provided in the Supplement.

a) CoFEE Prompt (Agent 1):

"You are Agent 1: a stateless Feature Discovery agent.
You are given a batch of 50 founder records.
You do NOT know which founders are successful or unsuccessful.
You have NO memory of previous batches.

Your task is to propose candidate FEATURES that could plausibly distinguish successful from unsuccessful founders.

You are performing Cognitive Feature Reasoning.
You must explicitly apply the following cognitive behaviors.
You must produce structured outputs and make explicit decisions.

--------------------------------
1. BACKWARD CHAINING
--------------------------------
Start from system-level success or failure.

For each proposed mechanism:
- State the causal hypothesis explicitly.
- Explain why this mechanism would operate *before* success.
- Map the mechanism to at least one measurable or inferable quantity available in the dataset.

You may reason about hidden variables, but the final feature MUST be:
- observable pre-success
- expressible in deterministic logic

If a mechanism cannot be mapped to an observable feature, abandon it.

--------------------------------
2. SUBGOAL SETTING
--------------------------------
Organize exploration into NO MORE THAN 4 subgoals chosen from:
- founder capability formation
- team coordination and complementarity
- market structure and constraints
- early execution dynamics

For each subgoal:
- List candidate mechanisms.
- Maintain the hierarchy: system behavior → mechanism → feature.

If a subgoal:
- collapses into a proxy
- fails observability
- has ambiguous causal direction
then explicitly ABANDON or REVISE the subgoal and explain why.

--------------------------------
3. VERIFICATION
--------------------------------
For each proposed feature, verify:
- it is observable before the success outcome
- it encodes a plausible causal mechanism
- it is not a prestige-based, descriptive, or post-outcome proxy

For each feature, list:
- potential bias sources
- uncertainty or ambiguity

If verification fails, reject the feature.

--------------------------------
4. BACKTRACKING
--------------------------------
Explicitly record every abandoned reasoning path.

For each abandoned path, record:
- why it initially seemed promising
- which constraint caused rejection (proxy risk, leakage, observability failure, causal ambiguity)

Use these abandoned paths to bias future exploration away from similar dead ends.
--------------------------------

b) Vanilla Prompt (Agent 1):

"You are Agent 1: a stateless Feature Discovery agent.
You are given a batch of 50 founder records.
You do NOT know which founders are successful or unsuccessful.
You have NO memory of previous batches.

Your task is to propose candidate FEATURES that could plausibly distinguish successful from unsuccessful founders.
Features must be observable pre-success.
You must produce structured outputs and make explicit decisions."

B. Agent 2: Feature Consolidation

Agent 2 identifies semantically overlapping features and merges those that represent the same underlying mechanism, as demonstrated in Fig. 3. Merging decisions are conservative and preserve feature provenance to maintain interpretability. This step reduces redundancy while ensuring that distinct causal mechanisms remain separate.

Fig. 3. Diagram illustrating the Agent 2 process, in which semantically similar features are compared, their similarity is justified, and the features are combined.

C. Agent 3: Scoring

Agent 3 evaluates features by comparing feature names and definitions against founder records in batches of up to 100 features and 1,000 founders at a time.
When a feature matches a founder, the founder is tagged with that feature. After all founders have been tagged, Agent 3 outputs a JSON file recording feature assignments. This output is then used to deterministically compute feature statistics. Let n1 denote the number of successful founders exhibiting a given feature, and n0 the number of unsuccessful founders exhibiting the feature. We define the success-rate delta (∆SR) as the difference between the success probability conditioned on feature presence and the success probability conditioned on feature absence:

$$\Delta\mathrm{SR} = \frac{n_1}{n_1 + n_0} - \frac{N_1 - n_1}{(N_1 - n_1) + (N_0 - n_0)} \tag{1}$$

where N1 and N0 denote the total numbers of successful and unsuccessful founders in the dataset, respectively.

V. EXPERIMENT

We evaluate the impact of cognitively structured prompting on feature discovery using a controlled experimental setup. We compare two feature discovery conditions:
• Cognitive Prompting: where Agent 1 is constrained to explicitly apply the cognitive behaviors: backward chaining, subgoal decomposition, verification, and backtracking.
• Vanilla Prompting: where Agent 1 is prompted to propose features without explicit cognitive constraints.

The remaining pipeline components (Agents 2–3) are held constant across the two conditions. This ensures that any observed differences can be attributed to differences in Agent 1 prompting.

A. Evaluation Metric

For each feature, we compute a success-rate delta (∆SR), which measures the difference in success rates between founders who exhibit the feature and those who do not. Following Eq. (1), this is the success rate among founders tagged with the feature minus the success rate among founders not tagged with it, with larger differences indicating stronger discriminative power. This metric allows us to quantify how effectively a given feature set separates successful founders from unsuccessful ones.
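As a minimal sketch of Eq. (1), the function below computes ∆SR from the four counts. The function and argument names are illustrative assumptions rather than part of the CoFEE implementation; the counts used in the usage note come from Tables I and II, with N1 = 400 and N0 = 600 as stated in the Dataset section.

```python
def success_rate_delta(n1: int, n0: int, total_success: int, total_failure: int) -> float:
    """Success rate with the feature minus success rate without it (Eq. 1).

    n1 / n0: successful / unsuccessful founders exhibiting the feature.
    total_success / total_failure: N1 / N0, the dataset-wide totals.
    """
    with_feature = n1 + n0
    without_feature = (total_success - n1) + (total_failure - n0)
    if with_feature == 0 or without_feature == 0:
        # One of the two conditional success rates is undefined.
        raise ValueError("feature must be present for some founders and absent for others")
    rate_present = n1 / with_feature
    rate_absent = (total_success - n1) / without_feature
    return rate_present - rate_absent
```

Called with the first row of Table I, `success_rate_delta(82, 37, 400, 600)` rounds to 0.328, matching the reported value; the first row of Table II (81, 34) rounds to 0.344.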
To evaluate whether cognitive prompting improves feature quality, we compare the top ten features ranked by ∆SR for each experimental condition. By examining differences in these top-ranked features, we assess whether cognitive prompting leads to systematically stronger or more discriminative features.

VI. RESULTS

A. Feature Discovery Results

To ensure statistical stability and modeling relevance, we restrict analysis to features with marginal support ≥ 10% of the sample (n ≥ 100). For a binomial proportion with n = 100, the maximum standard error (attained at p = 0.5) is 0.05. This threshold filters high-variance rare features that are unlikely to generalize out-of-sample and concentrates evaluation on predictors with meaningful population coverage.

The top 10 features discovered using CoFEE (cognitive prompting) are shown in Table I. The top 10 features discovered using vanilla prompting are shown in Table II. Finally, the side-by-side comparison of features discovered by CoFEE and vanilla prompting is shown in Table III. Further metric comparisons are presented in the Appendix.
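The support threshold and the standard-error bound quoted above can be checked with a short sketch. The helper names are hypothetical and the 10% default simply mirrors the threshold stated in the text; this is an illustration, not the filtering code used in the study.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a binomial proportion p estimated from n observations."""
    return math.sqrt(p * (1.0 - p) / n)


def meets_support(n1: int, n0: int, sample_size: int, min_support: float = 0.10) -> bool:
    """True if the feature is exhibited by at least min_support of the sample."""
    return (n1 + n0) / sample_size >= min_support
```

The product p(1 − p) is largest at p = 0.5, so `binomial_standard_error(0.5, 100)` gives the worst case of 0.05 for a feature at the minimum support of 100 founders. A feature such as `education_top10_qs_flag` (67 + 33 = 100 founders out of 1,000) sits exactly on the threshold and is retained.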
TABLE I
TOP COFEE FEATURES RANKED BY SUCCESS-RATE DELTA (∆SR)

Feature ID                               n1    n0    ∆SR
top_university_education_flag            82    37    0.328
education_top10_qs_flag                  67    33    0.300
highest_degree_level                     304   292   0.272
education_top50_qs_flag                  70    41    0.259
technical_background_flag                190   155   0.230
functional_role_diversity                241   222   0.224
job_count_total                          254   241   0.224
job_tenure_longest_bucket                284   289   0.224
cross_industry_breadth_count             252   239   0.222
functional_breadth_score                 245   229   0.222

TABLE II
TOP VANILLA FEATURES RANKED BY SUCCESS-RATE DELTA (∆SR)

Feature ID                               n1    n0    ∆SR
has_top10_qs_education                   81    34    0.344
data_completeness_score                  281   277   0.234
max_role_seniority_level                 269   269   0.216
education_qs_rank_best_numeric           217   201   0.205
has_senior_executive_role                209   192   0.202
functional_background_primary            282   302   0.199
role_seniority_score_max                 262   273   0.193
founder_tenure_years_estimate            144   122   0.193
max_seniority_level                      263   275   0.192
has_any_industry_match_prior_job_flag    116   97    0.184

TABLE III
COMPARISON OF FEATURE QUALITY AND EFFICIENCY BETWEEN COFEE AND VANILLA PROMPTING. FEATURE QUALITY IS EVALUATED USING SUCCESS-RATE DELTA (∆SR) ON HELD-OUT DATA.

Metric                      CoFEE    Vanilla
Mean ∆SR (Top-10)           0.250    0.217
Median ∆SR                  0.227    0.204
Total Features Generated    157      222
Total Cost (USD)            $8.54    $18.29

B. Feature Discovery Comparison and Interpretation

Overall, CoFEE produces a broader distribution of highly predictive features, with consistently higher ∆SR values across the top-ranked features compared to vanilla prompting.

VII. DISCUSSION

The results of this study suggest that explicitly structuring cognitive prompts in an LLM, specifically GPT-5.2, for feature discovery has a measurable impact on the quality of the resulting features. When cognitive behaviors such as backward chaining, subgoal decomposition, verification, and backtracking are enforced, the discovered features achieve higher predictability scores than those generated via vanilla prompting.
Because all other pipeline components are held constant, the observed improvements are consistent with an effect of cognitive structure in feature discovery. This indicates that cognitive prompting provides a structured constraint on LLM feature generation behavior.

Comparing costs across the two approaches, the CoFEE evaluation incurred a total cost of $8.54, whereas the vanilla evaluation cost $18.29. The higher cost of the vanilla approach is attributable to the larger number of features it produced (222, compared to CoFEE’s 157). The increased feature count in the vanilla setting resulted in a greater number of API calls during the feature merging and scoring stages. These results suggest that inducing cognitive behaviors via prompting functions as an effective inductive bias in LLM-based feature discovery systems.

Despite the empirical improvements observed, important limitations remain. First, the results are evaluated on a single domain and do not establish generalization beyond this setting. Second, ∆SR captures empirical predictability differences but does not directly assess downstream task performance. Third, the robustness of these effects across alternative models, prompting formulations, and scales has not yet been evaluated.

Future work will evaluate CoFEE across multiple domains, measure downstream model performance when incorporating discovered features, and examine robustness across different model architectures and prompt structures.

VIII. CONCLUSION

CoFEE provides empirical evidence that explicit reasoning control can improve LLM-based feature discovery while reducing costs by 53.3% in our evaluated setting. By inducing structured cognitive behaviors, the pipeline produces features with higher ∆SR values at lower computational cost relative to the vanilla prompt baseline.
These findings suggest that reasoning control may serve as a practical pipeline design strategy for LLM-based analytical systems.

REFERENCES

[1] S. Xiong, Y. Ihlamur, F. Alican, and A. O. Yin, “GPTree: Towards explainable decision-making via LLM-powered decision trees,” 2024.
[2] J. Heaton, “An empirical analysis of feature engineering for predictive modeling,” 2016.
[3] Y. Wang, S. Wu, Y. Zhang, W. Wang, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,” 2025.
[4] J. Wu, M. Feng, S. Zhang, F. Lv, R. Jin, F. Che, Z. Wen, and J. Tao, “Boosting multimodal reasoning with automated structured thinking,” 2025.
[5] K. Gandhi, A. Chakravarthy, A. Singh, N. Lile, and N. D. Goodman, “Cognitive behaviors that enable self-improving reasoners: Four habits of highly effective STaRs,” 2025.
[6] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” 2022.
[7] Y. Zhuang, Q. Liu, Z. A. Pardos, P. C. Kyllonen, J. Zu, Z. Huang, S. Wang, and E. Chen, “Position: AI evaluation should learn from how we test humans,” 2024.
[8] R. Chen, J. Ternasky, A. S. Kwesi, B. Griffin, A. O. Yin, Z. Salifu, K. Amoaba, X. Mu, F. Alican, and Y. Ihlamur, “VCBench: Benchmarking LLMs in venture capital,” 2025.

IX. APPENDIX

Tables IV and V compare feature quality metrics for the two discovery approaches. Precision denotes the conditional success probability among founders exhibiting the feature, P(Y = 1 | f = 1). Support represents the proportion of the dataset exhibiting the feature, (n1 + n0)/(N1 + N0), reflecting population coverage. Lift measures relative improvement over the baseline success rate and is defined as Precision/0.4, where 0.4 is the global base rate of success in the dataset.
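The three appendix metrics can be sketched directly from their definitions. The function name, the returned dictionary keys, and the default totals (N1 = 400, N0 = 600, taken from the Dataset section) are assumptions made for illustration.

```python
def extended_metrics(n1: int, n0: int, total_success: int = 400, total_failure: int = 600) -> dict:
    """Precision, support, and lift for one feature, following the appendix definitions."""
    population = total_success + total_failure
    base_rate = total_success / population   # global success rate, 0.4 in this dataset
    precision = n1 / (n1 + n0)               # P(Y = 1 | f = 1)
    support = (n1 + n0) / population         # share of founders exhibiting the feature
    lift = precision / base_rate             # improvement relative to the base rate
    return {"precision": precision, "support": support, "lift": lift}
```

For the top CoFEE feature (n1 = 82, n0 = 37) this yields a precision of about 0.689, a support of 0.119, and a lift of about 1.723, in line with row F1 of Table IV.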
TABLE IV
TOP COFEE FEATURES WITH EXTENDED METRICS

Feature   n1    n0    Precision   ∆SR     Lift    Support
F1        82    37    0.689       0.328   1.723   0.119
F2        67    33    0.670       0.300   1.675   0.100
F3        304   292   0.510       0.272   1.275   0.596
F4        70    41    0.631       0.259   1.577   0.111
F5        190   155   0.551       0.230   1.377   0.345
F6        241   222   0.521       0.224   1.301   0.463
F7        254   241   0.513       0.224   1.283   0.495
F8        284   289   0.496       0.224   1.239   0.573
F9        252   239   0.513       0.222   1.283   0.491
F10       245   229   0.517       0.222   1.292   0.474

TABLE V
TOP VANILLA FEATURES WITH EXTENDED METRICS

Feature   n1    n0    Precision   ∆SR     Lift    Support
F1        81    34    0.704       0.344   1.761   0.115
F2        281   277   0.504       0.234   1.259   0.558
F3        269   269   0.500       0.216   1.250   0.538
F4        217   201   0.519       0.205   1.298   0.418
F5        209   192   0.521       0.202   1.303   0.401
F6        282   302   0.483       0.199   1.207   0.584
F7        262   273   0.490       0.193   1.224   0.535
F8        144   122   0.541       0.193   1.353   0.266
F9        263   275   0.489       0.192   1.222   0.538
F10       116   97    0.545       0.184   1.362   0.213

X. SUPPLEMENTARY

To ensure reproducibility and structured evaluation of feature discovery outputs, we provide the exact JSON schema used by Agent 1. This template captures not only the final feature specification, but also the intermediate reasoning structure (subgoals, causal mechanisms, and abandoned ideas), enabling inspection of reasoning control effects.

    {
      "batch_id": "string",
      "features": [
        {
          "feature_name": "string",
          "subgoal": "string",
          "causal_mechanism": "string",
          "definition": "string",
          "computation_logic": "string",
          "abandoned_ideas": [
            {
              "idea": "string",
              "reason": "string"
            }
          ]
        }
      ]
    }
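One way to use this template is to check each Agent 1 response against it before adding features to the master list. The sketch below is a hand-written structural check using only the Python standard library; the validator, its error messages, and the example payload are hypothetical and are not taken from the CoFEE code.

```python
FEATURE_FIELDS = ("feature_name", "subgoal", "causal_mechanism",
                  "definition", "computation_logic")
ABANDONED_FIELDS = ("idea", "reason")


def _missing_strings(obj: dict, fields: tuple, path: str) -> list:
    """Report every field in `fields` that is absent or not a string."""
    return [f"{path}.{name} must be a string"
            for name in fields if not isinstance(obj.get(name), str)]


def validate_agent1_output(payload) -> list:
    """Return a list of schema problems; an empty list means the payload conforms."""
    if not isinstance(payload, dict):
        return ["$ must be an object"]
    problems = _missing_strings(payload, ("batch_id",), "$")
    features = payload.get("features")
    if not isinstance(features, list):
        return problems + ["$.features must be a list"]
    for i, feature in enumerate(features):
        path = f"$.features[{i}]"
        if not isinstance(feature, dict):
            problems.append(f"{path} must be an object")
            continue
        problems += _missing_strings(feature, FEATURE_FIELDS, path)
        abandoned = feature.get("abandoned_ideas")
        if not isinstance(abandoned, list):
            problems.append(f"{path}.abandoned_ideas must be a list")
            continue
        for j, idea in enumerate(abandoned):
            idea_path = f"{path}.abandoned_ideas[{j}]"
            if not isinstance(idea, dict):
                problems.append(f"{idea_path} must be an object")
                continue
            problems += _missing_strings(idea, ABANDONED_FIELDS, idea_path)
    return problems


# Hypothetical payload with placeholder values, for illustration only.
example = {
    "batch_id": "batch-001",
    "features": [{
        "feature_name": "example_feature_flag",
        "subgoal": "founder capability formation",
        "causal_mechanism": "placeholder mechanism",
        "definition": "placeholder definition",
        "computation_logic": "placeholder logic",
        "abandoned_ideas": [{"idea": "placeholder idea", "reason": "proxy risk"}],
    }],
}
```

`validate_agent1_output(example)` returns an empty list, while a malformed response produces one message per violated field, which makes it straightforward to log or retry a batch.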

Chunks

Chunk 0

Chunk 1

LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery.

Chunk 2

We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model.

Chunk 3

These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposi- tion, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. CoFEE does not modify model architecture, training, or inference; reasoning control is implemented entirely through structured prompts.

Chunk 4

In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%.

Chunk 5

Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.

Chunk 6

Index Terms—Reasoning Control, Feature Engineering, Induc- tive Bias I. INTRODUCTION F EATURE discovery is a critical step in analytical pipelines, particularly in domains that rely heavily on unstructured or complex data such as venture capital (VC).

Chunk 7

In these settings, relevant signals are often implicit, non-linear, and distributed across unstructured sources, making manual feature construction difficult. While interpretable, rule-based models such as linear mod- els and decision trees offer transparent, auditable reasoning, they often struggle in capturing the complexity of available data [1].

Chunk 8

This leads to the inability to maximize the predictive This work was done outside of the current role that Yagiz Ihlamur holds at Amazon. Correspondence: Maximilian Westermann <maximil- ian.westermann@wadham.ox.ac.uk>.

Chunk 9

Fig. 1.

Chunk 10

Overview of the CoFEE pipeline. effectiveness of features.

Chunk 11

Recent work such as GPTree ad- dresses this limitation partially by integrating large language models (LLMs) into a decision tree, allowing transparent and interpretable decision tree reasoning to provide valuable information for VC decision-making [1]. A complementary challenge, feature engineering, arises upstream of model construction.

Chunk 12

Previous work has shown that well-structured features can improve predictability across learning algorithms [2], yet feature discovery remains a major bottleneck, especially in the extraction of unstructured data. LLMs offer automated feature discovery as a potential solu- tion, with their ability to interpret large amounts of data using reasoning [3].

Chunk 13

However, insufficient prompting of LLMs to propose features can produce features with limited predictabil- ity [4]. This raises a key question: How can reasoning behaviors induced by cognitive prompting act as inductive biases to pro- duce highly predictive and interpretable features using LLMs?

Chunk 14

Recent work has identified a set of reasoning behaviors such as backward chaining, subgoal decomposition, verification, and backtracking that correlate with improved learning and self-improvement in language models [5], [6]. Complementary arXiv:2604.21584v1 [cs.AI] 23 Apr 2026 --- Page 2 --- 2 Fig.

Chunk 15

2. CoFEE pipeline.

Chunk 16

Agent 1 performs cognitive feature selection to construct an initial master list, which is refined via semantic similarity by Agent 2. Agent 3 then scores each feature by counting the number of successful and unsuccessful founders exhibiting it.

Chunk 17

research further supports this perspective, arguing that AI evaluation should mirror human testing practices by explicitly evaluating reasoning processes rather than evaluating only final outputs [7]. In LLMs, such behaviors can be instantiated through prompt-level structure, motivating cognitive prompt- ing as a practical control interface over model reasoning.

Chunk 18

Motivated by these findings, we introduce CoFEE (Cognitive Feature Engineering Engine), an agent-based pipeline that is used to produce features (Fig. 1).

Chunk 19

Analogous to how GPTree structures LLM reasoning with decision trees, CoFEE struc- tures LLM reasoning upstream during feature discovery. Using CoFEE, we conduct a controlled empirical study comparing the same pipeline, with one version using cognitive feature discovery carried out by an agent using cognitive prompting, and a baseline using vanilla GPT-5.2 for feature discovery.

Chunk 20

In this work, we find that cognitive prompting consistently produces features with higher predictability scores. These findings suggest that prompting cognitive behaviors can serve as an effective design strategy for LLM-based feature engi- neering.

Chunk 21

II. REASONING CONTROL VIA COGNITIVE CONSTRAINTS We use the term reasoning control to describe how we constrain LLMs’ reasoning behavior, implemented in our framework through cognitive prompting.

Chunk 22

In machine learning terms, this corresponds to imposing structured inductive biases on the feature generation process. These constraints are imple- mented using prompts, enforcing discipline during generation.

Chunk 23

In CoFEE, reasoning control is implemented through four cognitive behavior constraints as described in Gandhi et al. [5]: • Backward chaining: Start from the desired outcome and reason backward.

Chunk 24

• Subgoal decomposition: Break down complex tasks into smaller, manageable steps. • Verification: Systematically check the results of each intermediate step.

Chunk 25

• Backtracking: Revise steps explicitly when they fail. III.

Chunk 26

DATASET To validate the effectiveness of inducing cognitive behaviors in LLMs, we use a dataset consisting of 1,000 founder profiles collected from publicly available sources. This dataset contains information about the founder and company (e.g., background, roles, sector, funding, etc.), and the success outcome.

Chunk 27

It consists of 400 (40%) successful and 600 (60%) unsuccessful founders. We classify ’successful’ founders as those that achieve an M&A or IPO valuation exceeding $500M, or raised more than $500M in total funding [8].

Chunk 28

A. Held-Out Feature Evaluation To assess whether these features discovered in this pipeline generalize beyond the data used for discovery, we carry out a held-out feature evaluation following standard ML practice.

Chunk 29

Rather than evaluating on the same data, we explicitly assess these features on a held-out dataset for feature evaluation. 1,000 founders are used for discovery, and we use another 1,000 founders (again with 40% successful) to assess feature performance.

Chunk 30

The discovery set is passed through the pipeline for discovery and refinement. The held-out evaluation set is never exposed to the feature discovery pipeline.

Chunk 31

Once discovery is completed, the features are frozen. These frozen features are then applied to the held-out evaluation set, where feature predictive quality is assessed.

Chunk 32

This ensures that no information in the held-out set influences feature discovery, allowing this evaluation to assess feature generalization. IV.

Chunk 33

PIPELINE OVERVIEW In this work, we present CoFEE, an agent-based pipeline that explicitly enforces structured cognitive behavior during feature discovery. While models such as GPT-5.2 already ex- hibit certain reasoning behaviors, previous studies suggest that these behaviors can be further strengthened and systematically induced through explicit prompting and structural constraints.

Chunk 34

With CoFEE, we explore the feasibility of leveraging these enforced cognitive qualities to enhance feature discovery compared to vanilla GPT-5.2. The pipeline (see Fig. 2) breaks this process down into three specialized agents (for discovery, consolidation, and scoring), each using GPT-5.2. Viewed as an ML system, CoFEE implements a generator–evaluator loop in which cognitively constrained reasoning guides feature hypothesis generation prior to evaluation.

Chunk 36

A. Agent 1: Feature Discovery

Agent 1 is the primary agent in this pipeline and is responsible for proposing candidate features based on the provided dataset. The agent receives structured prompts that explicitly induce cognitive behaviors, including backward chaining, subgoal decomposition, verification, and backtracking. Backward chaining is used to reason from the target outcome to features; subgoal decomposition structures feature discovery around high-level causal categories; verification enforces observability and non-proxy constraints; and backtracking records and rejects invalid reasoning paths.

Chunk 38

In the vanilla condition, Agent 1 proposes features without these cognitive constraints. Agent 1 processes the dataset 50 founders at a time, extracting features and adding them to a master list until the entire dataset has been analyzed.
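The batch loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `propose` is a hypothetical stand-in for a single Agent 1 call, and the record and feature dictionaries are assumed shapes. Only the batch size of 50 and the growing master list come from the text.

```python
from typing import Callable, Iterator

BATCH_SIZE = 50  # batch size stated in the text


def batched(records: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def discover_features(
    founders: list[dict],
    propose: Callable[[list[dict]], list[dict]],
) -> list[dict]:
    """Run a stateless proposer over every batch and collect its output.

    `propose` is a hypothetical placeholder for one Agent 1 call: it sees a
    single batch and returns a list of candidate feature dictionaries.
    """
    master_list: list[dict] = []
    for batch in batched(founders):
        # Nothing is carried over between calls, mirroring the
        # "stateless" wording of the Agent 1 prompt.
        master_list.extend(propose(batch))
    return master_list
```

Duplicates are deliberately left in the master list in this sketch, since removing overlap is described as the job of Agent 2.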

Chunk 39

The initial prompts for Agent 1 of CoFEE and vanilla prompting appear below. Full prompts are provided in the Supplement.

Chunk 40

a) CoFEE Prompt (Agent 1):

"You are Agent 1: a stateless Feature Discovery agent.
You are given a batch of 50 founder records.
You do NOT know which founders are successful or unsuccessful.
You have NO memory of previous batches.

Your task is to propose candidate FEATURES that could plausibly distinguish successful from unsuccessful founders.

You are performing Cognitive Feature Reasoning.
You must explicitly apply the following cognitive behaviors.
You must produce structured outputs and make explicit decisions.

--------------------------------
1. BACKWARD CHAINING
--------------------------------
Start from system-level success or failure.

For each proposed mechanism:
- State the causal hypothesis explicitly.
- Explain why this mechanism would operate *before* success.
- Map the mechanism to at least one measurable or inferable quantity available in the dataset.

You may reason about hidden variables, but the final feature MUST be:
- observable pre-success
- expressible in deterministic logic

If a mechanism cannot be mapped to an observable feature, abandon it.

--------------------------------
2. SUBGOAL SETTING
--------------------------------
Organize exploration into NO MORE THAN 4 subgoals chosen from:
- founder capability formation
- team coordination and complementarity
- market structure and constraints
- early execution dynamics

For each subgoal:
- List candidate mechanisms.
- Maintain the hierarchy: system behavior → mechanism → feature.

If a subgoal:
- collapses into a proxy
- fails observability
- has ambiguous causal direction
then explicitly ABANDON or REVISE the subgoal and explain why.

--------------------------------
3. VERIFICATION
--------------------------------
For each proposed feature, verify:
- it is observable before the success outcome
- it encodes a plausible causal mechanism
- it is not a prestige-based, descriptive, or post-outcome proxy

For each feature, list:
- potential bias sources
- uncertainty or ambiguity

If verification fails, reject the feature.

--------------------------------
4. BACKTRACKING
--------------------------------
Explicitly record every abandoned reasoning path.

For each abandoned path, record:
- why it initially seemed promising
- which constraint caused rejection (proxy risk, leakage, observability failure, causal ambiguity)

Use these abandoned paths to bias future exploration away from similar dead ends.
--------------------------------

b) Vanilla Prompt (Agent 1):

"You are Agent 1: a stateless Feature Discovery agent.
You are given a batch of 50 founder records.
You do NOT know which founders are successful or unsuccessful.
You have NO memory of previous batches.

Your task is to propose candidate FEATURES that could plausibly distinguish successful from unsuccessful founders.

Features must be observable pre-success
You must produce structured outputs and make explicit decisions."

B. Agent 2: Feature Consolidation

Agent 2 identifies semantically overlapping features and merges those that represent the same underlying mechanism, as illustrated in Fig. 3. Merging decisions are conservative and preserve feature provenance to maintain interpretability.

Chunk 56

This step reduces redundancy while ensuring that distinct causal mechanisms remain separate.

Fig. 3. Diagram illustrating the Agent 2 process, in which semantically similar features are compared, their similarity is justified, and the features are combined.

Chunk 58

C. Agent 3: Scoring

Agent 3 evaluates features by comparing feature names and definitions against founder records, in batches of up to 100 features and 1,000 founders at a time.

Chunk 59

When a feature matches a founder, the founder is tagged with that feature. After all founders have been tagged, Agent 3 outputs a JSON file recording feature assignments.
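As a rough sketch of how such a tag file could be reduced to per-feature counts, the snippet below tallies, for every feature, how many successful and how many unsuccessful founders carry it (the n1 and n0 used in the statistics that follow). The JSON layout (founder ID mapped to a list of feature names) and the `outcomes` mapping are assumptions made for illustration; the paper does not specify the file format.

```python
import json
from collections import Counter


def count_feature_hits(
    assignments_json: str,
    outcomes: dict[str, int],
) -> dict[str, tuple[int, int]]:
    """Return {feature: (n1, n0)} from a tag file and known outcomes.

    Assumed (hypothetical) layout of the tag file:
        {"<founder_id>": ["<feature_name>", ...], ...}
    `outcomes` maps founder_id to 1 (successful) or 0 (unsuccessful).
    """
    assignments = json.loads(assignments_json)
    n1: Counter = Counter()  # successful founders exhibiting the feature
    n0: Counter = Counter()  # unsuccessful founders exhibiting the feature
    for founder_id, features in assignments.items():
        target = n1 if outcomes[founder_id] == 1 else n0
        for feature in set(features):  # count each founder at most once
            target[feature] += 1
    return {name: (n1[name], n0[name]) for name in set(n1) | set(n0)}
```

Because this step involves only counting, it stays deterministic even though the tagging itself is produced by an LLM.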

Chunk 60

This output is then used to deterministically compute feature statistics. Let n1 denote the number of successful founders exhibiting a given feature and n0 the number of unsuccessful founders exhibiting it. We define the success-rate delta (∆SR) as the difference between the success probability conditioned on feature presence and the success probability conditioned on feature absence:

\[
\Delta \mathrm{SR} = \frac{n_1}{n_1 + n_0} - \frac{N_1 - n_1}{(N_1 - n_1) + (N_0 - n_0)} \tag{1}
\]

where N1 and N0 denote the total numbers of successful and unsuccessful founders in the dataset, respectively.

V. EXPERIMENT

We evaluate the impact of cognitively structured prompting on feature discovery using a controlled experimental setup. We compare two feature discovery conditions:

• Cognitive Prompting: where Agent 1 is constrained to explicitly apply the cognitive behaviors: backward chaining, subgoal decomposition, verification, and backtracking.
• Vanilla Prompting: where Agent 1 is prompted to propose features without explicit cognitive constraints.

The remaining pipeline components (Agents 2–3) are held constant across the two conditions.

Chunk 64

This ensures that any observable differences can be attributed to differences in Agent 1 prompting.

A. Evaluation Metric

For each feature, we compute a success-rate delta (∆SR), which measures the difference in success rates between founders who exhibit the feature and those who do not, as defined in Eq. (1). Larger values indicate stronger discriminative power.

Chunk 66

This metric allows us to quantify how effectively a given feature set separates successful founders from unsuccessful ones. To evaluate whether cognitive prompting improves feature quality, we compare the top ten features ranked by ∆SR for each experimental condition.
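The metric and the top-ten ranking can be sketched in a few lines. The function below follows Eq. (1) term by term; the function names and the `counts` mapping are illustrative choices, not part of the paper.

```python
def delta_sr(n1: int, n0: int, N1: int, N0: int) -> float:
    """Success-rate delta as in Eq. (1).

    n1, n0: successful / unsuccessful founders exhibiting the feature.
    N1, N0: total successful / unsuccessful founders in the dataset.
    """
    with_feature = n1 + n0
    without_feature = (N1 - n1) + (N0 - n0)
    if with_feature == 0 or without_feature == 0:
        raise ValueError("undefined when the feature is present for all or none")
    return n1 / with_feature - (N1 - n1) / without_feature


def top_k(
    counts: dict[str, tuple[int, int]],
    N1: int,
    N0: int,
    k: int = 10,
) -> list[tuple[str, float]]:
    """Rank features by delta_sr, highest first, and keep the first k."""
    scored = [
        (name, delta_sr(n1, n0, N1, N0)) for name, (n1, n0) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

With the 400/600 split of successful and unsuccessful founders, `delta_sr(82, 37, 400, 600)` evaluates to about 0.328, matching the first row of Table I.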

Chunk 67

By examining differences in these top-ranked features, we assess whether cognitive prompting leads to systematically stronger or more discriminative features.

VI. RESULTS

A. Feature Discovery Results

To ensure statistical stability and modeling relevance, we restrict analysis to features with marginal support ≥ 10% of the sample (n ≥ 100).

Chunk 69

For a binomial proportion with n = 100, the maximum standard error (attained at p = 0.5) is 0.05. This threshold filters high-variance rare features that are unlikely to generalize out-of-sample and concentrates evaluation on predictors with meaningful population coverage.
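The standard-error bound and the support filter can be checked with a short sketch. The helper names and the integer-percent parameter are illustrative assumptions; the 10% threshold and the sample size of 1,000 come from the text.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a binomial proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)


def has_min_support(n1: int, n0: int, total: int, min_percent: int = 10) -> bool:
    """True if the feature is present in at least `min_percent`% of founders.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return (n1 + n0) * 100 >= min_percent * total


# p * (1 - p) peaks at p = 0.5, so this is the worst case for n = 100.
worst_case = binomial_se(0.5, 100)  # approximately 0.05
```

For example, a feature with n1 = 67 and n0 = 33 sits exactly on the threshold for a sample of 1,000 and is retained, while one with 99 tagged founders would be dropped.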

Chunk 70

The top 10 features discovered using CoFEE (cognitive prompting) are shown in Table I. The top 10 features discovered using vanilla prompting are shown in Table II.

Chunk 71

Finally, the side-by-side comparison of features discovered by CoFEE and vanilla prompting is shown in Table III. Further metric comparisons are presented in the Appendix.

Chunk 72

TABLE I
TOP COFEE FEATURES RANKED BY SUCCESS-RATE DELTA (∆SR)

Feature ID                               n1    n0    ∆SR
top_university_education_flag            82    37    0.328
education_top10_qs_flag                  67    33    0.300
highest_degree_level                     304   292   0.272
education_top50_qs_flag                  70    41    0.259
technical_background_flag                190   155   0.230
functional_role_diversity                241   222   0.224
job_count_total                          254   241   0.224
job_tenure_longest_bucket                284   289   0.224
cross_industry_breadth_count             252   239   0.222
functional_breadth_score                 245   229   0.222

TABLE II
TOP VANILLA FEATURES RANKED BY SUCCESS-RATE DELTA (∆SR)

Feature ID                               n1    n0    ∆SR
has_top10_qs_education                   81    34    0.344
data_completeness_score                  281   277   0.234
max_role_seniority_level                 269   269   0.216
education_qs_rank_best_numeric           217   201   0.205
has_senior_executive_role                209   192   0.202
functional_background_primary            282   302   0.199
role_seniority_score_max                 262   273   0.193
founder_tenure_years_estimate            144   122   0.193
max_seniority_level                      263   275   0.192
has_any_industry_match_prior_job_flag    116   97    0.184

TABLE III
COMPARISON OF FEATURE QUALITY AND EFFICIENCY BETWEEN COFEE AND VANILLA PROMPTING. FEATURE QUALITY IS EVALUATED USING SUCCESS-RATE DELTA (∆SR) ON HELD-OUT DATA.

Metric                     CoFEE    Vanilla
Mean ∆SR (Top-10)          0.250    0.217
Median ∆SR                 0.227    0.204
Total Features Generated   157      222
Total Cost (USD)           $8.54    $18.29

B. Feature Discovery Comparison and Interpretation

Overall, CoFEE produces a broader distribution of highly predictive features: its ∆SR is higher than that of vanilla prompting at nine of the ten ranks, the exception being the single top-ranked feature (0.328 for CoFEE versus 0.344 for vanilla).

Chunk 74

VII. DISCUSSION

The results of this study suggest that explicitly structuring cognitive prompts in an LLM, specifically GPT-5.2, for feature discovery has a measurable impact on the quality of the resulting features. When cognitive behaviors such as backward chaining, subgoal decomposition, verification, and backtracking are enforced, the discovered features achieve higher predictability scores than those generated via vanilla prompting. Because all other pipeline components are held constant, the observed improvements are consistent with an effect of the cognitive structure imposed on feature discovery. This indicates that cognitive prompting provides a structured constraint on LLM feature generation behavior.

Comparing costs across the two approaches, the CoFEE evaluation incurred a total cost of $8.54, whereas the vanilla evaluation cost $18.29.

Chunk 77

The higher cost of the vanilla approach is attributable to the larger number of features it produced (222 features), compared to CoFEE’s 157 features. The increased feature count in the vanilla setting resulted in a greater number of API calls during the feature scoring and merging stages.
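The relative figures quoted in the paper (15.2% higher mean ∆SR, 29% fewer features, 53.3% lower cost) follow directly from Table III. The short check below recomputes them; the helper name is an illustrative choice.

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Values copied from Table III (CoFEE first, vanilla as the baseline).
mean_dsr_gain = relative_change(0.250, 0.217)  # mean top-10 delta SR
feature_change = relative_change(157, 222)     # total features generated
cost_change = relative_change(8.54, 18.29)     # total cost in USD

print(f"{mean_dsr_gain:+.1%} {feature_change:+.1%} {cost_change:+.1%}")
# prints: +15.2% -29.3% -53.3%
```

The roughly 29% drop in feature count is smaller than the 53.3% drop in cost, so cost does not scale one-to-one with the number of features in this comparison.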

Chunk 78

These results suggest that inducing cognitive behaviors via prompting functions as an effective inductive bias in LLM-based feature discovery systems.

Despite the empirical improvements observed, important limitations remain. First, the results are evaluated on a single domain and do not establish generalization beyond this setting. Second, ∆SR captures empirical predictability differences but does not directly assess downstream task performance. Third, the robustness of these effects across alternative models, prompt formulations, and scales has not yet been evaluated. Future work will evaluate CoFEE across multiple domains, measure downstream model performance when incorporating discovered features, and examine robustness across different model architectures and prompt structures.

Chunk 81

VIII. CONCLUSION

CoFEE provides empirical evidence that explicit reasoning control can improve LLM-based feature discovery while reducing costs by 53.3% in our evaluated setting.

Chunk 82

By inducing structured cognitive behaviors, the pipeline produces features with higher ∆SR values obtained at lower computational cost relative to the vanilla prompt baseline. These findings suggest that reasoning control may serve as a practical pipeline design strategy for LLM-based analytical systems.

Chunk 83

REFERENCES

[1] S. Xiong, Y. Ihlamur, F. Alican, and A. O. Yin, “GPTree: Towards explainable decision-making via LLM-powered decision trees,” 2024.
[2] J. Heaton, “An empirical analysis of feature engineering for predictive modeling,” 2016.
[3] Y. Wang, S. Wu, Y. Zhang, W. Wang, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,” 2025.
[4] J. Wu, M. Feng, S. Zhang, F. Lv, R. Jin, F. Che, Z. Wen, and J. Tao, “Boosting multimodal reasoning with automated structured thinking,” 2025.
[5] K. Gandhi, A. Chakravarthy, A. Singh, N. Lile, and N. D. Goodman, “Cognitive behaviors that enable self-improving reasoners: Four habits of highly effective STaRs,” 2025.
[6] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” 2022.
[7] Y. Zhuang, Q. Liu, Z. A. Pardos, P. C. Kyllonen, J. Zu, Z. Huang, S. Wang, and E. Chen, “Position: AI evaluation should learn from how we test humans,” 2024.
[8] R. Chen, J. Ternasky, A. S. Kwesi, B. Griffin, A. O. Yin, Z. Salifu, K. Amoaba, X. Mu, F. Alican, and Y. Ihlamur, “VCBench: Benchmarking LLMs in venture capital,” 2025.

Chunk 115

IX. APPENDIX

This appendix includes two tables comparing feature quality metrics for the two discovery approaches. Precision denotes the conditional success probability among founders exhibiting the feature, P(Y = 1 | f = 1). Support represents the proportion of the dataset exhibiting the feature, (n1 + n0)/(N1 + N0), reflecting population coverage. Lift measures relative improvement over the baseline success rate and is defined as Precision/0.4, where 0.4 is the global base rate of success in the dataset.

TABLE IV
TOP COFEE FEATURES WITH EXTENDED METRICS

Feature   n1    n0    Precision   ∆SR     Lift    Support
F1        82    37    0.689       0.328   1.723   0.119
F2        67    33    0.670       0.300   1.675   0.100
F3        304   292   0.510       0.272   1.275   0.596
F4        70    41    0.631       0.259   1.577   0.111
F5        190   155   0.551       0.230   1.377   0.345
F6        241   222   0.521       0.224   1.301   0.463
F7        254   241   0.513       0.224   1.283   0.495
F8        284   289   0.496       0.224   1.239   0.573
F9        252   239   0.513       0.222   1.283   0.491
F10       245   229   0.517       0.222   1.292   0.474

TABLE V
TOP VANILLA FEATURES WITH EXTENDED METRICS

Feature   n1    n0    Precision   ∆SR     Lift    Support
F1        81    34    0.704       0.344   1.761   0.115
F2        281   277   0.504       0.234   1.259   0.558
F3        269   269   0.500       0.216   1.250   0.538
F4        217   201   0.519       0.205   1.298   0.418
F5        209   192   0.521       0.202   1.303   0.401
F6        282   302   0.483       0.199   1.207   0.584
F7        262   273   0.490       0.193   1.224   0.535
F8        144   122   0.541       0.193   1.353   0.266
F9        263   275   0.489       0.192   1.222   0.538
F10       116   97    0.545       0.184   1.362   0.213

X. SUPPLEMENTARY

To ensure reproducibility and structured evaluation of feature discovery outputs, we provide the exact JSON schema used by Agent 1. This template captures not only the final feature specification, but also the intermediate reasoning structure (subgoals, causal mechanisms, and abandoned ideas), enabling inspection of reasoning control effects.

Chunk 119

    {
      "batch_id": "string",
      "features": [
        {
          "feature_name": "string",
          "subgoal": "string",
          "causal_mechanism": "string",
          "definition": "string",
          "computation_logic": "string",
          "abandoned_ideas": [
            {
              "idea": "string",
              "reason": "string"
            }
          ]
        }
      ]
    }
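One way to use this template downstream is to check each Agent 1 response against it before the features enter the master list. The sketch below is an assumption about how such a check could look, not the authors' code: the key names come from the schema above, while the function name and error messages are hypothetical.

```python
import json

# Key names taken from the schema above.
FEATURE_KEYS = {
    "feature_name",
    "subgoal",
    "causal_mechanism",
    "definition",
    "computation_logic",
    "abandoned_ideas",
}
IDEA_KEYS = {"idea", "reason"}


def validate_agent1_output(raw: str) -> dict:
    """Parse one Agent 1 response and check it against the template.

    Raises ValueError on the first structural problem; returns the parsed
    dictionary otherwise. Only structure is checked, not content.
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("batch_id"), str):
        raise ValueError("batch_id must be a string")
    features = data.get("features")
    if not isinstance(features, list):
        raise ValueError("features must be a list")
    for index, feature in enumerate(features):
        missing = FEATURE_KEYS - set(feature)
        if missing:
            raise ValueError(f"feature {index} is missing {sorted(missing)}")
        for idea in feature["abandoned_ideas"]:
            if IDEA_KEYS - set(idea):
                raise ValueError(f"feature {index} has a malformed abandoned idea")
    return data
```

Rejecting malformed responses early keeps the abandoned-idea records intact, which is what makes the backtracking behavior inspectable afterwards.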
