Research Papers

To See The Unseen, On The Generalization Ability Of Transformers In Symbolic Reasoning

Document ID: research-papers-to-see-the-unseen-on-the-generalization-ability-of-transformers-in-symbolic-reasoning

Full content

--- Page 1 --- Preprint. Under review. To See the Unseen: on the Generalization Ability of Trans- formers in Symbolic Reasoning Nevena Lazi´c Liam Fowl András György Csaba Szepesvári Abstract We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reason- ing problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not ob- served during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and pro- vides a mechanistic explanation for the effectiveness of existing heuristic interventions like "active forgetting", which periodically reset the token (un)embeddings. Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves general- ization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems. Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the corre- lated embeddings of these tokens are a poor initialization for finetuning applications. 1 Introduction Modern transformer-based large language models (LLMs) (Vaswani et al., 2017) trained using next-token prediction have achieved significant success across a wide range of tasks, including benchmarks requiring formal reasoning such as mathematics and science. At the same time, researchers continue to uncover simple ways in which transformers fail on reasoning problems – for example, by changing names and numerical values or inserting irrelevant information (Jiang et al., 2024; Mirzadeh et al., 2024). This fragility suggests that LLMs are not yet capable of fully general reasoning and can rely on superficial patterns to produce answers. In this work, we investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning. Specifically, we study the models’ ability to execute in-context algorithms on inputs containing arbitrary new tokens as symbols (see Figure 1 for an example). Thus, the models need to generalize solely based on the problem structure and not based on the content of token embeddings. This problem was studied by Boix-Adserà et al. (2024) in the neural tangent kernel (NTK) regime, with only the (un)embedding1 parameters trainable. They showed that transformers can learn to reason with unseen tokens given 1We consider transformers with tied embedding and unembedding (first- and last-layer, resp.) matrices, which we will refer to as the (un)embedding parameters. Here the same parameter matrix is used for the embedding and unembedding layers, the latter being a transpose of the former. We will refer to these parameters as the (un)embedding. This architecture choice is used in Boix-Adserà et al. (2024) and also commonly used in practice, for example in the Gemma 3 models (Gemma Team et al., 2025). 1 arXiv:2604.21632v1 [cs.AI] 23 Apr 2026 --- Page 2 --- Preprint. Under review. Rules: If A then B. Facts: A. Query: B ? Reasoning: Facts: A. If A then B. Facts: A, B. Answer: yes Rules: If A then U. Facts: A. Query: U? Rules: If U1 then U2. Facts: U1. Query: U2? Figure 1: Left: An example of a propositional logic problem with reasoning used for training. Given a set of rules (definite clauses), a set of facts (true zero-order predicates), and a query predicate, the goal is to compute the truth value of the query. The reasoning trace (shown above) executes the forward-chaining algorithm, and stops if the query is proven true. Right: Two types of test examples. At test time, we use unseen symbols (denoted here by U, U1, U2) for either just the query variable, or for all variable names. highly diverse training data, but fail at tasks that require copying (or generating) unseen tokens, and proposed a simple architecture change to improve copying capabilities. Anand et al. (2025) studied generalization to unseen tokens empirically and found that periodically resetting the embeddings (or "active forgetting") during training helps, but did not provide an explanation beyond not storing information in the embeddings. Furthermore, they did not consider tasks requiring symbolic copying, which remains difficult with their approach. Our work identifies a key issue in generalization to unseen tokens that has been overlooked by previous work. We find that the (un)embeddings of the unseen tokens collapse to nearly the same vector during the course of training. This leads transformers to struggle when unseen tokens are used to represent different variable names, as the corresponding variables become nearly indistinguishable. We prove theoretically that this collapse is inevitable when training transformers using SGD with weight decay and layernorm, and provide empirical evidence of it occurring in practice. Our work shows that the architecture change proposed by Boix-Adserà et al. (2024) is ineffective on its own in the presence of multiple unseen variables. The identified embedding collapse also explains the usefulness of the periodic resetting strategy of Anand et al. (2025). We also show that reliable symbolic reasoning can be achieved using a combination of a copy-enabled architecture, high symbolic diversity in the data, and either freezing or periodically resetting (un)embeddings to prevent collapse. We validate our findings using extensive controlled experiments on a testbed of synthetic propositional logic problems similar to the examples in Figure 1. Beyond experiments on synthetic data, we also investigate the implications of our findings on larger open-weights models. We find that (un)embedding collapse is also present in the Gemma 3 model family (Gemma Team et al., 2025), which has 99 deliberately unused tokens for downstream use. We demonstrate empirically that the correlated embeddings of these tokens are a poor initialization for downstream finetuning applications. 2 Related Work Symbolic reasoning Several previous works have demonstrated that transformers suffer from token bias (Jiang et al., 2024): perturbing certain input tokens can induce predictable changes in the LLM output. Jiang et al. (2024) show that changing names, inserting celebrity references and irrelevant content, and replacing quantifiers with synonyms degrades LLM performance on logical fallacies. Mirzadeh et al. (2024) introduce a version of the GSM8k mathematical reasoning benchmarks with modified numerical values and inserted irrelevant information. McCoy et al. (2024a;b) show that the accuracy of LLMs is heavily influenced by the likelihood of task formulations, inputs, or outputs appearing in the training data. All these works suggest that transformers can rely on statistical shortcuts rather than performing true reasoning. Boix-Adserà et al. (2024) study the ability of transformers to reason with abstract symbols on relational tasks, in the neural tangent kernel (NTK) limit. They show that on regression-type tasks (computing a numerical label based on a symbol pattern), transformers can generalize to unseen tokens when trained on data with a large amount of symbolic substitution. On the other hand, they show that tasks involving generating an unseen token (such as copying from context) are considerably more difficult and may require architectural changes. Anand et al. (2025) define structural in-context learning as the ability 2 --- Page 3 --- Preprint. Under review. of a model to execute in-context learning (ICL) on arbitrary novel tokens, which is similar to symbolic reasoning. They study structural ICL on simple part-of-speech classification problems, and find that it is transient. They observe that it can be maintained with an active forgetting strategy (Chen et al., 2023), which resets the embedding matrix every k steps, and propose temporary active forgetting, where this is only done at the beginning of training. However, their work does not explain any mechanisms behind the failure. Copying architectures Multiple works explore neural architectures that enable copying of inputs, where attending to input tokens increases the probability of copying them. Ontanon et al. (2022) mix a distribution over input tokens with the standard transformer predictive distribution and find that this is helpful in solving compositional tasks that involve symbolic reasoning. Boix-Adserà et al. (2024) mix the copying component at the level of logits, and our work studies a similar architectural change. Similar copying mechanisms to these have previously been studied in sequence-to-sequence models, starting with the pointer networks of Vinyals et al. (2015) (see, e.g., See et al., 2017; Gu et al., 2016). In-context learning (ICL) Token bias can partially be attributed to models inappropriately using in-weights knowledge in place of context information. A recent line of works examines the effect of distributional properties of the training data on the emergence of ICL (Chan et al., 2022a; Reddy, 2023; Singh et al., 2023; Chan et al., 2025). These works find that ICL can emerge in transformers when there is a large number of classes and large within-class variation. Chan et al. (2022a;b) further observe that ICL and in-weight learning (IWL) can emerge simultaneously when the distribution over classes is Zipfian. Chan et al. (2025) propose a simplified model that uses a gating mechanism to choose between IWL and ICL predictors based on test error. Since the test error is unavailable, it is approximated by the best choice in hindsight based on the difference between the losses of the two predictors. Multiple works have noticed that ICL can become transient in an asymptotic training regime (Chan et al., 2022a;b; 2025; Panwar et al., 2024). In the setting of ICL for ridge regression, Raventós et al. (2023) show that models trained on a small number of tasks behave like the Bayesian estimator over the training-task distribution, but is able to generalize to unseen tasks (implementing essentially ridge regression) when the number of training tasks is sufficiently large. He et al. (2024) study ICL for modular arithmetic and find that models transition from in-distribution to out-of-distribution generalization as the number of pretraining tasks increases. A common observation in these papers is that data diversity is helpful for ICL, and we observe the same in our problems. Chain-of-thought (CoT) reasoning CoT reasoning enables LLMs to generate intermediate computational steps prior to finalizing the answer to a question. CoT has played a key role in LLMs achieving exceptional performance on tasks such as mathematical problem-solving and code generation (Chowdhery et al., 2023; Achiam et al., 2023; Anil et al., 2023; Trinh et al., 2024; Romera-Paredes et al., 2024). Here, we study problems in which CoT reasoning requires computing quantities based on previously unseen symbolic variables, as well as referring to unseen symbolic variables by copying them. Randomized embeddings Randomized position embeddings have been shown to help with length generalization in arithmetic tasks (Ruoss et al., 2023; Shen et al., 2023). This is because conventional positional embeddings are out-of-distribution when increasing the sequence length beyond the training-time window. Our work shows that using randomized embeddings for all tokens can be helpful in generalization to unseen tokens. 3 Transformer architecture Given a vocabulary V, a decoder-only transformer with parameters θ maps a sequence of tokens (x1, . . . , xn) ∈Vn to a probability distribution over V, denoted by pθ(·|x1, . . . , xn). Let X be an n × |V| matrix of stacked indicator vectors for the input tokens. The architecture of an L-block transformer with a single attention head per block is as follows.2 The tokens 2We describe a single attention head for simplicity of exposition. Our empirical results rely on multi-head attention. 3 --- Page 4 --- Preprint. Under review. are first embedded into a d-dimensional space using the embedding/unembedding matrix WE as Y0 = XWE. The L transformer blocks have the following form, defined recursively over the transformer blocks i = 0, 1, 2, . . . , L −1: Y0 = XWE, Ai = attni(layernorm(Yi)), Yi+1 = Yi + Ai + MLPi(layernorm(Yi + Ai)) where • layernorm (Ba et al., 2016) normalizes the per-example activations of a layer to have mean and standard deviation close to 0 and 1, respectively, and optionally rescales them by a learned parameter; • MLPi is a feed-forward network mapping its input as MLPi(Y) = max(0, YW(i) 1 + b(i) 1 )W(i) 2 + b(i) 2 , where W(i) 1 , b(i) 1 , W(i) 2 , and b(i) 2 are trainable; • attni computes a linear projection of the inputs into query, key, and value matrices using trainable parameters W(i) Q , W(i) K , W(i) V , respectively. For each token, the query and key are used to compute a weighted combination of the values of the preceding tokens: attni(Y) = softmax(M ⊙(YW(i) Q W(i)⊤ K Y⊤)/ √ d)YW(i) V W(i) O where M is a causal mask and W(i) O is the output projection matrix. We incorporate position information by applying RoPE (Su et al., 2024) to the query and key matrices. The next-token probabilities are computed as p(·|X) = softmax(layernorm(YL)W⊤ E ). This is an n × |V| matrix, whose last row gives the probability of the (n + 1)st token given x1, . . . , xn. Note that we assume that the embedding and unembedding matrices are tied (WE and W⊤ E , resp.). Copy attention heads As discussed earlier, copying symbols from the input to the output is often needed when reasoning about in-context information (such as in the logic tasks of Figure 1). Unfortunately, this can be challenging in the case of unseen tokens that models have been trained not to generate. Here we describe a minor architecture change that is helpful in copying both seen and unseen symbols. Let Aj denote the jth row of a matrix A. Assuming that the rows of WE are almost orthogonal and almost normalized, a transformer will copy a token xi to position j + 1 if layernorm(YL)j ≈WE,xi. However, it may be difficult for the transformer to propagate an arbitrary input embedding xi through L layers involving MLPs. We thus augment the architecture with a shortcut to the input embeddings, which we name copy attention, implemented as follows: copyattn(Y) = softmax(M ⊙(YWQW⊤ K Y⊤)/ √ d)XWE. Compared to standard attention, we replace the value and output projection YWVWO with the input embeddings XWE. Thus, for each i, the ith row of copyattn(Y) will be a weighted combination of the input embeddings of tokens x1, ..., xi. We now compute the next-token distribution as Z = copyattn(YL−1), pcopy(·|X) = softmax(layernorm(YL + Z)W⊤ E ). Thus, ignoring layernorm, if the embeddings WE are approximately orthonormal and if the copy head sharply attends to a token, it will increase its output probability. Empirically we observe similar benefits of copy attention with and without output layernorm. The described copy attention is a modified version of the architecture change proposed by Boix-Adserà et al. (2024) for a single-layer transformer without MLPs, which outputs attn(X) = softmax(M ⊙(XWEWQW⊤ K W⊤ E X⊤)/ √ d)XWE(WVWO + bI), where b is a train- able scalar and I is the identity matrix. Effectively, this change adds the scaled input embeddings bXWE to the standard value-output parameters WVWO. The main difference in our implementation is that we use separate attention heads for copying (i.e. additional query and key parameters). This enables us to keep the MLPs in the standard attention blocks while not applying them to copy attention, and to trade-off between copying and standard generation more flexibly than via a single parameter. 4 --- Page 5 --- Preprint. Under review. 4 Collapse of softmax weights for unseen labels As we will demonstrate, transformers struggle with symbolic reasoning that requires copy- ing unseen tokens, and one of the reasons for this failure is the collapse of the unembeddings corresponding to those tokens. In this section we provide some insight into this phenomenon. Specifically, we show that once the model fits the data sufficiently well (to be specified next), each step of ℓ2-regularized gradient descent (GD) or stochastic gradient descent (SGD) will decrease the distance between the unembeddings of unseen tokens. In what follows, we consider the general problem of training a softmax classifier on a dataset D = {(xn, yn)}N n=1 of inputs x ∈Rd′ and labels y ∈[K]. Let ϕ(·|θ) : Rd′ →B(r, d) be a neural network, parameterized by θ, mapping the input features to a d-dimensional ball of radius r, B(r, d) = {ψ ∈Rd : ∥ψ∥≤r}. We will use the shorthand ϕn := ϕ(xn|θ) and ϕt n := ϕ(xn|θ(t)), where θ(t) are the parameters at training step t. We consider a model in which the probability of a label is given by a softmax: p(y|x, W, θ) = exp(ϕ(x|θ)⊤Wy) ∑K k=1 exp(ϕ(x|θ)⊤Wk) (1) where Wk are the weights corresponding to class k. The parameters (W, θ) are optimized by minimizing the ℓ2-regularized negative log-likelihood with some regularization coefficient λ > 0: Lλ(W, θ|D) = −1 N N ∑ n=1 ln p(yn|xn, W, θ) + 0.5λ∥W∥2 F The gradient of the loss with respect to parameters Wi is gi := 1 N ∑N n=1(p(i|xn, W, θ) −I[i = yn])ϕn + λWi, where I[i = yn] is a label indicator function. 4.1 Contraction of unseen-label weights for GD/SGD Let W(t) denote the softmax weights at step t. Suppose that none of the examples in the dataset have the label i or label j. The following lemma shows that in this case the parameter vectors Wi and Wj may contract: Lemma 4.1. Suppose that we update the weights of the softmax classifier in Eq (1) using gradient descent with learning rate schedule ηt. If labels i and j are not present in the data, then the weights at time t satisfy W(t+1) j −W(t+1) i ≤ 1 −ληt + ηtr2pt W(t) j −W(t) i where pt := maxk∈{i,j},n∈[N] p(k|xn, W(t), θ(t)). The proof is given in Appendix A. The same result holds for stochastic gradient descent by replacing the full batch with a minibatch, and having pt be the maximum probability of any unseen class for any minibatch example. We note that the result holds regardless of whether the parameters θ(t) are fixed or changing over time, as long as the feature function ϕ(·|θ(t)) has bounded output. Thus, Lemma 4.1 applies, for example, to transformers with layernorm in the last layer. Lemma 4.1 implies that the weights for a pair of unseen labels will contract whenever r2pt < λ. Furthermore, if our learning rate ηt is bounded away from zero, and assuming that the classifier has sufficient capacity to fit the data so that the probability of unseen labels i, j approaches zero for training examples, we see the corresponding weight entries collapse. Thus we have shown that a step of ℓ2-regularized GD/SGD will contract the last-layer weights of unseen labels if the input features have bounded norm and the model fits the data well, even if the features are changing over time. This setting corresponds to modern deep neural networks which overfit training data and use layer normalization to ensure boundedness of the learned features. 5 --- Page 6 --- Preprint. Under review. 0 10000 20000 30000 40000 50000 steps 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 mean abs cos similarity seen tokens, cos sim = 0, no layernorm = 0, layernorm = 0.01, layernorm = 0.1, layernorm 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 mean abs cos similarity unseen tokens, cos sim Figure 2: Mean pairwise cosine similarities of the unembeddings for different hyperparame- ter choices (note the different scale of the vertical axis in the two graphs). In the presence of layernorm, the unembeddings of unseen tokens collapse to nearly the same-direction vector (also note that the actual magnitudes are also very close). This is consistent with Lemma 4.1 which requires the inputs to the softmax function to be bounded (i.e., normalized). Collapse is more severe for higher regularization λ. 4.2 Empirical collapse of unseen-token (un)embeddings under AdamW While our analysis considers SGD, in practice transformers are often trained using the AdamW optimizer (Loshchilov et al., 2017), for which obtaining similar results is more involved. In this section we empirically demonstrate that similar collapse also happens when training transformers using AdamW. In particular, we train transformers on logic problems using a vocabulary with 100 deliberately heldout tokens (see Section 5 for full experimental details), using AdamW with different regularization strengths λ, with and without layernorm in the last layer. Figure 2 shows the mean cosine similarity between the (un)embeddings of seen tokens, and between the (un)embeddings of unseen tokens. We observe that unseen tokens are much more correlated than seen tokens, especially in the presence of layer normalization, even without regularization (λ = 0). 5 Experiments on symbolic propositional logic In this section we study how transformers can learn to reason in the context of proposi- tional logic problems, and demonstrate the issues and solutions highlighted earlier about generalization to unseen symbols. Propositional logic problem We create synthetic propositional logic problems involving only definite clauses (also known as Horn clauses). The goal is to determine the truth value of a query predicate given a set of rules (definite clauses, e.g., ’If A then B’) and facts (true zero-order predicates, e.g., ’A is true’). We generate random synthetic problems similarly to the rule-priority recipe of Zhang et al. (2023), but additionally include chain-of-thought (CoT) reasoning that can be used to derive the answer. The CoT is generated by running the forward-chaining algorithm, which iteratively applies the modus ponens rule to derive new facts until either the query predicate is proven or no further predicates can be proven true. If a predicate cannot be proven true, it is false. See Figure 1 for an example and data format. For logic problems involving only Horn clauses, modus ponens is known to be sound and complete: any predicate entailed by the rules and facts can be derived by iteratively applying modus ponens. The generated examples contain up to 20 predicates per question and up to 40 rules per question. We ensure balanced labels by selecting a true predicate as the query with probability 0.5. Model We train a small 10-layer decoder-only model (60M-75M parameters, depending on the number of predicates) from scratch on this data using the NanoDO library (Liu et al., 2024) and the AdamW optimizer (Loshchilov et al., 2017) with weight decay λ = 0.001. Rather than using a fixed dataset, we continually generate fresh examples on-the-fly during training. We only train the model to predict the tokens corresponding to the reasoning and 6 --- Page 7 --- Preprint. Under review. 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy trainable WE all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE, copy attn all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE, copy attn unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE, copy attn all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy frozen WE all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE, copy attn all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE, copy attn unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE, copy attn all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 50k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 10k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all predicates unseen Figure 3: Propositional logic evaluation accuracy for models trained on single-token symbols. We evaluate accuracy with (1) all tokens seen, (2) query unseen, (3) all predicates unseen. The rows correspond to the unembeddings being (1) trainable (2) frozen (3) periodically reinitialized every 100 steps throughout training, and (4) periodically reinitialized every 100 steps for the first 10k steps. We train each model with and without copy attention (right and left side of the figure). the answer, that is, we mask out the question tokens.3 We experiment both with standard architectures and copy attention, using a custom vocabulary to control the tokenization of the problems. See Appendix B for further details on the data, architecture, and training. Experiment settings In this setting, each logic predicate corresponds to a single token. The predicates for each problem are selected uniformly at random from a set P (and thus the size of P serves as a proxy for symbolic diversity). The vocabulary for each training task includes 100 unused tokens which are subsequently used as predicates during evaluation. We train models with and without copy attention, and with trainable, frozen, and periodically reinitialized (un)embeddings. We vary the size of the training predicate vocabulary ∥P∥∈{150, 600, 2400, 9600} as a proxy for diversity. We evaluate the accuracy of the trained models on a randomly-generated set of 2048 examples with (1) all tokens seen, (2) query unseen, and (3) all predicates unseen. Results and discussion The results are shown in Figure 3. All model versions perform well on in-distribution data (seen tokens). The standard transformer fails when one or more symbols are replaced by unseen tokens. Examining outputs reveals that this is because the model learns not to generate the unseen token (needed for reasoning). The model instead either ignores the token or replaces it with a seen predicate (see Figure 9 for some examples of generated CoT in this case). Adding copy attention and increasing diversity results in good performance on the evaluation with a single unseen token. In fact, early in the training, models with vocabulary size 600, 2400, and 9600 all generalize well and copy the unseen 3In our initial experiments, we found that training the model to predict the randomly-generated questions in addition to answers led to worse performance and increased hallucinations in the CoT. 7 --- Page 8 --- Preprint. Under review. token in their CoT, but only the model with |P| = 9600 keeps this inductive bias. This result is consistent with existing works on in-context learning (ICL), which show that a certain diversity threshold is necessary for generalization (Raventós et al., 2023; He et al., 2024), and that ICL is transient (Chan et al., 2022a; Panwar et al., 2024; Chan et al., 2025). Copy attention fails in the case where all symbolic variables are replaced by unseen tokens, regardless of diversity. This is explainable at least in part by the collapse of the unseen (un)embeddings, which makes these variables difficult to distinguish. Freezing the embeddings to their initialization during training (which are approximately orthogonal and hence well-distinguishable for different tokens) allows all models to gen- eralize to unseen tokens, with the copy-attention models achieving higher accuracy (or having a lower diversity threshold). Similarly, active forgetting (Chen et al., 2023) (reinitial- izing embeddings every 100 steps) generalizes to unseen tokens given sufficient diversity and/or an architecture with copy heads.4 These results further support our hypothesis that symbolic reasoning is impeded by learning not to emit unseen tokens and the unembedding collapse.5 Unfortunately, freezing and reinitializing are somewhat unsatisfactory solutions for general-purpose models, which typically benefit from trainable embeddings.6 Finally, we evaluated temporary active forgetting (Anand et al., 2025) (reinitializing for the first 10k steps of the training). We observed that it does not generalize well for the standard archi- tecture (no copy attention) – performance collapses once reinitialization stops. However, it does generalize well with the copy attention architecture for all but the smallest vocabulary P. While the unseen unembeddings again collapse in this setting (having a cosine similarity of about 0.9), the model with copy attention appears to attend to tokens sharply enough to distinguish different symbols. 6 Experiments on Gemma3 In this section, we investigate whether our findings related to the collapse of embeddings hold for larger (open-weights) models, namely the Gemma 3 series (Gemma Team et al., 2025). These models use tied embedding and unembedding parameters, as assumed throughout. Collapse of unused tokens The Gemma 3 tokenizer vocabulary is of size 262,144 and includes 99 deliberately unused tokens. For each instruction-tuned model size (1B, 4B, 12B, 27B), with embedding dimensions (1152, 2560, 3840, 5376) respectively, we compute the cosine similarities of the embeddings of the unused tokens and a randomly chosen subset of 100 used tokens. The results are shown in Figure 4: in all models, the unused tokens have a higher cosine similarity than the used ones. The mean cosine similarities are (0.78, 0.33, 0.23, 0.24) for the unused tokens and (0.09, 0.03, 0.02, 0.01) for the used tokens.7 Thus the effect of the collapse is smaller for larger models. Finetuning with unused tokens To investigate whether the collapse of unused tokens is an issue for downstream applications relying on such tokens, we finetune Gemma 3 1B IT on propositional logic data constructed similarly as before, with 80 symbols (predicates) corresponding to (1) unused tokens, (2) English adjectives, and (3) ASCII lowercase and uppercase letters and repeated letters (e.g., ’aa’). We verify that all symbols correspond to single tokens. We finetune all parameters using AdamW with a constant learning rate 5 · 10−5 and parameters β1 = 0.9, β2 = 0.999, ϵ = 10−8, λ = 10−4, using batch size of 128. We show evaluation accuracy on a holdout dataset of 512 examples (with the same 4We tried reinitializing every 100 and every 1k steps, for 10k steps or throughout training; see Appendix C for full experiments. 5We have also tried freezing / reinitializing only the embeddings of the predicate tokens. Contrary to our intuition, this yielded worse generalization compared to freezing / reinitializing all embeddings, and requires further investigation. 6See also Figure 10 in the Appendix E for some empirical evidence in this regard, where we compare models with trainable and frozen embeddings on the C4 corpus (Raffel et al., 2020). 7For the pretrained models, the mean cosine similarities are (0.79, 0.43, 0.33, 0.29) for the unused tokens and (0.06, 0.03, 0.02, 0.01) for the used tokens 8 --- Page 9 --- Preprint. Under review. Gemma3 1B IT Gemma3 4B IT Gemma3 12B IT Gemma3 27B IT 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 Figure 4: Cosine similarities between the embeddings of the 99 unused tokens and 100 randomly chosen tokens. In all models, the unused tokens are correlated more than the used ones, with the effect more pronounced the smaller the embedding dimension is. 0 1000 2000 3000 4000 5000 step 0.0 0.2 0.4 0.6 0.8 1.0 accuracy unused 0 1000 2000 3000 4000 5000 step 0.0 0.2 0.4 0.6 0.8 1.0 accuracy adjectives 0 1000 2000 3000 4000 5000 step 0.0 0.2 0.4 0.6 0.8 1.0 accuracy letters Figure 5: Accuracy when finetuning Gemma 3 1B IT on logic problems with predicates corresponding to unused tokens, English adjectives, and letters. Curves correspond to random seeds. With adjectives and letters, accuracy reaches 90% in about 500 finetuning steps, whereas with unused tokens reaching 90% accuracy requires about 5000 steps, which may be an artifact of embedding collapse. predicates as the trained model) in Figure 5 for 4 random training runs. We observe that training using unused tokens can be slow: the runs using adjectives and letters as symbols reach ∼90% accuracy in under 500 steps, while the runs involving unseen tokens take about 5000 steps to reach similar performance. The runs involving letters are the most stable, perhaps because these are more commonly used as variable names in the training data. Collapse of rare tokens during instruction tuning Next we investigate whether collapse of embeddings occurs during conventional instruction tuning of models. We compute the singular value decomposition (SVD) of the (un)embedding matrices of Gemma 3 models, normalized by their Frobenius norm. In Figure 13, we plot the difference between the sorted singular values of IT (instruction-tuned) and PT (pre-trained) models. In all models, the IT version places more mass on the top two singular values than the corresponding PT version, with the effect most pronounced in the 1B version. We observe that tokens whose cosine similarity increases between PT and IT versions of Gemma 3 1B often correspond to html tags and non-Latin script symbols; for example the cosine similarity between </h6> and );"> increases from -0.044 to 0.273. Thus it is plausible that these tokens are not included or rarely seen in the instruction tuning data, though we cannot definitively verify this claim. We speculate that this increase in correlation does not meaningfully degrade capabilities. 7 Discussion We have investigated the ability of transformers to reason with abstract symbols represented using previously-unseen tokens. We found that generalization to unseen tokens is hindered by an inductive bias that causes their (un)embeddings to collapse, and proposed methods for addressing this collapse. Our work resolves the issues with architectural interventions proposed by Boix-Adserà et al. (2024) in the presence of multiple unseen variables, and also helps explain the benefits of temporary active forgetting (Anand et al., 2025). We have also observed some evidence of collapse in the open-weight Gemma 3 models, and found that finetuning these models is considerably slower with unseen tokens. Some limitations of our work are that the theoretical analysis of the collapse includes additional assumptions and that most of our experiments are on small models and only on propositional logic problems. Interesting directions for future work include strategies for improving finetuning with collapsed unused tokens, as well as investigating reasoning with multi-token symbols (a preliminary study on the latter is given in Appendix F). 9 --- Page 10 --- Preprint. Under review. References Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. Suraj Anand, Michael A Lepori, Jack Merullo, and Ellie Pavlick. Dual process learning: Controlling use of in-context vs. in-weights strategies with weight forgetting. In The Thirteenth International Conference on Learning Representations, 2025. Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G Araújo, Alex Vitvitskyi, Razvan Pascanu, and Petar Veliˇckovi´c. Transformers need glasses! in- formation over-squashing in language tasks. Advances in Neural Information Processing Systems, 37:98111–98142, 2024. Enric Boix-Adserà, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, and Joshua M Susskind. When can transformers reason with abstract symbols? In The Twelfth Interna- tional Conference on Learning Representations, 2024. Bryan Chan, Xinyi Chen, András György, and Dale Schuurmans. Toward understanding in-context vs. in-weight learning. In The Thirteenth International Conference on Learning Representations, 2025. Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in neural information processing systems, 35:18878–18891, 2022a. Stephanie CY Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K Lampinen, and Felix Hill. Transformers generalize differently from information stored in context vs in weights. arXiv preprint arXiv:2210.05675, 2022b. Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Adelani, Pontus Lars Erik Saito Stenetorp, Sebastian Riedel, and Mikel Artetxe. Improving language plasticity via pretraining with active forgetting. Advances in Neural Information Processing Systems, 36:31543–31557, 2023. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016. Tianyu He, Darshil Doshi, Aritra Das, and Andrey Gromov. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. Advances in Neural Information Processing Systems, 37:13244–13273, 2024. Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016. 10 --- Page 11 --- Preprint. Under review. Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo J Taylor, and Dan Roth. A peek into token bias: Large language models are not yet genuine reasoners. arXiv preprint arXiv:2406.11050, 2024. Josh Kaufman. google-10000-english. urlhttps://github.com/first20hours/google-10000-english, 2021. [Accessed 2025-12-14]. Peter J. Liu, Roman Novak, Jaehoon Lee, Mitchell Wortsman, Lechao Xiao, Katie Everett, Alexander A. Alemi, Mark Kurzeja, Pierre Marcenac, Izzeddin Gur, Simon Kornblith, Kelvin Xu, Gamaleldin Elsayed, Ian Fischer, Jeffrey Pennington, Ben Adlam, and Jascha- Sohl Dickstein. Nanodo: A minimal transformer decoder-only language model imple- mentation in JAX., 2024. URL http://github.com/google-deepmind/nanodo. Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5:5, 2017. Alan Malek, Jiawei Ge, Nevena Lazi´c, Chi Jin, András György, and Csaba Szepesvári. Frontier llms still struggle with simple reasoning tasks. arXiv preprint arXiv:2507.07313, 2025. R Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D Hardy, and Thomas L Grif- fiths. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences, 121(41): e2322420121, 2024a. R Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D Hardy, and Thomas L Grif- fiths. When a language model is optimized for reasoning, does it still show embers of autoregression? an analysis of openai o1. arXiv preprint arXiv:2410.01792, 2024b. Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations, 2024. Santiago Ontanon, Joshua Ainslie, Zachary Fisher, and Vaclav Cvicek. Making transformers solve compositional tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3591–3607, 2022. Madhur Panwar, Kabir Ahuja, and Navin Goyal. In-context learning through the bayesian prism. In The Twelfth International Conference on Learning Representations, 2024. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli. Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems, 36:14228–14246, 2023. Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. arXiv preprint arXiv:2312.03002, 2023. Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024. Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers. arXiv preprint arXiv:2305.16843, 2023. Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083, 2017. 11 --- Page 12 --- Preprint. Under review. Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang. Posi- tional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737, 2023. Aaditya Singh, Stephanie Chan, Ted Moskovitz, Erin Grant, Andrew Saxe, and Felix Hill. The transient nature of emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 36:27801–27819, 2023. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural informa- tion processing systems, 30, 2017. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. Advances in neural information processing systems, 28, 2015. Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, and Guy Van den Broeck. On the paradox of learning to reason from data. In IJCAI, 2023. 12 --- Page 13 --- Preprint. Under review. A Proof of Lemma 4.1 Proof. Let pt i,n = p(i|xn, W(t), θ(t)). The weights satisfy W(t+1) j −W(t+1) i = (1 −ληt) W(t) j −W(t) i −ηt 1 N N ∑ n=1 pt j,n −pt i,n ϕt n (2) By the Lagrange mean-value theorem, exp(b) −exp(a) = exp(c)(b −a) for some c such that a < c < b. Assume without loss of generality that ϕt⊤ n W(t) j ≥ϕt⊤ n W(t) i . Setting a = ϕt⊤ n W(t) i −ln Zt(ϕt n) and b = xt⊤ n W(t) j −ln Zt(ϕt n), where Zt(ϕn) = ∑y exp(ϕ⊤ n W(t) y ) is the normalizer, we have that for some pt n such that min(pt i,n, pt j,n) ≤pt n ≤max(pt i,n, pt j,n) pt j,n −pt i,n = exp(ϕt⊤ n W(t) j −ln Zt(ϕt n)) −exp(ϕt⊤ n W(t) i −ln Zt(ϕt n)) = pt nϕt⊤ n W(t) j −W(t) i (3) Let pt = maxn pt n. Plugging (3) into (2) and using the fact that inputs lie in a ball of radius r gives the result. B Implementation details Architecture and training. Our implementation of transformer training and evaluation builds upon the NanoDO framework (Liu et al., 2024). For the simple logic experiments, we train a decoder-only transformer with 10 layers and embedding size 1024 split across 8 heads. With copy attention, we add a copying block with 8 heads. The architecture is as described in Section 3. We use MLPs with hidden size 1024 and GELU activations (Hendrycks & Gimpel, 2016). We use RoPE positional embeddings (Su et al., 2024). We train using the AdamW optimizer (Loshchilov et al., 2017) with decay 0.001, batch size 256, peak learning rate 0.0001, warmup and final rate of 0.00001, and 2000 warmup steps. We clip gradients by global norm 1. We use a custom tokenizer designed for logic data. For the single-token symbol experiments, the vocabulary includes |P| ∈{150, 600, 2400, 9600} zero-order predicates, the special tokens shown in Figure 6, pad token, and 100 additional unused tokens. The vocabulary size is further padded by unused tokens to a multiple of 4. For the multi-token symbol experiments, the vocabulary includes 26 baseline tokens and sequences of these tokens form predicates. For each predicate in each example, we sample its length i ∈[n] and tokens uniformly at random, while ensuring there are no duplicates. Facts EndFacts Rules EndRules Query EndQuery Answer Reasoning EndReasoning Newfact EndNewfact If then EndIf yes no and ? <BOS> <EOS> <PAD> <UNK> Figure 6: Special tokens in the vocabulary in addition to zeroth order predicates. Data. We generate propositional logic problems similarly to the rule-priority (RP) recipe described in Zhang et al. (2023). RP problems are generated by randomly sampling a subset of predicates, and randomly sampling facts and rules using those predicates. The facts and rules are then used to compute the truth values. To balance label probabilities, we select the query to be a true predicate with probability 0.5. The problems we generated had between 5 and 20 predicates and between 0 and 40 rules. The examples presented to the model are formatted as in Figure 7. We generated reasoning traces of the form in Figure 7 by running forward chaining until either the query predicate was proven true or no further predicates could be proven true, and generating text corresponding to the algorithm trace. 13 --- Page 14 --- Preprint. Under review. <BOS> Rules: If A and B then C EndIf If B then A EndIf Facts: B EndFacts Query: C ? Reasoning: Facts: B EndFacts If B then A EndIf Newfact A EndNewfact Facts: B and A EndFacts If A and B then C EndIf Newfact C EndNewfact Facts: B and A and C EndFacts Answer: yes EndAnswer <EOS> Figure 7: Formatting example for a propositional logic problem with reasoning used for training. The separator ’and’ is only used in the multi-token symbol experiments. 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 10k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (1k, 10k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 50k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (1k, 50k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k), copy all predicates unseen Figure 8: Propositional logic evaluation accuracy for models trained on single-token symbols with (temporary) active forgetting. C Additional experiments with temporary active forgetting We evaluate the temporary active forgetting approach of Anand et al. (2025) on logic problems with single-token symbols. Here the (un)embeddings are periodically reinitialized every k1 steps for the first k2 steps of training. We consider the following values of (k1, k2): (100, 10k), (1k, 10k), (100, 50k), (1k, 50k). Note that we train for 50k steps in total, so some of the settings correspond to the (non-temporary) active forgetting approach of Chen et al. (2023). The results are shown in Figure 8. We observe that temporary active forgetting does not load to the desired inductive bias (symbolic reasoning), unless combined with copy attention. Active forgetting throughout the training can generalize with the vanilla architecture, provided sufficiently high symbolic diversity. 14 --- Page 15 --- Preprint. Under review. Rules: If enchanting then UNK EndIf If perfect then silly EndIf [...] EndRules Facts: enchanting EndFacts Query: UNK EndQuery ? Reasoning: Facts: enchanting EndFacts If enchanting then perfect EndIf Newfact: perfect EndNewfact Facts: enchanting perfect EndFacts If enchanting then ugliest EndIf Newfact: ugliest EndNewfact Facts: enchanting perfect ugliest End- Facts If perfect then silly EndIf Newfact: silly EndNewfact Facts: enchanting perfect ugliest silly EndFacts EndReasoning Answer: no EndAnswer Rules: [...] EndRules Facts: hilarious perfect bored proud UNK EndFacts Query: UNK EndQuery ? Reasoning: Facts: hilarious perfect bored proud EndFacts EndReasoning Answer: no EndAnswer Figure 9: Examples of reasoning traces in the case of a single single-token symbolic variable. The model refuses to generate the unseen token in its reasoning trace, leading to CoT errors and eventually the wrong answer. Most mistakes occur when the true answer is ’yes’ and the model outputs ’no’. D Examples of model samples Figure 9 shows samples from the vanilla transformer trained on logic when the query (single-token) predicate is replaced by an unseen token. If the unseen token appears in rules, the model often replaces it by a seen token. If the unseen token appears in facts, the model often ignores it. Most model mistakes are false negatives. E Effect of freezing embeddings and copy attention on language modeling 0 20000 40000 60000 80000 100000 steps 3 4 5 6 7 8 9 10 C4 validation loss frozen WE 4 layers 8 layers 12 layers 16 layers 0 20000 40000 60000 80000 100000 steps 3 4 5 6 7 8 9 10 trainable WE 4 layers 8 layers 12 layers 16 layers 0 20000 40000 60000 80000 100000 steps 3 4 5 6 7 8 9 10 trainable WE, copy 4 layers 8 layers 12 layers 16 layers 0 20000 40000 60000 80000 100000 steps 0.0 0.2 0.4 0.6 0.8 trainable WE, vanilla - copy 4 layers 8 layers 12 layers 16 layers Figure 10: Evaluation loss on the C4 dataset with trainable and frozen embeddings and with copy attention. Freezing embeddings results in significantly higher validation loss compared to using trainable embeddings. Adding copy attention results in slightly smaller loss compared to the default architecture. To make the difference between the second and third figure visible, the last figure shows the difference in the validation loss of the second standard (vanilla) architecture and the one with the copy attention. In this section we train small transformers with 4, 8, 12, and 16 layers on the C4 dataset (Raffel et al., 2020), with trainable and frozen embeddings and with copy attention. We show the validation loss for all models in Figure 10. Freezing embeddings leads to considerably 15 --- Page 16 --- Preprint. Under review. worse evaluation loss and thus may not be a practical solution. Interestingly, copy attention has slightly better validation loss, especially early in the training and for small models. F Multi-token symbols Rules: If qwert, asdf then abcde. If zxcv then abcd. Facts: qwert, asdf. Query: abcd ? Figure 11: Stylized example where the multi-token query shares a prefix with another predicate. 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 accuracy train/test english n tokens, test n = 2 n = 3 n = 4 n = 5 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 train/test english n tokens, test, prefix 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 train/test english n + 1 tokens 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 train/test english n + 1 tokens, prefix Figure 12: Evaluation of models trained on symbols of up to n tokens on problems with symbols of length up to n, and length n + 1. ’prefix’ in figure title means that the query shares a prefix with another predicate, making them more easily confusable. While our work focuses primarily on unseen tokens, here we also conduct a small prelimi- nary study on unseen multi-token variables (where all tokens are seen, but not all combinations are seen). This case has not been studied systematically in literature to the best of our knowl- edge, but it is related to much existing works on perturbing reasoning benchmarks by changing names and numerical values and using synonyms (Jiang et al., 2024; Mirzadeh et al., 2024). While embedding collapse is no longer an issue here, we observe other prob- lems, including length generalization and difficulty in distinguishing similar symbols (such as those sharing a prefix), which we describe next. To evaluate symbolic reasoning with multi-token symbols, we train models with symbols of length up to n tokens for n ∈{2, 3, 4, 5}. We select the number of tokens for each symbol in a problem uniformly at random. We use b = 26 base tokens (corresponding to lowercase English alphabet letters). For each length 1, ..., n, we generate up to m = 1000 symbols as follows. Using a corpus of 10K most frequent English words (Kaufman, 2021), we estimate the probability of the first letter and the next letter given the current letter, using counts regularized with a pseudo-count of 1 (i.e., the Laplace estimator). We use these to generate new random words (symbols). We split the symbols of each length evenly into train and test sets. We ensure that the train two-token symbols contain all tokens; thus, there are no unseen tokens or collapsed embeddings in this experiment, eliminating the problems stemming from the undesired behavior of the embeddings. We evaluate models on the test symbols of length up to n and on new symbols of length n + 1. Additionally, to make evaluation more adversarial, we create examples where the query shares a prefix with the longest non-query predicate (e.g., ’abcde’ and ’abcdf’), making the two more easily confusable. The results are shown in Figure 12. While models generalize well to problems involving test symbols of lengths seen during training, performance gets worse for symbols that are longer by even one token. By inspecting the samples containing errors, we observe the following qualitative behaviors. Models trained with n = 2 and n = 3 tend to truncate symbols to n tokens; these errors also occur for higher n, but less frequently. This finding may be of practical interest, as typical math datasets involve short variable names that are unlikely to be tokenized to more than 3 tokens, and similar truncations have also been 16 --- Page 17 --- Preprint. Under review. observed in large state-of-the-art models (Malek et al., 2025). For n ≥4, accuracy is lower on n + 1-token symbols when the query shares a prefix with another variable, suggesting that their representations start to collapse (this phenomenon was studied in more detail by Barbero et al. (2024)). Models evaluated on 5-token or 6-token symbols also occasionally hallucinate – go into a loop repeating a set of symbols. We speculate that this behavior is induced by long unlikely sequences of tokens. Thus, while using multi-token variables and using the same symbol tokens at train and test time resolves some of the issues studied in our work (specifically, embedding collapse and generating unseen tokens), it also comes with its own set of issues requiring different interventions, and is an interesting direction for future work. G Additional Gemma experiments 0 250 500 750 1000 singular value rank i 0.00 0.01 0.02 0.03 0.04 0.05 IT(i) PT(i) Gemma3 1B 0 500 1000 1500 2000 2500 i 0.0025 0.0000 0.0025 0.0050 0.0075 0.0100 0.0125 Gemma3 4B 0 1000 2000 3000 4000 i 0.004 0.003 0.002 0.001 0.000 0.001 0.002 0.003 Gemma3 12B 0 1000 2000 3000 4000 5000 i 0.004 0.003 0.002 0.001 0.000 0.001 0.002 Gemma3 27B Figure 13: Difference between the sorted singular values of the Frobenius-normalized (un)embeddings of instruction-tuned (IT) and pre-trained (PT) Gemma 3 models. We observe that the 1B, 4B, and 12B IT models have larger top two singular values than the PT versions, suggesting some collapse during the IT process. The 27B model has a larger top singular value. The effect is generally weaker in larger models which have a larger embedding dimension. 17

Chunks

Chunk 0

--- Page 1 --- Preprint. Under review.

Chunk 1

To See the Unseen: on the Generalization Ability of Trans- formers in Symbolic Reasoning Nevena Lazi´c Liam Fowl András György Csaba Szepesvári Abstract We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reason- ing problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not ob- served during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens.

Chunk 2

We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and pro- vides a mechanistic explanation for the effectiveness of existing heuristic interventions like "active forgetting", which periodically reset the token (un)embeddings.

Chunk 3

Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves general- ization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems.

Chunk 4

Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the corre- lated embeddings of these tokens are a poor initialization for finetuning applications.

Chunk 5

1 Introduction Modern transformer-based large language models (LLMs) (Vaswani et al., 2017) trained using next-token prediction have achieved significant success across a wide range of tasks, including benchmarks requiring formal reasoning such as mathematics and science. At the same time, researchers continue to uncover simple ways in which transformers fail on reasoning problems – for example, by changing names and numerical values or inserting irrelevant information (Jiang et al., 2024; Mirzadeh et al., 2024).

Chunk 6

This fragility suggests that LLMs are not yet capable of fully general reasoning and can rely on superficial patterns to produce answers. In this work, we investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning.

Chunk 7

Specifically, we study the models’ ability to execute in-context algorithms on inputs containing arbitrary new tokens as symbols (see Figure 1 for an example). Thus, the models need to generalize solely based on the problem structure and not based on the content of token embeddings.

Chunk 8

This problem was studied by Boix-Adserà et al. (2024) in the neural tangent kernel (NTK) regime, with only the (un)embedding1 parameters trainable.

Chunk 9

They showed that transformers can learn to reason with unseen tokens given 1We consider transformers with tied embedding and unembedding (first- and last-layer, resp.) matrices, which we will refer to as the (un)embedding parameters. Here the same parameter matrix is used for the embedding and unembedding layers, the latter being a transpose of the former.

Chunk 10

We will refer to these parameters as the (un)embedding. This architecture choice is used in Boix-Adserà et al.

Chunk 11

(2024) and also commonly used in practice, for example in the Gemma 3 models (Gemma Team et al., 2025). 1 arXiv:2604.21632v1 [cs.AI] 23 Apr 2026 --- Page 2 --- Preprint.

Chunk 12

Under review. Rules: If A then B.

Chunk 13

Facts: A. Query: B ?

Chunk 14

Reasoning: Facts: A. If A then B.

Chunk 15

Facts: A, B. Answer: yes Rules: If A then U.

Chunk 16

Facts: A. Query: U?

Chunk 17

Rules: If U1 then U2. Facts: U1.

Chunk 18

Query: U2? Figure 1: Left: An example of a propositional logic problem with reasoning used for training.

Chunk 19

Given a set of rules (definite clauses), a set of facts (true zero-order predicates), and a query predicate, the goal is to compute the truth value of the query. The reasoning trace (shown above) executes the forward-chaining algorithm, and stops if the query is proven true.

Chunk 20

Right: Two types of test examples. At test time, we use unseen symbols (denoted here by U, U1, U2) for either just the query variable, or for all variable names.

Chunk 21

highly diverse training data, but fail at tasks that require copying (or generating) unseen tokens, and proposed a simple architecture change to improve copying capabilities. Anand et al.

Chunk 22

(2025) studied generalization to unseen tokens empirically and found that periodically resetting the embeddings (or "active forgetting") during training helps, but did not provide an explanation beyond not storing information in the embeddings. Furthermore, they did not consider tasks requiring symbolic copying, which remains difficult with their approach.

Chunk 23

Our work identifies a key issue in generalization to unseen tokens that has been overlooked by previous work. We find that the (un)embeddings of the unseen tokens collapse to nearly the same vector during the course of training.

Chunk 24

This leads transformers to struggle when unseen tokens are used to represent different variable names, as the corresponding variables become nearly indistinguishable. We prove theoretically that this collapse is inevitable when training transformers using SGD with weight decay and layernorm, and provide empirical evidence of it occurring in practice.

Chunk 25

Our work shows that the architecture change proposed by Boix-Adserà et al. (2024) is ineffective on its own in the presence of multiple unseen variables.

Chunk 26

The identified embedding collapse also explains the usefulness of the periodic resetting strategy of Anand et al. (2025).

Chunk 27

We also show that reliable symbolic reasoning can be achieved using a combination of a copy-enabled architecture, high symbolic diversity in the data, and either freezing or periodically resetting (un)embeddings to prevent collapse. We validate our findings using extensive controlled experiments on a testbed of synthetic propositional logic problems similar to the examples in Figure 1.

Chunk 28

Beyond experiments on synthetic data, we also investigate the implications of our findings on larger open-weights models. We find that (un)embedding collapse is also present in the Gemma 3 model family (Gemma Team et al., 2025), which has 99 deliberately unused tokens for downstream use.

Chunk 29

We demonstrate empirically that the correlated embeddings of these tokens are a poor initialization for downstream finetuning applications. 2 Related Work Symbolic reasoning Several previous works have demonstrated that transformers suffer from token bias (Jiang et al., 2024): perturbing certain input tokens can induce predictable changes in the LLM output.

Chunk 30

Jiang et al. (2024) show that changing names, inserting celebrity references and irrelevant content, and replacing quantifiers with synonyms degrades LLM performance on logical fallacies.

Chunk 31

Mirzadeh et al. (2024) introduce a version of the GSM8k mathematical reasoning benchmarks with modified numerical values and inserted irrelevant information.

Chunk 32

McCoy et al. (2024a;b) show that the accuracy of LLMs is heavily influenced by the likelihood of task formulations, inputs, or outputs appearing in the training data.

Chunk 33

All these works suggest that transformers can rely on statistical shortcuts rather than performing true reasoning. Boix-Adserà et al.

Chunk 34

(2024) study the ability of transformers to reason with abstract symbols on relational tasks, in the neural tangent kernel (NTK) limit. They show that on regression-type tasks (computing a numerical label based on a symbol pattern), transformers can generalize to unseen tokens when trained on data with a large amount of symbolic substitution.

Chunk 35

On the other hand, they show that tasks involving generating an unseen token (such as copying from context) are considerably more difficult and may require architectural changes. Anand et al.

Chunk 36

(2025) define structural in-context learning as the ability 2 --- Page 3 --- Preprint. Under review.

Chunk 37

of a model to execute in-context learning (ICL) on arbitrary novel tokens, which is similar to symbolic reasoning. They study structural ICL on simple part-of-speech classification problems, and find that it is transient.

Chunk 38

They observe that it can be maintained with an active forgetting strategy (Chen et al., 2023), which resets the embedding matrix every k steps, and propose temporary active forgetting, where this is only done at the beginning of training. However, their work does not explain any mechanisms behind the failure.

Chunk 39

Copying architectures Multiple works explore neural architectures that enable copying of inputs, where attending to input tokens increases the probability of copying them. Ontanon et al.

Chunk 40

(2022) mix a distribution over input tokens with the standard transformer predictive distribution and find that this is helpful in solving compositional tasks that involve symbolic reasoning. Boix-Adserà et al.

Chunk 41

(2024) mix the copying component at the level of logits, and our work studies a similar architectural change. Similar copying mechanisms to these have previously been studied in sequence-to-sequence models, starting with the pointer networks of Vinyals et al.

Chunk 42

(2015) (see, e.g., See et al., 2017; Gu et al., 2016). In-context learning (ICL) Token bias can partially be attributed to models inappropriately using in-weights knowledge in place of context information.

Chunk 43

A recent line of works examines the effect of distributional properties of the training data on the emergence of ICL (Chan et al., 2022a; Reddy, 2023; Singh et al., 2023; Chan et al., 2025). These works find that ICL can emerge in transformers when there is a large number of classes and large within-class variation.

Chunk 44

Chan et al. (2022a;b) further observe that ICL and in-weight learning (IWL) can emerge simultaneously when the distribution over classes is Zipfian.

Chunk 45

Chan et al. (2025) propose a simplified model that uses a gating mechanism to choose between IWL and ICL predictors based on test error.

Chunk 46

Since the test error is unavailable, it is approximated by the best choice in hindsight based on the difference between the losses of the two predictors. Multiple works have noticed that ICL can become transient in an asymptotic training regime (Chan et al., 2022a;b; 2025; Panwar et al., 2024).

Chunk 47

In the setting of ICL for ridge regression, Raventós et al. (2023) show that models trained on a small number of tasks behave like the Bayesian estimator over the training-task distribution, but is able to generalize to unseen tasks (implementing essentially ridge regression) when the number of training tasks is sufficiently large.

Chunk 48

He et al. (2024) study ICL for modular arithmetic and find that models transition from in-distribution to out-of-distribution generalization as the number of pretraining tasks increases.

Chunk 49

A common observation in these papers is that data diversity is helpful for ICL, and we observe the same in our problems. Chain-of-thought (CoT) reasoning CoT reasoning enables LLMs to generate intermediate computational steps prior to finalizing the answer to a question.

Chunk 50

CoT has played a key role in LLMs achieving exceptional performance on tasks such as mathematical problem-solving and code generation (Chowdhery et al., 2023; Achiam et al., 2023; Anil et al., 2023; Trinh et al., 2024; Romera-Paredes et al., 2024). Here, we study problems in which CoT reasoning requires computing quantities based on previously unseen symbolic variables, as well as referring to unseen symbolic variables by copying them.

Chunk 51

Randomized embeddings Randomized position embeddings have been shown to help with length generalization in arithmetic tasks (Ruoss et al., 2023; Shen et al., 2023). This is because conventional positional embeddings are out-of-distribution when increasing the sequence length beyond the training-time window.

Chunk 52

Our work shows that using randomized embeddings for all tokens can be helpful in generalization to unseen tokens. 3 Transformer architecture Given a vocabulary V, a decoder-only transformer with parameters θ maps a sequence of tokens (x1, .

Chunk 53

. .

Chunk 54

, xn) ∈Vn to a probability distribution over V, denoted by pθ(·|x1, . .

Chunk 55

. , xn).

Chunk 56

Let X be an n × |V| matrix of stacked indicator vectors for the input tokens. The architecture of an L-block transformer with a single attention head per block is as follows.2 The tokens 2We describe a single attention head for simplicity of exposition.

Chunk 57

Our empirical results rely on multi-head attention. 3 --- Page 4 --- Preprint.

Chunk 58

Under review. are first embedded into a d-dimensional space using the embedding/unembedding matrix WE as Y0 = XWE.

Chunk 59

The L transformer blocks have the following form, defined recursively over the transformer blocks i = 0, 1, 2, . .

Chunk 60

. , L −1: Y0 = XWE, Ai = attni(layernorm(Yi)), Yi+1 = Yi + Ai + MLPi(layernorm(Yi + Ai)) where • layernorm (Ba et al., 2016) normalizes the per-example activations of a layer to have mean and standard deviation close to 0 and 1, respectively, and optionally rescales them by a learned parameter; • MLPi is a feed-forward network mapping its input as MLPi(Y) = max(0, YW(i) 1 + b(i) 1 )W(i) 2 + b(i) 2 , where W(i) 1 , b(i) 1 , W(i) 2 , and b(i) 2 are trainable; • attni computes a linear projection of the inputs into query, key, and value matrices using trainable parameters W(i) Q , W(i) K , W(i) V , respectively.

Chunk 61

For each token, the query and key are used to compute a weighted combination of the values of the preceding tokens: attni(Y) = softmax(M ⊙(YW(i) Q W(i)⊤ K Y⊤)/ √ d)YW(i) V W(i) O where M is a causal mask and W(i) O is the output projection matrix. We incorporate position information by applying RoPE (Su et al., 2024) to the query and key matrices.

Chunk 62

The next-token probabilities are computed as p(·|X) = softmax(layernorm(YL)W⊤ E ). This is an n × |V| matrix, whose last row gives the probability of the (n + 1)st token given x1, .

Chunk 63

. .

Chunk 64

, xn. Note that we assume that the embedding and unembedding matrices are tied (WE and W⊤ E , resp.).

Chunk 65

Copy attention heads As discussed earlier, copying symbols from the input to the output is often needed when reasoning about in-context information (such as in the logic tasks of Figure 1). Unfortunately, this can be challenging in the case of unseen tokens that models have been trained not to generate.

Chunk 66

Here we describe a minor architecture change that is helpful in copying both seen and unseen symbols. Let Aj denote the jth row of a matrix A.

Chunk 67

Assuming that the rows of WE are almost orthogonal and almost normalized, a transformer will copy a token xi to position j + 1 if layernorm(YL)j ≈WE,xi. However, it may be difficult for the transformer to propagate an arbitrary input embedding xi through L layers involving MLPs.

Chunk 68

We thus augment the architecture with a shortcut to the input embeddings, which we name copy attention, implemented as follows: copyattn(Y) = softmax(M ⊙(YWQW⊤ K Y⊤)/ √ d)XWE. Compared to standard attention, we replace the value and output projection YWVWO with the input embeddings XWE.

Chunk 69

Thus, for each i, the ith row of copyattn(Y) will be a weighted combination of the input embeddings of tokens x1, ..., xi. We now compute the next-token distribution as Z = copyattn(YL−1), pcopy(·|X) = softmax(layernorm(YL + Z)W⊤ E ).

Chunk 70

Thus, ignoring layernorm, if the embeddings WE are approximately orthonormal and if the copy head sharply attends to a token, it will increase its output probability. Empirically we observe similar benefits of copy attention with and without output layernorm.

Chunk 71

The described copy attention is a modified version of the architecture change proposed by Boix-Adserà et al. (2024) for a single-layer transformer without MLPs, which outputs attn(X) = softmax(M ⊙(XWEWQW⊤ K W⊤ E X⊤)/ √ d)XWE(WVWO + bI), where b is a train- able scalar and I is the identity matrix.

Chunk 72

Effectively, this change adds the scaled input embeddings bXWE to the standard value-output parameters WVWO. The main difference in our implementation is that we use separate attention heads for copying (i.e.

Chunk 73

additional query and key parameters). This enables us to keep the MLPs in the standard attention blocks while not applying them to copy attention, and to trade-off between copying and standard generation more flexibly than via a single parameter.

Chunk 74

4 --- Page 5 --- Preprint. Under review.

Chunk 75

4 Collapse of softmax weights for unseen labels As we will demonstrate, transformers struggle with symbolic reasoning that requires copy- ing unseen tokens, and one of the reasons for this failure is the collapse of the unembeddings corresponding to those tokens. In this section we provide some insight into this phenomenon.

Chunk 76

Specifically, we show that once the model fits the data sufficiently well (to be specified next), each step of ℓ2-regularized gradient descent (GD) or stochastic gradient descent (SGD) will decrease the distance between the unembeddings of unseen tokens. In what follows, we consider the general problem of training a softmax classifier on a dataset D = {(xn, yn)}N n=1 of inputs x ∈Rd′ and labels y ∈[K].

Chunk 77

Let ϕ(·|θ) : Rd′ →B(r, d) be a neural network, parameterized by θ, mapping the input features to a d-dimensional ball of radius r, B(r, d) = {ψ ∈Rd : ∥ψ∥≤r}. We will use the shorthand ϕn := ϕ(xn|θ) and ϕt n := ϕ(xn|θ(t)), where θ(t) are the parameters at training step t.

Chunk 78

We consider a model in which the probability of a label is given by a softmax: p(y|x, W, θ) = exp(ϕ(x|θ)⊤Wy) ∑K k=1 exp(ϕ(x|θ)⊤Wk) (1) where Wk are the weights corresponding to class k. The parameters (W, θ) are optimized by minimizing the ℓ2-regularized negative log-likelihood with some regularization coefficient λ > 0: Lλ(W, θ|D) = −1 N N ∑ n=1 ln p(yn|xn, W, θ) + 0.5λ∥W∥2 F The gradient of the loss with respect to parameters Wi is gi := 1 N ∑N n=1(p(i|xn, W, θ) −I[i = yn])ϕn + λWi, where I[i = yn] is a label indicator function.

Chunk 79

4.1 Contraction of unseen-label weights for GD/SGD Let W(t) denote the softmax weights at step t. Suppose that none of the examples in the dataset have the label i or label j.

Chunk 80

The following lemma shows that in this case the parameter vectors Wi and Wj may contract: Lemma 4.1. Suppose that we update the weights of the softmax classifier in Eq (1) using gradient descent with learning rate schedule ηt.

Chunk 81

If labels i and j are not present in the data, then the weights at time t satisfy W(t+1) j −W(t+1) i ≤ 1 −ληt + ηtr2pt W(t) j −W(t) i where pt := maxk∈{i,j},n∈[N] p(k|xn, W(t), θ(t)). The proof is given in Appendix A.

Chunk 82

The same result holds for stochastic gradient descent by replacing the full batch with a minibatch, and having pt be the maximum probability of any unseen class for any minibatch example. We note that the result holds regardless of whether the parameters θ(t) are fixed or changing over time, as long as the feature function ϕ(·|θ(t)) has bounded output.

Chunk 83

Thus, Lemma 4.1 applies, for example, to transformers with layernorm in the last layer. Lemma 4.1 implies that the weights for a pair of unseen labels will contract whenever r2pt < λ.

Chunk 84

Furthermore, if our learning rate ηt is bounded away from zero, and assuming that the classifier has sufficient capacity to fit the data so that the probability of unseen labels i, j approaches zero for training examples, we see the corresponding weight entries collapse. Thus we have shown that a step of ℓ2-regularized GD/SGD will contract the last-layer weights of unseen labels if the input features have bounded norm and the model fits the data well, even if the features are changing over time.

Chunk 85

This setting corresponds to modern deep neural networks which overfit training data and use layer normalization to ensure boundedness of the learned features. 5 --- Page 6 --- Preprint.

Chunk 86

Under review. 0 10000 20000 30000 40000 50000 steps 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 mean abs cos similarity seen tokens, cos sim = 0, no layernorm = 0, layernorm = 0.01, layernorm = 0.1, layernorm 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 mean abs cos similarity unseen tokens, cos sim Figure 2: Mean pairwise cosine similarities of the unembeddings for different hyperparame- ter choices (note the different scale of the vertical axis in the two graphs).

Chunk 87

In the presence of layernorm, the unembeddings of unseen tokens collapse to nearly the same-direction vector (also note that the actual magnitudes are also very close). This is consistent with Lemma 4.1 which requires the inputs to the softmax function to be bounded (i.e., normalized).

Chunk 88

Collapse is more severe for higher regularization λ. 4.2 Empirical collapse of unseen-token (un)embeddings under AdamW While our analysis considers SGD, in practice transformers are often trained using the AdamW optimizer (Loshchilov et al., 2017), for which obtaining similar results is more involved.

Chunk 89

In this section we empirically demonstrate that similar collapse also happens when training transformers using AdamW. In particular, we train transformers on logic problems using a vocabulary with 100 deliberately heldout tokens (see Section 5 for full experimental details), using AdamW with different regularization strengths λ, with and without layernorm in the last layer.

Chunk 90

Figure 2 shows the mean cosine similarity between the (un)embeddings of seen tokens, and between the (un)embeddings of unseen tokens. We observe that unseen tokens are much more correlated than seen tokens, especially in the presence of layer normalization, even without regularization (λ = 0).

Chunk 91

5 Experiments on symbolic propositional logic In this section we study how transformers can learn to reason in the context of proposi- tional logic problems, and demonstrate the issues and solutions highlighted earlier about generalization to unseen symbols. Propositional logic problem We create synthetic propositional logic problems involving only definite clauses (also known as Horn clauses).

Chunk 92

The goal is to determine the truth value of a query predicate given a set of rules (definite clauses, e.g., ’If A then B’) and facts (true zero-order predicates, e.g., ’A is true’). We generate random synthetic problems similarly to the rule-priority recipe of Zhang et al.

Chunk 93

(2023), but additionally include chain-of-thought (CoT) reasoning that can be used to derive the answer. The CoT is generated by running the forward-chaining algorithm, which iteratively applies the modus ponens rule to derive new facts until either the query predicate is proven or no further predicates can be proven true.

Chunk 94

If a predicate cannot be proven true, it is false. See Figure 1 for an example and data format.

Chunk 95

For logic problems involving only Horn clauses, modus ponens is known to be sound and complete: any predicate entailed by the rules and facts can be derived by iteratively applying modus ponens. The generated examples contain up to 20 predicates per question and up to 40 rules per question.

Chunk 96

We ensure balanced labels by selecting a true predicate as the query with probability 0.5. Model We train a small 10-layer decoder-only model (60M-75M parameters, depending on the number of predicates) from scratch on this data using the NanoDO library (Liu et al., 2024) and the AdamW optimizer (Loshchilov et al., 2017) with weight decay λ = 0.001.

Chunk 97

Rather than using a fixed dataset, we continually generate fresh examples on-the-fly during training. We only train the model to predict the tokens corresponding to the reasoning and 6 --- Page 7 --- Preprint.

Chunk 98

Under review. 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy trainable WE all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE, copy attn all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE, copy attn unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 trainable WE, copy attn all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy frozen WE all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE, copy attn all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE, copy attn unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 frozen WE, copy attn all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 50k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 10k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all predicates unseen Figure 3: Propositional logic evaluation accuracy for models trained on single-token symbols.

Chunk 99

We evaluate accuracy with (1) all tokens seen, (2) query unseen, (3) all predicates unseen. The rows correspond to the unembeddings being (1) trainable (2) frozen (3) periodically reinitialized every 100 steps throughout training, and (4) periodically reinitialized every 100 steps for the first 10k steps.

Chunk 100

We train each model with and without copy attention (right and left side of the figure). the answer, that is, we mask out the question tokens.3 We experiment both with standard architectures and copy attention, using a custom vocabulary to control the tokenization of the problems.

Chunk 101

See Appendix B for further details on the data, architecture, and training. Experiment settings In this setting, each logic predicate corresponds to a single token.

Chunk 102

The predicates for each problem are selected uniformly at random from a set P (and thus the size of P serves as a proxy for symbolic diversity). The vocabulary for each training task includes 100 unused tokens which are subsequently used as predicates during evaluation.

Chunk 103

We train models with and without copy attention, and with trainable, frozen, and periodically reinitialized (un)embeddings. We vary the size of the training predicate vocabulary ∥P∥∈{150, 600, 2400, 9600} as a proxy for diversity.

Chunk 104

We evaluate the accuracy of the trained models on a randomly-generated set of 2048 examples with (1) all tokens seen, (2) query unseen, and (3) all predicates unseen. Results and discussion The results are shown in Figure 3.

Chunk 105

All model versions perform well on in-distribution data (seen tokens). The standard transformer fails when one or more symbols are replaced by unseen tokens.

Chunk 106

Examining outputs reveals that this is because the model learns not to generate the unseen token (needed for reasoning). The model instead either ignores the token or replaces it with a seen predicate (see Figure 9 for some examples of generated CoT in this case).

Chunk 107

Adding copy attention and increasing diversity results in good performance on the evaluation with a single unseen token. In fact, early in the training, models with vocabulary size 600, 2400, and 9600 all generalize well and copy the unseen 3In our initial experiments, we found that training the model to predict the randomly-generated questions in addition to answers led to worse performance and increased hallucinations in the CoT.

Chunk 108

7 --- Page 8 --- Preprint. Under review.

Chunk 109

token in their CoT, but only the model with |P| = 9600 keeps this inductive bias. This result is consistent with existing works on in-context learning (ICL), which show that a certain diversity threshold is necessary for generalization (Raventós et al., 2023; He et al., 2024), and that ICL is transient (Chan et al., 2022a; Panwar et al., 2024; Chan et al., 2025).

Chunk 110

Copy attention fails in the case where all symbolic variables are replaced by unseen tokens, regardless of diversity. This is explainable at least in part by the collapse of the unseen (un)embeddings, which makes these variables difficult to distinguish.

Chunk 111

Freezing the embeddings to their initialization during training (which are approximately orthogonal and hence well-distinguishable for different tokens) allows all models to gen- eralize to unseen tokens, with the copy-attention models achieving higher accuracy (or having a lower diversity threshold). Similarly, active forgetting (Chen et al., 2023) (reinitial- izing embeddings every 100 steps) generalizes to unseen tokens given sufficient diversity and/or an architecture with copy heads.4 These results further support our hypothesis that symbolic reasoning is impeded by learning not to emit unseen tokens and the unembedding collapse.5 Unfortunately, freezing and reinitializing are somewhat unsatisfactory solutions for general-purpose models, which typically benefit from trainable embeddings.6 Finally, we evaluated temporary active forgetting (Anand et al., 2025) (reinitializing for the first 10k steps of the training).

Chunk 112

We observed that it does not generalize well for the standard archi- tecture (no copy attention) – performance collapses once reinitialization stops. However, it does generalize well with the copy attention architecture for all but the smallest vocabulary P.

Chunk 113

While the unseen unembeddings again collapse in this setting (having a cosine similarity of about 0.9), the model with copy attention appears to attend to tokens sharply enough to distinguish different symbols. 6 Experiments on Gemma3 In this section, we investigate whether our findings related to the collapse of embeddings hold for larger (open-weights) models, namely the Gemma 3 series (Gemma Team et al., 2025).

Chunk 114

These models use tied embedding and unembedding parameters, as assumed throughout. Collapse of unused tokens The Gemma 3 tokenizer vocabulary is of size 262,144 and includes 99 deliberately unused tokens.

Chunk 115

For each instruction-tuned model size (1B, 4B, 12B, 27B), with embedding dimensions (1152, 2560, 3840, 5376) respectively, we compute the cosine similarities of the embeddings of the unused tokens and a randomly chosen subset of 100 used tokens. The results are shown in Figure 4: in all models, the unused tokens have a higher cosine similarity than the used ones.

Chunk 116

The mean cosine similarities are (0.78, 0.33, 0.23, 0.24) for the unused tokens and (0.09, 0.03, 0.02, 0.01) for the used tokens.7 Thus the effect of the collapse is smaller for larger models. Finetuning with unused tokens To investigate whether the collapse of unused tokens is an issue for downstream applications relying on such tokens, we finetune Gemma 3 1B IT on propositional logic data constructed similarly as before, with 80 symbols (predicates) corresponding to (1) unused tokens, (2) English adjectives, and (3) ASCII lowercase and uppercase letters and repeated letters (e.g., ’aa’).

Chunk 117

We verify that all symbols correspond to single tokens. We finetune all parameters using AdamW with a constant learning rate 5 · 10−5 and parameters β1 = 0.9, β2 = 0.999, ϵ = 10−8, λ = 10−4, using batch size of 128.

Chunk 118

We show evaluation accuracy on a holdout dataset of 512 examples (with the same 4We tried reinitializing every 100 and every 1k steps, for 10k steps or throughout training; see Appendix C for full experiments. 5We have also tried freezing / reinitializing only the embeddings of the predicate tokens.

Chunk 119

Contrary to our intuition, this yielded worse generalization compared to freezing / reinitializing all embeddings, and requires further investigation. 6See also Figure 10 in the Appendix E for some empirical evidence in this regard, where we compare models with trainable and frozen embeddings on the C4 corpus (Raffel et al., 2020).

Chunk 120

7For the pretrained models, the mean cosine similarities are (0.79, 0.43, 0.33, 0.29) for the unused tokens and (0.06, 0.03, 0.02, 0.01) for the used tokens 8 --- Page 9 --- Preprint. Under review.

Chunk 121

Gemma3 1B IT Gemma3 4B IT Gemma3 12B IT Gemma3 27B IT 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 Figure 4: Cosine similarities between the embeddings of the 99 unused tokens and 100 randomly chosen tokens. In all models, the unused tokens are correlated more than the used ones, with the effect more pronounced the smaller the embedding dimension is.

Chunk 122

0 1000 2000 3000 4000 5000 step 0.0 0.2 0.4 0.6 0.8 1.0 accuracy unused 0 1000 2000 3000 4000 5000 step 0.0 0.2 0.4 0.6 0.8 1.0 accuracy adjectives 0 1000 2000 3000 4000 5000 step 0.0 0.2 0.4 0.6 0.8 1.0 accuracy letters Figure 5: Accuracy when finetuning Gemma 3 1B IT on logic problems with predicates corresponding to unused tokens, English adjectives, and letters. Curves correspond to random seeds.

Chunk 123

With adjectives and letters, accuracy reaches 90% in about 500 finetuning steps, whereas with unused tokens reaching 90% accuracy requires about 5000 steps, which may be an artifact of embedding collapse. predicates as the trained model) in Figure 5 for 4 random training runs.

Chunk 124

We observe that training using unused tokens can be slow: the runs using adjectives and letters as symbols reach ∼90% accuracy in under 500 steps, while the runs involving unseen tokens take about 5000 steps to reach similar performance. The runs involving letters are the most stable, perhaps because these are more commonly used as variable names in the training data.

Chunk 125

Collapse of rare tokens during instruction tuning Next we investigate whether collapse of embeddings occurs during conventional instruction tuning of models. We compute the singular value decomposition (SVD) of the (un)embedding matrices of Gemma 3 models, normalized by their Frobenius norm.

Chunk 126

In Figure 13, we plot the difference between the sorted singular values of IT (instruction-tuned) and PT (pre-trained) models. In all models, the IT version places more mass on the top two singular values than the corresponding PT version, with the effect most pronounced in the 1B version.

Chunk 127

We observe that tokens whose cosine similarity increases between PT and IT versions of Gemma 3 1B often correspond to html tags and non-Latin script symbols; for example the cosine similarity between </h6> and );"> increases from -0.044 to 0.273. Thus it is plausible that these tokens are not included or rarely seen in the instruction tuning data, though we cannot definitively verify this claim.

Chunk 128

We speculate that this increase in correlation does not meaningfully degrade capabilities. 7 Discussion We have investigated the ability of transformers to reason with abstract symbols represented using previously-unseen tokens.

Chunk 129

We found that generalization to unseen tokens is hindered by an inductive bias that causes their (un)embeddings to collapse, and proposed methods for addressing this collapse. Our work resolves the issues with architectural interventions proposed by Boix-Adserà et al.

Chunk 130

(2024) in the presence of multiple unseen variables, and also helps explain the benefits of temporary active forgetting (Anand et al., 2025). We have also observed some evidence of collapse in the open-weight Gemma 3 models, and found that finetuning these models is considerably slower with unseen tokens.

Chunk 131

Some limitations of our work are that the theoretical analysis of the collapse includes additional assumptions and that most of our experiments are on small models and only on propositional logic problems. Interesting directions for future work include strategies for improving finetuning with collapsed unused tokens, as well as investigating reasoning with multi-token symbols (a preliminary study on the latter is given in Appendix F).

Chunk 132

9 --- Page 10 --- Preprint. Under review.

Chunk 133

References Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.

Chunk 134

arXiv preprint arXiv:2303.08774, 2023. Suraj Anand, Michael A Lepori, Jack Merullo, and Ellie Pavlick.

Chunk 135

Dual process learning: Controlling use of in-context vs. in-weights strategies with weight forgetting.

Chunk 136

In The Thirteenth International Conference on Learning Representations, 2025. Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al.

Chunk 137

Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Chunk 138

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.

Chunk 139

arXiv preprint arXiv:1607.06450, 2016. Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G Araújo, Alex Vitvitskyi, Razvan Pascanu, and Petar Veliˇckovi´c.

Chunk 140

Transformers need glasses! in- formation over-squashing in language tasks.

Chunk 141

Advances in Neural Information Processing Systems, 37:98111–98142, 2024. Enric Boix-Adserà, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, and Joshua M Susskind.

Chunk 142

When can transformers reason with abstract symbols? In The Twelfth Interna- tional Conference on Learning Representations, 2024.

Chunk 143

Bryan Chan, Xinyi Chen, András György, and Dale Schuurmans. Toward understanding in-context vs.

Chunk 144

in-weight learning. In The Thirteenth International Conference on Learning Representations, 2025.

Chunk 145

Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers.

Chunk 146

Advances in neural information processing systems, 35:18878–18891, 2022a. Stephanie CY Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K Lampinen, and Felix Hill.

Chunk 147

Transformers generalize differently from information stored in context vs in weights. arXiv preprint arXiv:2210.05675, 2022b.

Chunk 148

Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Adelani, Pontus Lars Erik Saito Stenetorp, Sebastian Riedel, and Mikel Artetxe. Improving language plasticity via pretraining with active forgetting.

Chunk 149

Advances in Neural Information Processing Systems, 36:31543–31557, 2023. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al.

Chunk 150

Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.

Chunk 151

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.

Chunk 152

arXiv preprint arXiv:2503.19786, 2025. Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li.

Chunk 153

Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Chunk 154

Association for Computational Linguistics, 2016. Tianyu He, Darshil Doshi, Aritra Das, and Andrey Gromov.

Chunk 155

Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. Advances in Neural Information Processing Systems, 37:13244–13273, 2024.

Chunk 156

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).

Chunk 157

arXiv preprint arXiv:1606.08415, 2016. 10 --- Page 11 --- Preprint.

Chunk 158

Under review. Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo J Taylor, and Dan Roth.

Chunk 159

A peek into token bias: Large language models are not yet genuine reasoners. arXiv preprint arXiv:2406.11050, 2024.

Chunk 160

Josh Kaufman. google-10000-english.

Chunk 161

urlhttps://github.com/first20hours/google-10000-english, 2021. [Accessed 2025-12-14].

Chunk 162

Peter J. Liu, Roman Novak, Jaehoon Lee, Mitchell Wortsman, Lechao Xiao, Katie Everett, Alexander A.

Chunk 163

Alemi, Mark Kurzeja, Pierre Marcenac, Izzeddin Gur, Simon Kornblith, Kelvin Xu, Gamaleldin Elsayed, Ian Fischer, Jeffrey Pennington, Ben Adlam, and Jascha- Sohl Dickstein. Nanodo: A minimal transformer decoder-only language model imple- mentation in JAX., 2024.

Chunk 164

URL http://github.com/google-deepmind/nanodo. Ilya Loshchilov, Frank Hutter, et al.

Chunk 165

Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5:5, 2017.

Chunk 166

Alan Malek, Jiawei Ge, Nevena Lazi´c, Chi Jin, András György, and Csaba Szepesvári. Frontier llms still struggle with simple reasoning tasks.

Chunk 167

arXiv preprint arXiv:2507.07313, 2025. R Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D Hardy, and Thomas L Grif- fiths.

Chunk 168

Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences, 121(41): e2322420121, 2024a.

Chunk 169

R Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D Hardy, and Thomas L Grif- fiths. When a language model is optimized for reasoning, does it still show embers of autoregression?

Chunk 170

an analysis of openai o1. arXiv preprint arXiv:2410.01792, 2024b.

Chunk 171

Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models.

Chunk 172

In The Thirteenth International Conference on Learning Representations, 2024. Santiago Ontanon, Joshua Ainslie, Zachary Fisher, and Vaclav Cvicek.

Chunk 173

Making transformers solve compositional tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.

Chunk 174

3591–3607, 2022. Madhur Panwar, Kabir Ahuja, and Navin Goyal.

Chunk 175

In-context learning through the bayesian prism. In The Twelfth International Conference on Learning Representations, 2024.

Chunk 176

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.

Chunk 177

Journal of machine learning research, 21(140):1–67, 2020. Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli.

Chunk 178

Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems, 36:14228–14246, 2023.

Chunk 179

Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task.

Chunk 180

arXiv preprint arXiv:2312.03002, 2023. Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al.

Chunk 181

Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.

Chunk 182

Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers.

Chunk 183

arXiv preprint arXiv:2305.16843, 2023. Abigail See, Peter J Liu, and Christopher D Manning.

Chunk 184

Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.

Chunk 185

1073–1083, 2017. 11 --- Page 12 --- Preprint.

Chunk 186

Under review. Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang.

Chunk 187

Posi- tional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737, 2023.

Chunk 188

Aaditya Singh, Stephanie Chan, Ted Moskovitz, Erin Grant, Andrew Saxe, and Felix Hill. The transient nature of emergent in-context learning in transformers.

Chunk 189

Advances in Neural Information Processing Systems, 36:27801–27819, 2023. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu.

Chunk 190

Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Chunk 191

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.

Chunk 192

Nature, 625(7995):476–482, 2024. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.

Chunk 193

Attention is all you need. Advances in neural informa- tion processing systems, 30, 2017.

Chunk 194

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks.

Chunk 195

Advances in neural information processing systems, 28, 2015. Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, and Guy Van den Broeck.

Chunk 196

On the paradox of learning to reason from data. In IJCAI, 2023.

Chunk 197

12 --- Page 13 --- Preprint. Under review.

Chunk 198

A Proof of Lemma 4.1 Proof. Let pt i,n = p(i|xn, W(t), θ(t)).

Chunk 199

The weights satisfy W(t+1) j −W(t+1) i = (1 −ληt) W(t) j −W(t) i −ηt 1 N N ∑ n=1 pt j,n −pt i,n ϕt n (2) By the Lagrange mean-value theorem, exp(b) −exp(a) = exp(c)(b −a) for some c such that a < c < b. Assume without loss of generality that ϕt⊤ n W(t) j ≥ϕt⊤ n W(t) i .

Chunk 200

Setting a = ϕt⊤ n W(t) i −ln Zt(ϕt n) and b = xt⊤ n W(t) j −ln Zt(ϕt n), where Zt(ϕn) = ∑y exp(ϕ⊤ n W(t) y ) is the normalizer, we have that for some pt n such that min(pt i,n, pt j,n) ≤pt n ≤max(pt i,n, pt j,n) pt j,n −pt i,n = exp(ϕt⊤ n W(t) j −ln Zt(ϕt n)) −exp(ϕt⊤ n W(t) i −ln Zt(ϕt n)) = pt nϕt⊤ n W(t) j −W(t) i (3) Let pt = maxn pt n. Plugging (3) into (2) and using the fact that inputs lie in a ball of radius r gives the result.

Chunk 201

B Implementation details Architecture and training. Our implementation of transformer training and evaluation builds upon the NanoDO framework (Liu et al., 2024).

Chunk 202

For the simple logic experiments, we train a decoder-only transformer with 10 layers and embedding size 1024 split across 8 heads. With copy attention, we add a copying block with 8 heads.

Chunk 203

The architecture is as described in Section 3. We use MLPs with hidden size 1024 and GELU activations (Hendrycks & Gimpel, 2016).

Chunk 204

We use RoPE positional embeddings (Su et al., 2024). We train using the AdamW optimizer (Loshchilov et al., 2017) with decay 0.001, batch size 256, peak learning rate 0.0001, warmup and final rate of 0.00001, and 2000 warmup steps.

Chunk 205

We clip gradients by global norm 1. We use a custom tokenizer designed for logic data.

Chunk 206

For the single-token symbol experiments, the vocabulary includes |P| ∈{150, 600, 2400, 9600} zero-order predicates, the special tokens shown in Figure 6, pad token, and 100 additional unused tokens. The vocabulary size is further padded by unused tokens to a multiple of 4.

Chunk 207

For the multi-token symbol experiments, the vocabulary includes 26 baseline tokens and sequences of these tokens form predicates. For each predicate in each example, we sample its length i ∈[n] and tokens uniformly at random, while ensuring there are no duplicates.

Chunk 208

Facts EndFacts Rules EndRules Query EndQuery Answer Reasoning EndReasoning Newfact EndNewfact If then EndIf yes no and ? <BOS> <EOS> <PAD> <UNK> Figure 6: Special tokens in the vocabulary in addition to zeroth order predicates.

Chunk 209

Data. We generate propositional logic problems similarly to the rule-priority (RP) recipe described in Zhang et al.

Chunk 210

(2023). RP problems are generated by randomly sampling a subset of predicates, and randomly sampling facts and rules using those predicates.

Chunk 211

The facts and rules are then used to compute the truth values. To balance label probabilities, we select the query to be a true predicate with probability 0.5.

Chunk 212

The problems we generated had between 5 and 20 predicates and between 0 and 40 rules. The examples presented to the model are formatted as in Figure 7.

Chunk 213

We generated reasoning traces of the form in Figure 7 by running forward chaining until either the query predicate was proven true or no further predicates could be proven true, and generating text corresponding to the algorithm trace. 13 --- Page 14 --- Preprint.

Chunk 214

Under review. <BOS> Rules: If A and B then C EndIf If B then A EndIf Facts: B EndFacts Query: C ?

Chunk 215

Reasoning: Facts: B EndFacts If B then A EndIf Newfact A EndNewfact Facts: B and A EndFacts If A and B then C EndIf Newfact C EndNewfact Facts: B and A and C EndFacts Answer: yes EndAnswer <EOS> Figure 7: Formatting example for a propositional logic problem with reasoning used for training. The separator ’and’ is only used in the multi-token symbol experiments.

Chunk 216

0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 10k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 10k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (1k, 10k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 10k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (100, 50k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (100, 50k), copy all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 accuracy reinit WE (1k, 50k) all tokens seen | | = 150 | | = 600 | | = 2400 | | = 9600 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k) unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k) all predicates unseen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k), copy all tokens seen 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k), copy unseen query 0 10000 20000 30000 40000 50000 steps 0.0 0.2 0.4 0.6 0.8 1.0 reinit WE (1k, 50k), copy all predicates unseen Figure 8: Propositional logic evaluation accuracy for models trained on single-token symbols with (temporary) active forgetting. C Additional experiments with temporary active forgetting We evaluate the temporary active forgetting approach of Anand et al.

Chunk 217

(2025) on logic problems with single-token symbols. Here the (un)embeddings are periodically reinitialized every k1 steps for the first k2 steps of training.

Chunk 218

We consider the following values of (k1, k2): (100, 10k), (1k, 10k), (100, 50k), (1k, 50k). Note that we train for 50k steps in total, so some of the settings correspond to the (non-temporary) active forgetting approach of Chen et al.

Chunk 219

(2023). The results are shown in Figure 8.

Chunk 220

We observe that temporary active forgetting does not load to the desired inductive bias (symbolic reasoning), unless combined with copy attention. Active forgetting throughout the training can generalize with the vanilla architecture, provided sufficiently high symbolic diversity.

Chunk 221

14 --- Page 15 --- Preprint. Under review.

Chunk 222

Rules: If enchanting then UNK EndIf If perfect then silly EndIf [...] EndRules Facts: enchanting EndFacts Query: UNK EndQuery ? Reasoning: Facts: enchanting EndFacts If enchanting then perfect EndIf Newfact: perfect EndNewfact Facts: enchanting perfect EndFacts If enchanting then ugliest EndIf Newfact: ugliest EndNewfact Facts: enchanting perfect ugliest End- Facts If perfect then silly EndIf Newfact: silly EndNewfact Facts: enchanting perfect ugliest silly EndFacts EndReasoning Answer: no EndAnswer Rules: [...] EndRules Facts: hilarious perfect bored proud UNK EndFacts Query: UNK EndQuery ?

Chunk 223

Reasoning: Facts: hilarious perfect bored proud EndFacts EndReasoning Answer: no EndAnswer Figure 9: Examples of reasoning traces in the case of a single single-token symbolic variable. The model refuses to generate the unseen token in its reasoning trace, leading to CoT errors and eventually the wrong answer.

Chunk 224

Most mistakes occur when the true answer is ’yes’ and the model outputs ’no’. D Examples of model samples Figure 9 shows samples from the vanilla transformer trained on logic when the query (single-token) predicate is replaced by an unseen token.

Chunk 225

If the unseen token appears in rules, the model often replaces it by a seen token. If the unseen token appears in facts, the model often ignores it.

Chunk 226

Most model mistakes are false negatives. E Effect of freezing embeddings and copy attention on language modeling 0 20000 40000 60000 80000 100000 steps 3 4 5 6 7 8 9 10 C4 validation loss frozen WE 4 layers 8 layers 12 layers 16 layers 0 20000 40000 60000 80000 100000 steps 3 4 5 6 7 8 9 10 trainable WE 4 layers 8 layers 12 layers 16 layers 0 20000 40000 60000 80000 100000 steps 3 4 5 6 7 8 9 10 trainable WE, copy 4 layers 8 layers 12 layers 16 layers 0 20000 40000 60000 80000 100000 steps 0.0 0.2 0.4 0.6 0.8 trainable WE, vanilla - copy 4 layers 8 layers 12 layers 16 layers Figure 10: Evaluation loss on the C4 dataset with trainable and frozen embeddings and with copy attention.

Chunk 227

Freezing embeddings results in significantly higher validation loss compared to using trainable embeddings. Adding copy attention results in slightly smaller loss compared to the default architecture.

Chunk 228

To make the difference between the second and third figure visible, the last figure shows the difference in the validation loss of the second standard (vanilla) architecture and the one with the copy attention. In this section we train small transformers with 4, 8, 12, and 16 layers on the C4 dataset (Raffel et al., 2020), with trainable and frozen embeddings and with copy attention.

Chunk 229

We show the validation loss for all models in Figure 10. Freezing embeddings leads to considerably 15 --- Page 16 --- Preprint.

Chunk 230

Under review. worse evaluation loss and thus may not be a practical solution.

Chunk 231

Interestingly, copy attention has slightly better validation loss, especially early in the training and for small models. F Multi-token symbols Rules: If qwert, asdf then abcde.

Chunk 232

If zxcv then abcd. Facts: qwert, asdf.

Chunk 233

Query: abcd ? Figure 11: Stylized example where the multi-token query shares a prefix with another predicate.

Chunk 234

0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 accuracy train/test english n tokens, test n = 2 n = 3 n = 4 n = 5 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 train/test english n tokens, test, prefix 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 train/test english n + 1 tokens 0 20000 40000 60000 80000 100000 steps 0.5 0.6 0.7 0.8 0.9 1.0 train/test english n + 1 tokens, prefix Figure 12: Evaluation of models trained on symbols of up to n tokens on problems with symbols of length up to n, and length n + 1. ’prefix’ in figure title means that the query shares a prefix with another predicate, making them more easily confusable.

Chunk 235

While our work focuses primarily on unseen tokens, here we also conduct a small prelimi- nary study on unseen multi-token variables (where all tokens are seen, but not all combinations are seen). This case has not been studied systematically in literature to the best of our knowl- edge, but it is related to much existing works on perturbing reasoning benchmarks by changing names and numerical values and using synonyms (Jiang et al., 2024; Mirzadeh et al., 2024).

Chunk 236

While embedding collapse is no longer an issue here, we observe other prob- lems, including length generalization and difficulty in distinguishing similar symbols (such as those sharing a prefix), which we describe next. To evaluate symbolic reasoning with multi-token symbols, we train models with symbols of length up to n tokens for n ∈{2, 3, 4, 5}.

Chunk 237

We select the number of tokens for each symbol in a problem uniformly at random. We use b = 26 base tokens (corresponding to lowercase English alphabet letters).

Chunk 238

For each length 1, ..., n, we generate up to m = 1000 symbols as follows. Using a corpus of 10K most frequent English words (Kaufman, 2021), we estimate the probability of the first letter and the next letter given the current letter, using counts regularized with a pseudo-count of 1 (i.e., the Laplace estimator).

Chunk 239

We use these to generate new random words (symbols). We split the symbols of each length evenly into train and test sets.

Chunk 240

We ensure that the train two-token symbols contain all tokens; thus, there are no unseen tokens or collapsed embeddings in this experiment, eliminating the problems stemming from the undesired behavior of the embeddings. We evaluate models on the test symbols of length up to n and on new symbols of length n + 1.

Chunk 241

Additionally, to make evaluation more adversarial, we create examples where the query shares a prefix with the longest non-query predicate (e.g., ’abcde’ and ’abcdf’), making the two more easily confusable. The results are shown in Figure 12.

Chunk 242

While models generalize well to problems involving test symbols of lengths seen during training, performance gets worse for symbols that are longer by even one token. By inspecting the samples containing errors, we observe the following qualitative behaviors.

Chunk 243

Models trained with n = 2 and n = 3 tend to truncate symbols to n tokens; these errors also occur for higher n, but less frequently. This finding may be of practical interest, as typical math datasets involve short variable names that are unlikely to be tokenized to more than 3 tokens, and similar truncations have also been 16 --- Page 17 --- Preprint.

Chunk 244

Under review. observed in large state-of-the-art models (Malek et al., 2025).

Chunk 245

For n ≥4, accuracy is lower on n + 1-token symbols when the query shares a prefix with another variable, suggesting that their representations start to collapse (this phenomenon was studied in more detail by Barbero et al. (2024)).

Chunk 246

Models evaluated on 5-token or 6-token symbols also occasionally hallucinate – go into a loop repeating a set of symbols. We speculate that this behavior is induced by long unlikely sequences of tokens.

Chunk 247

Thus, while using multi-token variables and using the same symbol tokens at train and test time resolves some of the issues studied in our work (specifically, embedding collapse and generating unseen tokens), it also comes with its own set of issues requiring different interventions, and is an interesting direction for future work. G Additional Gemma experiments 0 250 500 750 1000 singular value rank i 0.00 0.01 0.02 0.03 0.04 0.05 IT(i) PT(i) Gemma3 1B 0 500 1000 1500 2000 2500 i 0.0025 0.0000 0.0025 0.0050 0.0075 0.0100 0.0125 Gemma3 4B 0 1000 2000 3000 4000 i 0.004 0.003 0.002 0.001 0.000 0.001 0.002 0.003 Gemma3 12B 0 1000 2000 3000 4000 5000 i 0.004 0.003 0.002 0.001 0.000 0.001 0.002 Gemma3 27B Figure 13: Difference between the sorted singular values of the Frobenius-normalized (un)embeddings of instruction-tuned (IT) and pre-trained (PT) Gemma 3 models.

Chunk 248

We observe that the 1B, 4B, and 12B IT models have larger top two singular values than the PT versions, suggesting some collapse during the IT process. The 27B model has a larger top singular value.

Chunk 249

The effect is generally weaker in larger models which have a larger embedding dimension. 17

Back to search