Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

Chris Schneider¹, Philipp Schoenegger¹, Ben Bariach¹
¹Microsoft AI

Abstract

Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-layer architecture that decouples personal data from shared weights by combining a static base model, composable domain-expert LoRA adapters that shape behavior without imparting user data, and a per-user proxy artifact whose deletion constitutes deterministic unlearning. Evaluation on Phi-3.5-mini and Llama-3.1-8B confirms per-user differentiation in which personal data influences outputs while remaining isolated, verified by a return to baseline after proxy removal (KL ≈ 0.21 nats, 82–89% verification pass rate) and near-zero cross-user contamination. Because user-specific information never enters shared weights, the architecture mitigates model inversion, membership inference, and training-data extraction against shared model components by construction. The approach converts machine unlearning from an intractable weight-editing problem into a deterministic deletion operation that preserves personalization alongside privacy-enhancing guarantees and is compatible with differentially private stochastic gradient descent (DP-SGD) for privacy-preserving shared model improvement.

1 Introduction

As LLM personalization becomes widely used, a growing body of work has demonstrated that user preferences can be captured through retrieval-augmented profiles [1], post-hoc parameter merging [2], and personalized reward learning [3, 4].
While some of these approaches operate at the prompt level (e.g., retrieval-augmented profiles), many encode user-specific information into model weights θ via fine-tuning, producing models whose parameters entangle contributions from many users. When a user later requests deletion, it is unclear how one can remove their data from a model whose weights have been shaped by thousands of users simultaneously.

This suggests that there is a fundamental tension between personalization and data deletion in the context of modern LLMs. When user preferences are distributed across shared weights, deletion requires identifying and removing each user’s contribution, a problem that has been shown to be computationally intractable without full retraining [5]. Exact unlearning methods like SISA [5] require maintaining independently trained model shards, while approximate methods offer no formal removal guarantees [6]. LLM-specific approaches face additional difficulties: gradient ascent can cause catastrophic collapse in certain unlearning configurations [7], and representation-level methods like RMU [8] still modify shared weights. This problem is compounded by extraction attacks, including model inversion [9], training data extraction [10, 11], and membership inference [12], which can recover private information from weight-encoded personalization, making it a privacy issue even absent deletion requests.

To illustrate this, consider a personalized assistant that has learned a user’s medical vocabulary preferences through fine-tuning.
Even after the user requests deletion, membership inference attacks could reveal whether that user’s data was part of the training set, while training data extraction could recover specific preference examples, all because the user’s influence remains distributed across millions of shared parameters.

[Figure 1 panels. Shared components (no user data): base model θ (shared), expert router, experts E1 Security, E2 Code, E3 Data, E4 General, merged as $W_{\text{base}} + \sum_i w_i B_i A_i$. User proxy $P_u$ (all user-specific data; removable, file deletion = full erasure): routing bias $b_u \in \mathbb{R}^k$ (domain preference scores), personal LoRA $L_u = (B_u, A_u)$ (user-specific weight residuals), steering vectors $\{s_u^{\ell}\}_{\ell \in L}$ (style and preference modifiers). Arrows: merge, inject; query q in, output out.]

Figure 1: Separable Expert Architecture. Shared components (left) contain no user-specific information: a frozen base model, four domain-expert LoRA adapters selected by a per-query router, and a weighted merge. The per-user proxy (right, dashed red border) holds three deletable personalization mechanisms (routing bias, personal LoRA, and contrastive steering vectors) that compose with shared components at inference via cross-boundary arrows. The vertical dashed line marks the separation boundary, where deleting the proxy directory removes all user-specific influence with zero retraining.

arXiv:2604.21571v1 [cs.AI] 23 Apr 2026

In order to address this issue, we propose the Separable Expert Architecture (SEA), a design that aims to satisfy both personalization and deletability simultaneously. The core contribution is that if user-specific information never enters shared weights, “unlearning” is essentially just deletion. Rather than trying to surgically undo weight entanglement after the fact, this approach prevents entanglement from occurring in the first place.
In other words, this requires an architecture where personalization is compositional, i.e., assembled at inference time from separable, deletable components, rather than absorptive, where preferences are baked into shared parameters.

Contributions. We make three contributions:

1. A three-layer composition architecture where a base model (frozen, shared) is augmented by domain-expert LoRA adapters (shared, dynamically weighted by a query router) and per-user proxy artifacts, which are isolated directories containing a routing bias vector, contrastive steering vectors, and a personal LoRA adapter (∼2–5 MB per user in our configuration). The architecture maintains a strict invariant: all user-specific information resides in a deletable artifact that never enters shared weights (§2).
2. A deletion protocol that reduces user removal to filesystem deletion of the proxy directory followed by noise-calibrated KL-divergence verification against a non-personalized baseline, requiring no retraining at all (§2.4).
3. Additional empirical evidence across Phi-3.5-mini and Llama-3.1-8B with four domain experts and four synthetic user profiles, demonstrating measurable personalization, verified deletion (82–89% verification pass rate), and clean cross-user isolation (contamination ≤ 0.05 in point estimates) (§4).

Related Work. Research on machine unlearning has shown that surgical removal of user influence from model weights is fundamentally hard, whether through exact retraining [5] or efficient approximate deletion [13], approximate gradient manipulation [6, 14], LLM-specific methods such as model-generated knowledge replacement [15], NPO [7], or representation-level unlearning [8].
On the other hand, the infrastructure for composable adapter stacks has matured substantially: LoRA [16] and QLoRA [17] enable efficient adapter training, LoraHub [18] and task arithmetic [19, 20] demonstrate multi-adapter composition, and S-LoRA [21] enables serving thousands of concurrent adapters from a single base model while Punica [22] provides efficient multi-tenant batching via segmented gather-matrix-vector kernels. Activation steering methods, including Contrastive Activation Addition [23] and Inference-Time Intervention [24], show that behavioral modification without weight changes can be both effective and relatively lightweight. LLM personalization approaches, including LaMP [1], Personalized Soups [2], P-RLHF [3], and VPL [4], capture user preferences through various mechanisms. However, none of these approaches architecturally separates user state from shared weights, meaning that deletion would require either retraining or approximate weight modification, the same intractable operations the unlearning literature has already identified as problematic [5, 6]. Adding a deletion mechanism post hoc does not resolve this, as the entanglement occurs during training and no inference-time wrapper can undo it.

The infrastructure for composable, per-user adapter stacks exists, but what is largely missing is a deletion-aware composition design that prevents entanglement from occurring in the first place. SEA bridges this gap by ensuring that personalization state is architecturally separable from shared model components. In the rest of the paper, we go through the architecture and deletion protocol of SEA (§2), the experimental setup (§3), and the results (§4), before closing with a discussion of implications and limitations (§5).

2 Architecture

In this section, we present SEA’s three-layer composition architecture and its core design invariant.
The central claim is that user-specific information must be structurally separated from shared model components such that deletion becomes a deterministic filesystem operation rather than an approximate weight-modification procedure. We first state the invariant (§2.1), then describe the three composition layers (§2.2), detail the inference pipeline (§2.3), and lastly present the deletion protocol (§2.4).

2.1 Design Invariant

SEA maintains a strict architectural invariant that distinguishes it from approximate unlearning approaches and provides the basis for the deletion protocol:

Invariant 1 (Separation). All user-specific information resides in an isolated, deletable proxy artifact. Shared model components (the base model and expert adapters) contain no user-identifying information. Removing the proxy artifact is both necessary and sufficient for complete user data removal from the inference system.

Importantly, this invariant is structural as opposed to statistical. While approximate unlearning methods provide probabilistic guarantees that user influence has been reduced below some threshold, Invariant 1 guarantees that user influence is architecturally absent from shared components. In other words, the guarantee holds by construction: the system never permits user-specific gradients to flow into shared weights, so there is nothing to remove.

2.2 Three-Layer Composition

SEA combines three layers at inference time (Figure 1): a frozen base model that provides general capabilities, shared domain-expert LoRA adapters that provide specialized knowledge, and per-user proxy artifacts that provide deletable personalization.

Base Layer. The base layer is a frozen, quantized LLM that provides general language capabilities and is shared across all users. It contains no user-specific information by design, and the base weights are never modified during user interactions.
Periodic retraining on aggregated data with differential privacy guarantees (DP-SGD [25]) is a natural extension but is out of scope for this paper.

Expert Layer. A bank of k domain-specific LoRA adapters $E = \{E_1, \ldots, E_k\}$ provides specialized capabilities for distinct knowledge domains. Each expert $E_i = (B_i, A_i)$ is a low-rank adapter trained on curated domain corpora and shared across all users, with experts encoding domain knowledge only. At inference, experts combine via weighted linear combination (Equation 1):

$$W_{\text{expert}} = W_{\text{base}} + \sum_{i=1}^{k} w_i \cdot B_i A_i \tag{1}$$

where $w \in \Delta^k$ (the probability simplex) are mixing coefficients determined per-query by a lightweight router.

User Layer. Each user u has an isolated proxy artifact $P_u$, which is a self-contained directory comprising three complementary personalization mechanisms, each stored as serialized tensors:

1. Routing bias vector $b_u \in \mathbb{R}^k$: A learned vector of domain affinity scores derived from user interaction patterns that shifts expert selection toward user-preferred domains. The bias is applied as a scaled additive adjustment with clamp-and-normalize:

$$\tilde{w}_i = w_{0,i} + \lambda\, b_{u,i}, \qquad w_i = \frac{\max(\tilde{w}_i, 0)}{\sum_j \max(\tilde{w}_j, 0)} \tag{2}$$

where $w_0$ is the router’s base distribution and λ is a bias scale that prevents raw affinity values from overwhelming the base routing. If $\sum_j \max(\tilde{w}_j, 0) = 0$, the distribution falls back to uniform: $w_i = 1/k$.

2. Contrastive steering vectors $\{s_u^{\ell}\}_{\ell \in L}$ at a subset of intermediate layers L: Computed via Contrastive Activation Addition [23] from user preference pairs and injected additively into residual stream activations at inference:

$$h^{\ell} \leftarrow h^{\ell} + \gamma\, s_u^{\ell} \tag{3}$$

where γ is a steering strength multiplier. These vectors encode stylistic preferences (verbosity, formality, technical depth) without modifying any model weights, making them particularly well-suited for deletable personalization.

3. Personal LoRA adapter $L_u = (B_u, A_u)$: A low-rank adapter trained on user preference pairs.
This adapter captures user-specific knowledge and response patterns that routing bias and steering alone cannot express, resulting in additional personalization. The rank is deliberately kept small to bound proxy size and maintain a clear separation guarantee. During personal LoRA training via DPO, the base model and expert adapter weights are frozen, such that only the rank-4 personal LoRA parameters receive gradient updates, ensuring that user-specific gradients never flow into shared components.

The proxy is operationally independent of shared weights at inference time, as it is a self-contained, deletable artifact whose removal eliminates all user-specific influence from the system. However, note that the personal LoRA is conditioned on the shared model during DPO, where the base model serves as the reference, so the proxy’s content reflects shared model state even though no user information flows in the reverse direction.

2.3 Inference Pipeline

Given query q from user u, inference proceeds in five stages that combine the three layers into a single generation pass:

1. Route. A lightweight router classifies q into a domain distribution $w_0 \in \Delta^k$ over the k experts.
2. Bias. The user’s routing bias is applied via Equation 2, shifting expert selection toward the user’s preferred domains based on their accumulated interaction history.
3. Merge. The weighted expert adapters and personal LoRA are combined into a single merged adapter applied to the base model.
4. Steer. Forward hooks inject the user’s steering vectors $\gamma\, s_u^{\ell}$ at layers $\ell \in L$ via Equation 3, modifying activations without changing any weights.
5. Generate. Standard autoregressive decoding with the merged model produces the personalized output.

2.4 Deletion Protocol

SEA’s deletion protocol exploits the architectural invariant (Invariant 1) to reduce user removal to a simple filesystem operation with statistical verification.
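Returning briefly to the Bias stage of the pipeline, the clamp-and-normalize rule of Equation 2 can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `apply_routing_bias` and the use of plain Python lists are assumptions made here, while the default λ = 0.5 matches the value reported in the experimental setup.

```python
from typing import List


def apply_routing_bias(w0: List[float], b_u: List[float], lam: float = 0.5) -> List[float]:
    """Sketch of Equation 2: scaled additive bias, clamp at zero, renormalize.

    w0  : router's base distribution over the k experts
    b_u : per-user routing bias (domain affinity scores), length k
    lam : bias scale (lambda); hypothetical default taken from the paper's setup
    """
    if len(w0) != len(b_u):
        raise ValueError("w0 and b_u must both have length k")
    k = len(w0)
    # Scaled additive adjustment: w~_i = w0_i + lambda * b_{u,i}
    shifted = [w + lam * b for w, b in zip(w0, b_u)]
    # Clamp negative entries to zero
    clamped = [max(s, 0.0) for s in shifted]
    total = sum(clamped)
    # Fallback described in the text: uniform weights if everything was clamped away
    if total == 0.0:
        return [1.0 / k] * k
    return [c / total for c in clamped]
```

Under these assumptions, passing an all-zero bias returns the router's base distribution unchanged, which mirrors the omission mode used later for deletion verification: with no proxy loaded, routing is identical to the non-personalized path.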
The key challenge we address is establishing that removing a user’s proxy artifact fully eliminates all user-specific influence on model behavior. To delete user u, the protocol proceeds in three steps:

1. Verify. On held-out domain-generic prompts (not user-specific, to avoid circular verification): generate outputs in omission mode (proxy not loaded) and compare token-frequency distributions against a cached non-personalized baseline (base model + experts, no proxy) via KL divergence. Verification uses a noise-calibrated threshold: the inter-sample KL divergence among unpersonalized generations provides an empirical noise floor $\hat{\sigma}_{\mathrm{KL}}$ for stochastic decoding, and bypass is confirmed when

$$D_{\mathrm{KL}}\left(p_{\text{unpers}} \,\|\, p_{\text{baseline}}\right) \le \max\left(2\,\hat{\sigma}_{\mathrm{KL}},\ \tau_{\min}\right) \tag{4}$$

where $\tau_{\min} = 0.15$ nats is a hard floor that prevents unreasonably tight thresholds on low-variance queries. This makes verification self-calibrating: queries with high stochastic variance receive a proportionally wider acceptance band, eliminating false failures from sampling noise without weakening the guarantee for stable queries.
2. Delete. Secure filesystem removal of the proxy directory $P_u$ (zero-overwrite).
3. Audit. Log the deletion event, verification result, and timestamp for compliance trail.

The architectural separation produces a direct payoff here. Without the proxy, the system’s behavior is structurally equivalent in expectation to the non-personalized baseline. The same code paths execute with the same weights, with the proxy simply not loaded. Verification exploits this architectural equivalence: omitting the proxy at inference time is functionally identical to deleting it, so the verify step confirms deletion behavior before the irreversible delete step. The KL-divergence verification is therefore a sanity check confirming the architectural guarantee, not the privacy guarantee itself.
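The Verify step and the acceptance rule of Equation 4 can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: whitespace tokenization, additive smoothing with a small `eps`, and the helper names (`token_distribution`, `kl_divergence`, `verify_bypass`) are all choices made here for the example. The noise floor is taken as an input rather than estimated, because the text does not specify how the inter-sample KL values are aggregated.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def token_distribution(texts: List[str], vocab: List[str], eps: float = 1e-6) -> Dict[str, float]:
    """Smoothed token-frequency distribution over a fixed vocabulary.

    Whitespace tokenization and eps-smoothing are assumptions of this sketch.
    """
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts[v] for v in vocab) + eps * len(vocab)
    return {v: (counts[v] + eps) / total for v in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D_KL(p || q) in nats; p and q must share the same keys."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def verify_bypass(
    unpers_texts: List[str],
    baseline_texts: List[str],
    sigma_kl: float,
    multiplier: float = 2.0,
    tau_min: float = 0.15,
) -> Tuple[bool, float, float]:
    """Equation 4: pass when KL(unpers || baseline) <= max(multiplier * sigma_kl, tau_min)."""
    vocab = sorted({tok for text in unpers_texts + baseline_texts for tok in text.split()})
    p = token_distribution(unpers_texts, vocab)
    q = token_distribution(baseline_texts, vocab)
    kl = kl_divergence(p, q)
    threshold = max(multiplier * sigma_kl, tau_min)
    return kl <= threshold, kl, threshold
```

With `sigma_kl = 0.15` and the default multiplier, the threshold evaluates to 0.30 nats; with a smaller noise floor such as 0.05, the hard floor of 0.15 nats takes over. A real deployment would use the model's tokenizer rather than whitespace splitting.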
The guarantee comes from the invariant: user information exists only in the proxy, and the proxy has been deleted. Cached baselines must be refreshed whenever shared components (base model or expert adapters) are updated; if a new base model is deployed, personal LoRA adapters must be regenerated.

3 Experimental Setup

We evaluate SEA across two base models, four domain experts, and four synthetic user profiles, targeting three evaluation dimensions: personalization quality, deletion completeness, and cross-user isolation. We first describe the experimental configuration and then present the results.

Models. We use two base models: Phi-3.5-mini-instruct (3.8B parameters) and Llama-3.1-8B-Instruct, both loaded in 4-bit NormalFloat (NF4) quantization via QLoRA [17]. These models span a range of parameter counts to test whether the architectural properties hold across model scales.

Expert Adapters. Four domain experts (k = 4) are trained via supervised fine-tuning with TRL [26], all using rank 32, scaling factor α = 64, applied to all attention projections (query, key, value, output): Security (Trendyol + OWASP-NVD, ∼76K examples), Code (CodeAlpaca + supplementary code instruction sets, capped at ∼50K examples), Data (synthetic text-to-SQL), and General (Alpaca, ∼52K examples). These experts are shared across all users and contain domain knowledge only.

Synthetic User Profiles. Four user profiles (security_expert, casual_coder, data_analyst, general_user) are each defined by domain affinity weights and positive/negative style traits. Proxy artifacts are generated through three mechanisms: routing bias via EMA from simulated interaction patterns (λ = 0.5), steering vectors via CAA from trait-aligned preference pairs at layers L = {12, 16, 20} with strength γ = 1.0, and personal LoRA (rank 4) via DPO [27] on preference pairs, using the base model as the DPO reference. The total proxy size is approximately 2–5 MB per user.

Routing and Composition.
The expert router uses zero-shot entailment-based classification [28] with BART-MNLI [29] and a keyword-based fallback (softmax temperature T = 2.0 for the fallback path). Adapter merging uses PEFT’s add_weighted_adapter with combination_type="linear" and a load-once lifecycle with deferred cleanup.

Evaluation Protocol. We conduct 70 evaluation runs per model (140 total) across 20 evaluation prompts (5 per domain).¹ Cached baselines ensure consistency across runs, and 95% confidence intervals are reported via the t-distribution.

¹ Each evaluation run generates 7 bypass observations (a subset of query-user combinations selected from the held-out verification prompts). Phi-3.5-mini completed 68 runs (476 observations); Llama-3.1-8B completed 70 runs (490 observations). Two early Phi-3.5-mini runs were configuration tests that produced no bypass data.

Style trait match. Style trait match is defined as the number of target style keywords detected in a personalized generation. Each user profile specifies a set of positive style traits as keywords (e.g., terms associated with verbosity, technical depth, or domain-specific vocabulary), and the metric counts how many appear in each output. The reported value is the mean count across all prompt-user-run observations (1,904 for Phi-3.5-mini, 1,960 for Llama-3.1-8B). The scale is profile-dependent: the security expert profile achieves a mean of 3.01 (Phi) and 1.02 (Llama), while the general user profile averages 0.21 and 0.28 respectively. Keyword presence is a necessary but not sufficient indicator of style alignment, as a response containing a target keyword may use it in a non-stylistic context. The metric should therefore be understood as a lower bound on non-match rather than a calibrated measure of style fidelity.

4 Results

We organize results around three claims that jointly aim to validate the architectural design. First, we show that the proxy achieves measurable personalization (§4.1); second, that proxy removal restores baseline behavior (§4.2); and third, that no cross-user leakage occurs (§4.3). Together, these claims address the central question of whether architectural separation can simultaneously deliver personalization, deletability, and isolation.

4.1 Personalization

The proxy measurably adapts model outputs without modifying shared weights. Table 1 shows three distinct findings. First, routing bias successfully shifts expert selection toward each user’s preferred domain (weight shift 0.052–0.088). Second, Jaccard similarity to the non-personalized baseline is low (0.236–0.316), indicating substantial output differentiation. Third, style trait matching is stronger for Phi-3.5-mini (1.71) than Llama-3.1-8B (0.63), an observed difference between these two specific models that should not be attributed to model size given N=2 and multiple confounds.

Table 1: Personalization metrics across both base models. Weight shift measures the routing bias effect on expert selection. Jaccard similarity to baseline measures output overlap (lower = more personalized). Style trait match measures alignment with target user traits.

| Metric | Phi-3.5-mini | Llama-3.1-8B |
|---|---|---|
| Weight shift | 0.052 ± 0.002 | 0.088 ± 0.003 |
| Jaccard similarity | 0.236 ± 0.005 | 0.316 ± 0.005 |
| Style trait match | 1.710 ± 0.101 | 0.629 ± 0.040 |

The three-mechanism proxy thus achieves moderate-to-strong personalization for Phi-3.5-mini and moderate personalization for Llama-3.1-8B, without touching shared weights. The personalization is present but deliberately moderate in scope, a consequence of the rank-4 constraint on the personal LoRA, which is the price of deletability and a central trade-off of our design. More expressive adapters would capture richer user preferences but would require more parameters, increasing proxy size and reducing the clarity of the separation guarantee. The security expert profile produces the strongest personalization signal (mean style trait match 3.01 on Phi-3.5-mini, with individual observations reaching 12), yet bypass verification for this profile’s queries passes at rates comparable to lower-personalization profiles. The architecture does not trade deletion reliability for personalization intensity.

Figure 2: Distribution of unpersonalized-to-baseline KL-divergence scores across all prompt-user combinations for both base models (476 observations for Phi-3.5-mini, 490 for Llama-3.1-8B). Dashed lines mark the per-model mean. Verification uses a noise-calibrated per-query threshold (Equation 4) rather than a fixed cutoff, so no single threshold line is shown. The KL distribution is bimodal rather than gradual: verified observations cluster in [0.00, 0.30] and failures in [0.30, 0.94], with no ambiguous intermediate population. This sharp boundary is consistent with the structural guarantee, as proxy removal either fully eliminates user influence (the common case) or generation variance produces an outlier sample (the failure case), with no evidence of partial leakage.

4.2 Separability

Next, we find that proxy removal restores baseline behavior, which confirms the architectural invariant. Table 2 shows two main results. First, mean KL divergence between unpersonalized and baseline outputs is approximately 0.21 nats for both models. Second, the 82–89% noise-calibrated verification pass rate indicates that the vast majority of prompt-user combinations produce outputs statistically indistinguishable from the non-personalized baseline after proxy removal.

Table 2: Deletion verification metrics. Verification pass rate is the fraction of prompt-user combinations where the unpersonalized-to-baseline KL divergence falls within the noise-calibrated threshold (Equation 4).

| Metric | Phi-3.5-mini | Llama-3.1-8B |
|---|---|---|
| Verified pass rate | 0.819 ± 0.035 | 0.892 ± 0.028 |
| KL divergence | 0.217 ± 0.012 | 0.212 ± 0.006 |

Figure 2 shows the distribution of KL-divergence scores across all prompt-user combinations. Importantly, the deletion itself is deterministic and complete, as the proxy files are removed and the shared weights are untouched. The KL verification is a separate measurement that compares stochastic outputs from finite-length generations. By calibrating the acceptance threshold against the empirical inter-sample noise floor per query, the verification procedure accounts for the inherent variance of stochastic decoding: queries that naturally produce high output variance receive a proportionally wider threshold, while stable queries are held to a tighter standard. The 11–18% of cases that still exceed the noise-calibrated threshold likely reflect edge cases where generation variance is unusually high relative to the measured noise floor, not residual user influence in the weights.² The deletion verification thus provides empirical confirmation of the architectural guarantee, though the guarantee itself rests on the structural invariant rather than the verification metric.

² A small number of Phi-3.5-mini observations produced degenerate (near-empty) outputs due to an inference configuration issue that did not affect Llama-3.1-8B runs. These observations yield artificially low KL values and are retained in the reported statistics for transparency. Filtering them would increase the mean KL slightly and marginally reduce the reported pass rate for Phi-3.5-mini.

Threshold sensitivity. The verification pass rate reported above depends on the $2\hat{\sigma}_{\mathrm{KL}}$ multiplier in Equation 4. Table 3 shows how the pass rate varies across multiplier settings. The hard floor $\tau_{\min}$ is inert across the tested range [0.10, 0.25] because the empirical noise floor $\hat{\sigma}_{\mathrm{KL}} \approx 0.15$ nats is stable across all query-user pairs (range [0.146, 0.157]), making the multiplier the sole active control. The floor would activate only if $\hat{\sigma}_{\mathrm{KL}}$ dropped below $\tau_{\min}/\text{mult}$ (approximately 0.075 nats at the paper’s 2σ, $\tau_{\min} = 0.15$ configuration), which does not occur in this data. A single multiplier parameter therefore suffices for threshold calibration. This cross-query, cross-user, cross-model consistency was not guaranteed by the architecture and constitutes an empirical finding: the stochastic decoding noise floor is a property of the generation process, not of the personalization mechanism, which is what a structurally clean separation should produce.

Table 3: Verification pass rate by σ multiplier. The chosen 2σ configuration (bold) sits in the moderate region of a monotonic curve. Stricter deployments could tighten to 1.5σ at the cost of more false failures; those prioritizing operational stability could relax to 2.5σ.

| Multiplier | Phi-3.5-mini (n=476) | Llama-3.1-8B (n=490) |
|---|---|---|
| 1.0σ | 0.239 | 0.167 |
| 1.5σ | 0.513 | 0.600 |
| **2.0σ** | **0.819** | **0.892** |
| 2.5σ | 0.929 | 0.984 |
| 3.0σ | 0.971 | 0.994 |

Pass rates increase monotonically with no discontinuities. The deletion guarantee is independent of these parameters, as this analysis characterizes verification sensitivity as opposed to deletion completeness. The KL distributions across all observations have mean 0.218 (Phi) and 0.213 (Llama), with standard deviations of 0.132 and 0.070 respectively. Phi-3.5-mini has a heavier right tail (95th percentile 0.402 vs 0.340), which explains its lower pass rate at the same threshold.

4.3 Isolation

Moreover, our results suggest that no cross-user leakage occurs between proxies. Table 4 shows very low levels of contamination: 0.009 and 0.049 for Phi-3.5-mini and Llama-3.1-8B respectively, suggesting that one user’s proxy does not influence another user’s outputs. Cross-user output similarity is moderate (0.27–0.35) but expected, as users share the same base model and expert adapters.
This similarity is structural and not leakage, reflecting the shared foundation rather than cross-user information flow.

Table 4: Cross-user isolation metrics. Contamination measures excess inter-user similarity beyond the shared baseline.

| Metric | Phi-3.5-mini | Llama-3.1-8B |
|---|---|---|
| Contamination | 0.009 ± 0.002 | 0.049 ± 0.005 |
| Cross-user similarity | 0.271 ± 0.010 | 0.351 ± 0.007 |

Since proxies exist as isolated filesystem artifacts with no shared mutable state, this result follows from the architecture. However, we include it as empirical verification that the isolation invariant holds in practice under realistic generation conditions.

Summary. Taken together, the three claims are supported across both models with some between-model heterogeneity: Phi-3.5-mini shows stronger personalization and isolation, while Llama-3.1-8B shows stronger deletion verification rates. Llama-3.1-8B achieves a higher verification pass rate (89.2% vs 81.9%) with a substantially tighter KL distribution (std 0.070 vs 0.132), indicating that the deletion properties of the architecture do not degrade at the larger model scale. This shows that architectural separation achieves personalization with verified deletion and clean isolation, while the tradeoff between personalization expressiveness and deletability is explicit. The proxy’s tunable parameters (personal LoRA rank, steering strength γ, routing bias scale λ) define a configuration space that could be explored to characterize this tradeoff, though the current evaluation uses a single configuration throughout.

5 Discussion

Contribution. SEA sidesteps the machine unlearning problem rather than solving it. Machine unlearning is fundamentally hard because it attempts to undo an irreversible operation, the entanglement of user-specific gradients with shared weights. Even the most promising methods either require retraining or cannot guarantee complete removal.
Architectural separation prevents entanglement in the first place, converting an intractable algorithmic problem into a tractable engineering one. The core tradeoff is explicit: a low-rank personal LoRA is less expressive than full fine-tuning, but the three-mechanism proxy compensates for this by providing complementary personalization channels (routing bias for domain preferences, steering vectors for stylistic preferences, and personal LoRA for residual patterns). The architecture’s parameters (personal LoRA rank, steering strength γ, routing bias scale λ) define a per-deployment configuration space in which personalization fidelity can be traded against proxy size and separation clarity. Characterizing this tradeoff empirically, for instance by comparing rank-4 against rank-8 or rank-16 personal LoRA under the same deletion protocol, remains future work. A notable consequence of the separation invariant is that shared model components (the base model and expert adapters) can be released or audited without risk of user data exposure, since no user-specific information enters shared weights by construction. Moreover, it is important to note that our approach requires designing the system with deletion in mind from the start and cannot be retrofitted to existing models where user data has already been absorbed into weights.

Findings. Our evaluation across two base models shows three main results. First, the personal proxy produces measurable personalization, with users receiving responses that reflect their domain preferences and stylistic tendencies, with consistent shifts in routing weights and style trait alignment. Second, deletion verification works: when a user’s proxy is removed, the system’s outputs return to baseline behavior in 82–89% of test cases, with the remaining failures attributable to normal generation randomness rather than lingering user influence (the architecture structurally guarantees that no trace of the user persists).
Third, user isolation holds: one user’s proxy does not detectably influence another user’s outputs (contamination ≤ 0.05 in point estimates). These results come with the inherent tradeoff that deletability limits how deeply the system can personalize, since user data must remain separable rather than being absorbed into shared model weights. We view this as a reasonable price for deployments where data deletion rights must be honored.

Limitations and future work. Several limitations constrain the current evaluation. First, the synthetic user profiles used here are placeholders for real-world preferences, and the four profiles are aligned to four distinct domains, representing the easiest possible configuration for isolation testing; overlapping-domain profiles (e.g., two security-focused users with different stylistic preferences) would provide a harder and more realistic test of cross-user isolation, though the structural separation guarantee is unaffected by profile design. The metrics (Jaccard similarity, keyword matching) were chosen to demonstrate the proof of concept and capture basic textual overlap rather than subjective personalization quality as perceived by users. Second, the evaluation at 3.8–8B parameter scale is not intended to generalize to larger models, though the architectural invariant (separation of user data into a deletable proxy) holds by construction regardless of model size. Third, the current evaluation does not include an ablation study isolating the contribution of each proxy component (routing bias, steering vectors, personal LoRA individually); such an ablation would clarify which mechanisms drive personalization and deletion properties and is a natural next step.
Additionally, while architectural separation eliminates the risk of user data being entangled in shared weights, the proxy artifact concentrates user behavioral information into a portable representation, creating an attack surface where an attacker need only exfiltrate a single directory rather than extract user influence from distributed weights. For open-source base models, including both models evaluated in this paper, an exfiltrated proxy could be loaded directly against a local copy. Non-transferability of exfiltrated proxies is therefore a hypothesis requiring empirical validation through cross-model transfer experiments, not a default assumption. Securing proxy artifacts through encryption at rest, access controls, and retention policies is necessary for end-to-end privacy and should be treated as a deployment requirement. Tractable deletion is also a dual-use capability, since the same mechanism that enables personal data removal can just as easily be applied to remove other content or proprietary knowledge from a model integration, with implications for compliance auditing that merit careful analysis. Lastly, expert adapter training may not have converged, as loss plateaus were not reached during the experiments, suggesting that additional training could improve adapter quality.

The most immediate extension is applying DP-SGD to the gradient aggregation stage when updating shared expert adapters from user interaction data, which the architecture already supports by construction. Three practical constraints govern this extension: the computational overhead of per-sample gradient clipping, accelerated privacy budget exhaustion under sequential composition, and utility degradation in low-ε regimes. Aggregating LoRA updates across a large user population prior to noise injection could provide privacy amplification, since individual contributions to the aggregate gradient would be attenuated by population scale.
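For readers less familiar with DP-SGD, the clip-then-noise aggregation it would apply to per-user adapter updates can be sketched as below. This follows the standard recipe from Abadi et al. [25] (per-sample L2 clipping, Gaussian noise scaled by the clipping norm, then averaging) rather than anything evaluated in this paper; the function names and toy vectors are hypothetical, and no privacy accounting is performed.

```python
import math
import random


def clip(update: list[float], max_norm: float) -> list[float]:
    """Scale an update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_aggregate(
    updates: list[list[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> list[float]:
    """Clip each per-user update, sum, add Gaussian noise, then average."""
    clipped = [clip(u, max_norm) for u in updates]
    dim = len(updates[0])
    summed = [sum(u[j] for u in clipped) for j in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std is tied to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(updates) for v in noisy]


# Two hypothetical per-user updates to a two-parameter adapter.
user_updates = [[3.0, 4.0], [0.3, 0.4]]
noisy_mean = dp_aggregate(user_updates, max_norm=1.0, noise_multiplier=1.1,
                          rng=random.Random(0))
```

The first update has norm 5 and is scaled down to norm 1, while the second (norm 0.5) passes through unchanged, which is what bounds any single user's contribution to the aggregate.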
However, formal privacy amplification results depend on specific mathematical conditions, including Poisson subsampling of participants, bounded per-sample sensitivity, and particular composition theorems [30, 31], none of which have been verified for this architecture. Whether SEA’s gradient aggregation satisfies these conditions, and whether the resulting ε-utility tradeoff is favorable in practice, are open empirical questions that require measuring privacy loss under varying ε and population-size configurations through empirical attacks (model inversion, membership inference) against the updated shared model. Beyond DP-SGD, scaling to production multi-tenant workloads via adapter-serving frameworks such as S-LoRA and Punica, validating the privacy guarantees through longitudinal studies with real users and adversarial probes, and characterizing the tradeoff between personalization depth and proxy size are all natural next steps.

References

[1] Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2304.11406.
[2] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2310.11564.
[3] Xinyu Li, Ruiyang Zhou, Zachary C. Lipton, and Leqi Liu. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024. URL https://arxiv.org/abs/2402.05133.
[4] Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning.
In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024. URL https://arxiv.org/abs/2408.10075.
[5] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), 2021. URL https://arxiv.org/abs/1912.03817.
[6] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9304–9312, 2020. URL https://arxiv.org/abs/1911.04933.
[7] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In Conference on Language Modeling (COLM 2024), 2024. URL https://arxiv.org/abs/2404.05868.
[8] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, and Dan Hendrycks. The WMDP benchmark: Measuring and reducing malicious use with unlearning. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
URL https://arxiv.org/abs/2403.03218.
[9] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (CCS ’15), 2015. doi: 10.1145/2810103.2813677.
[10] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium, 2021. URL https://arxiv.org/abs/2012.07805.
[11] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023. URL https://arxiv.org/abs/2311.17035.
[12] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017. doi: 10.1109/SP.2017.41. URL https://arxiv.org/abs/1610.05820.
[13] Antonio Ginart, Melody Y. Guan, Gregory Valiant, and James Zou. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://arxiv.org/abs/1907.05012.
[14] Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524, 2021. URL https://arxiv.org/abs/2010.10981.
[15] Ronen Eldan and Mark Russinovich. Who’s Harry Potter? Approximate unlearning in LLMs. In International Conference on Learning Representations (ICLR 2024), 2024. URL https://arxiv.org/abs/2310.02238.
[16] Edward J.
Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR 2022), 2022. URL https://arxiv.org/abs/2106.09685.
[17] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://arxiv.org/abs/2305.14314.
[18] Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. In Conference on Language Modeling (COLM 2024), 2024. URL https://arxiv.org/abs/2307.13269.
[19] Jinghan Zhang, Shiqi Chen, Junteng Liu, and Junxian He. Composing parameter-efficient modules with arithmetic operations. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.14870.
[20] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022. URL https://arxiv.org/abs/2212.04089.
[21] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-LoRA: Serving thousands of concurrent LoRA adapters. In Proceedings of Machine Learning and Systems 6 (MLSys 2024), 2024. URL https://arxiv.org/abs/2311.03285.
[22] Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant LoRA serving. In Proceedings of Machine Learning and Systems 6 (MLSys 2024), 2024. URL https://arxiv.org/abs/2310.18547.
[23] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2312.06681.
[24] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://arxiv.org/abs/2306.03341.
[25] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16), 2016. URL https://arxiv.org/abs/1607.00133.
[26] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning, 2020. URL https://github.com/huggingface/trl.
[27] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://arxiv.org/abs/2305.18290.
[28] Wenpeng Yin, Jamaal Hay, and Dan Roth. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, 2019. URL https://arxiv.org/abs/1909.00161.
[29] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), 2020. URL https://arxiv.org/abs/1910.13461.
[30] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in Neural Information Processing Systems, 31, 2018.
[31] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

Chunks

Chunk 0

--- Page 1 ---
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
Chris Schneider¹, Philipp Schoenegger¹, Ben Bariach¹
¹Microsoft AI

Abstract

Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-layer architecture that decouples personal data from shared weights by combining a static base model, composable domain-expert LoRA adapters that shape behavior without imparting user data, and a per-user proxy artifact whose deletion constitutes deterministic unlearning.

Chunk 1

Evaluation on Phi-3.5-mini and Llama-3.1-8B confirms per-user differentiation in which personal data influences outputs while remaining isolated, verified by a return to baseline after proxy removal (KL ≈ 0.21 nats, 82–89% verification pass rate) and near-zero cross-user contamination. Because user-specific information never enters shared weights, the architecture mitigates model inversion, membership inference, and training-data extraction against shared model components by construction.

Chunk 2

The approach converts machine unlearning from an intractable weight-editing problem into a deterministic deletion operation that preserves personalization alongside privacy-enhancing guarantees and is compatible with differentially private stochastic gradient descent (DP-SGD) for privacy-preserving shared model improvement.

1 Introduction

As LLM personalization becomes widely used, a growing body of work has demonstrated that user preferences can be captured through retrieval-augmented profiles [1], post-hoc parameter merging [2], and personalized reward learning [3, 4].

Chunk 3

While some of these approaches operate at the prompt level (e.g., retrieval-augmented profiles), many encode user-specific information into model weights θ via fine-tuning, producing models whose parameters entangle contributions from many users. When a user later requests deletion, it is unclear how to remove their data from a model whose weights have been shaped by thousands of users simultaneously.

Chunk 4

This suggests a fundamental tension between personalization and data deletion in modern LLMs. When user preferences are distributed across shared weights, deletion requires identifying and removing each user’s contribution, a problem that has been shown to be computationally intractable without full retraining [5].

Chunk 5

Exact unlearning methods like SISA [5] require maintaining independently trained model shards, while approximate methods offer no formal removal guarantees [6]. LLM-specific approaches face additional difficulties: Gradient ascent can cause catastrophic collapse in certain unlearning configurations [7], and representation-level methods like RMU [8] still modify shared weights.

Chunk 6

This problem is compounded by extraction attacks, including model inversion [9], training data extraction [10, 11], and membership inference [12], which can recover private information from weight-encoded personalization, making it a privacy issue even absent deletion requests. To illustrate this, consider a personalized assistant that has learned a user’s medical vocabulary preferences through fine-tuning.

Chunk 7

Even after the user requests deletion, membership inference attacks

[Figure 1 diagram. Left panel, SHARED COMPONENTS (no user data): Base Model θ (shared); Expert Router; experts E1 Security, E2 Code, E3 Data, E4 General; merge $W_{\text{base}} + \sum_i w_i B_i A_i$. Right panel, USER PROXY $P_u$ (all user-specific data; removable, file deletion = full erasure): routing bias $b_u \in \mathbb{R}^k$ (domain preference scores); personal LoRA $L_u = (B_u, A_u)$ (user-specific weight residuals); steering vectors $\{s_u^{\ell}\}_{\ell \in L}$ (style and preference modifiers). Flow: Query q, merge and inject, Output.]

Figure 1: Separable Expert Architecture. Shared components (left) contain no user-specific information: a frozen base model, four domain-expert LoRA adapters selected by a per-query router, and a weighted merge.

Chunk 8

The per-user proxy (right, dashed red border) holds three deletable personalization mechanisms (routing bias, personal LoRA, and contrastive steering vectors) that compose with shared components at inference via cross-boundary arrows. The vertical dashed line marks the separation boundary, where deleting the proxy directory removes all user-specific influence with zero retraining.

Chunk 9

arXiv:2604.21571v1 [cs.AI] 23 Apr 2026

could reveal whether that user’s data was part of the training set, while training data extraction could recover specific preference examples, all because the user’s influence remains distributed across millions of shared parameters.

To address this issue, we propose the Separable Expert Architecture (SEA), a design that aims to satisfy both personalization and deletability simultaneously.

Chunk 10

The core contribution is that if user-specific information never enters shared weights, “unlearning” is essentially just deletion. Rather than trying to surgically undo weight entanglement after the fact, this approach prevents entanglement from occurring in the first place.

Chunk 11

In other words, this requires an architecture where personalization is compositional, i.e., assembled at inference time from separable, deletable components, rather than absorptive, where preferences are baked into shared parameters. Contributions.

Chunk 12

We make three contributions:
1. A three-layer composition architecture where a base model (frozen, shared) is augmented by domain-expert LoRA adapters (shared, dynamically weighted by a query router) and per-user proxy artifacts, which are isolated directories containing a routing bias vector, contrastive steering vectors, and a personal LoRA adapter (∼2–5 MB per user in our configuration).

Chunk 13

The architecture maintains a strict invariant: All user-specific information resides in a deletable artifact that never enters shared weights (§2). 2.

Chunk 14

A deletion protocol that reduces user removal to filesystem deletion of the proxy directory combined with noise-calibrated KL-divergence verification against a non-personalized baseline, requiring no retraining at all (§2.4). 3.

Chunk 15

Additional empirical evidence across Phi-3.5-mini and Llama-3.1-8B with four domain experts and four synthetic user profiles, demonstrating measurable personalization, verified deletion (82–89% verification pass rate), and clean cross-user isolation (contamination ≤ 0.05 in point estimates) (§4). Related Work.

Chunk 16

Research on machine unlearning has shown that surgical removal of user influence from model weights is fundamentally hard, whether through exact retraining [5], efficient approximate deletion [13], approximate gradient manipulation [6, 14], or LLM-specific methods such as model-generated knowledge replacement [15], NPO [7], and representation-level unlearning [8]. On the other hand, the infrastructure for composable adapter stacks has matured substantially: LoRA [16] and QLoRA [17] enable efficient adapter training, LoraHub [18] and task arithmetic [19, 20] demonstrate multi-adapter composition, and S-LoRA [21] enables serving thousands of concurrent adapters from a single base model while Punica [22] provides efficient multi-tenant batching via segmented gather-matrix-vector kernels.

Chunk 17

Activation steering methods, including Contrastive Activation Addition [23] and Inference-Time Intervention [24], show that behavioral modification without weight changes can be both effective and relatively lightweight. LLM personalization approaches, including LaMP [1], Personalized Soups [2], P-RLHF [3], and VPL [4], capture user preferences through various mechanisms.

Chunk 18

However, none of these approaches architecturally separates user state from shared weights, meaning that deletion would require either retraining or approximate weight modification, the same intractable operations the unlearning literature has already identified as problematic [5, 6]. Adding a deletion mechanism post hoc does not resolve this, as the entanglement occurs during training and no inference-time wrapper can undo it.

Chunk 19

The infrastructure for composable, per-user adapter stacks exists, but what is largely missing is a deletion-aware composition design that prevents entanglement from occurring in the first place. SEA bridges this gap by ensuring that personalization state is architecturally separable from shared model components.

Chunk 20

In the rest of the paper, we go through the architecture and deletion protocol of SEA (§2), the experimental setup (§3), and the results (§4), before closing with a discussion of implications and limitations (§5).

2 Architecture

In this section, we present SEA’s three-layer composition architecture and its core design invariant.

Chunk 21

The central claim is that user-specific information must be structurally separated from shared model components such that deletion becomes a deterministic filesystem operation rather than an approximate weight-modification procedure. We first state the invariant (§2.1), then describe the three composition layers (§2.2), detail the inference pipeline (§2.3), and lastly present the deletion protocol (§2.4).

Chunk 22

2.1 Design Invariant

SEA maintains a strict architectural invariant that distinguishes it from approximate unlearning approaches and provides the basis for the deletion protocol:

Invariant 1 (Separation). All user-specific information resides in an isolated, deletable proxy artifact.

Chunk 23

Shared model components (the base model and expert adapters) contain no user-identifying information. Removing the proxy artifact is both necessary and sufficient for complete user data removal from the inference system.

Chunk 24

Importantly, this invariant is structural as opposed to statistical. While approximate unlearning methods provide probabilistic guarantees that user influence has been reduced below some threshold, Invariant 1 guarantees that user influence is architecturally absent from shared components.

Chunk 25

In other words, the guarantee holds by construction, as the system never permits user-specific gradients to flow into shared weights, so there is nothing to remove.

2.2 Three-Layer Composition

SEA combines three layers at inference time (Figure 1): a frozen base model that provides general capabilities, shared domain-expert LoRA adapters that provide specialized knowledge, and per-user proxy artifacts that provide deletable personalization.

Chunk 26

Base Layer. The base layer is a frozen, quantized LLM that provides general language capabilities and is shared across all users.

Chunk 27

It contains no user-specific information by design, and the base weights are never modified during user interactions. Periodic retraining on aggregated data with differential privacy guarantees (DP-SGD [25]) is a natural extension but is out of scope for this paper.

Chunk 28

Expert Layer. A bank of k domain-specific LoRA adapters $E = \{E_1, \ldots, E_k\}$ provides specialized capabilities for distinct knowledge domains. Each expert $E_i = (B_i, A_i)$ is a low-rank adapter trained on curated domain corpora and shared across all users, with experts encoding domain knowledge only.

Chunk 31

At inference, experts combine via weighted linear combination (Equation 1):

$$W_{\text{expert}} = W_{\text{base}} + \sum_{i=1}^{k} w_i \cdot B_i A_i \tag{1}$$

where $w \in \Delta^k$ (the probability simplex) are mixing coefficients determined per-query by a lightweight router.

User Layer.

Chunk 32

Each user u has an isolated proxy artifact $P_u$, which is a self-contained directory comprising three complementary personalization mechanisms, each stored as serialized tensors:
1. Routing bias vector $b_u \in \mathbb{R}^k$: A learned vector of domain affinity scores derived from user interaction patterns that shifts expert selection toward user-preferred domains.

Chunk 33

The bias is applied as a scaled additive adjustment with clamp-and-normalize:

$$\tilde{w}_i = w_{0,i} + \lambda\, b_{u,i}, \qquad w_i = \frac{\max(\tilde{w}_i, 0)}{\sum_j \max(\tilde{w}_j, 0)} \tag{2}$$

where $w_0$ is the router’s base distribution and $\lambda$ is a bias scale that prevents raw affinity values from overwhelming the base routing. If $\sum_j \max(\tilde{w}_j, 0) = 0$, the distribution falls back to uniform: $w_i = 1/k$.

Chunk 34

2. Contrastive steering vectors $\{s_u^{\ell}\}_{\ell \in L}$ at a subset of intermediate layers $L$: Computed via Contrastive Activation Addition [23] from user preference pairs and injected additively into residual stream activations at inference:

$$h^{\ell} \leftarrow h^{\ell} + \gamma\, s_u^{\ell} \tag{3}$$

where $\gamma$ is a steering strength multiplier.

Chunk 35

These vectors encode stylistic preferences (verbosity, formality, technical depth) without modifying any model weights, making them particularly well-suited for deletable personalization. 3.

Chunk 36

Personal LoRA adapter $L_u = (B_u, A_u)$: A low-rank adapter trained on user preference pairs. This adapter captures user-specific knowledge and response patterns that routing bias and steering alone cannot express, resulting in additional personalization.

Chunk 37

The rank is deliberately kept small to bound proxy size and maintain a clear separation guarantee. During personal LoRA training via DPO, the base model and expert adapter weights are frozen, such that only the rank-4 personal LoRA parameters receive gradient updates, ensuring that user-specific gradients never flow into shared components.
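The clamp-and-normalize rule of Equation 2 is easy to get subtly wrong at the edges, so a minimal Python sketch may help. It assumes the router distribution w0 and the user bias b_u arrive as plain lists of length k; the function name `apply_routing_bias` and the example bias values are illustrative and not taken from the paper's implementation, while λ = 0.5 matches the value reported in the experimental setup.

```python
from typing import Sequence


def apply_routing_bias(
    w0: Sequence[float], b_u: Sequence[float], lam: float = 0.5
) -> list[float]:
    """Equation 2: scaled additive bias, clamp at zero, renormalize."""
    if len(w0) != len(b_u):
        raise ValueError("w0 and b_u need one entry per expert")
    k = len(w0)
    clamped = [max(w + lam * b, 0.0) for w, b in zip(w0, b_u)]
    total = sum(clamped)
    if total == 0.0:
        # Degenerate case described in the text: fall back to uniform routing.
        return [1.0 / k] * k
    return [c / total for c in clamped]


# Router output over (Security, Code, Data, General) and a hypothetical user bias.
w0 = [0.4, 0.3, 0.2, 0.1]
b_u = [0.6, -0.8, 0.0, 0.0]
w = apply_routing_bias(w0, b_u)
```

In this example the negative Code affinity is clamped to zero and the remaining mass is renormalized, so the Security weight rises from 0.4 to roughly 0.7. With no proxy loaded, the router's w0 would be used unchanged.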

Chunk 38

The proxy is operationally independent of shared weights at inference time, as it is a self-contained, deletable artifact whose removal eliminates all user-specific influence from the system. However, note that the personal LoRA is conditioned on the shared model during DPO, where the base model serves as the reference, so the proxy’s content reflects shared model state even though no user information flows in the reverse direction.
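To illustrate why omitting the proxy returns the system to the shared configuration, the following sketch applies Equation 1 and Equation 3 with and without user components. It is a toy pure-Python version on 2×2 matrices, not the PEFT-based merge used in the experiments; the helper names are hypothetical, and adding the personal LoRA with weight 1.0 is an assumption, since the text does not state how it is weighted in the merge.

```python
from typing import Optional

Matrix = list[list[float]]
LoRA = tuple[Matrix, Matrix]  # (B, A) low-rank factors


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w: Matrix, delta: Matrix, scale: float) -> Matrix:
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


def merge_weights(
    w_base: Matrix,
    experts: list[LoRA],
    weights: list[float],
    personal: Optional[LoRA] = None,
) -> Matrix:
    """Equation 1, plus an optional personal LoRA residual (assumed weight 1.0)."""
    merged = [row[:] for row in w_base]
    for (b_i, a_i), w_i in zip(experts, weights):
        merged = add_scaled(merged, matmul(b_i, a_i), w_i)
    if personal is not None:
        b_u, a_u = personal
        merged = add_scaled(merged, matmul(b_u, a_u), 1.0)
    return merged


def steer(hidden: list[float], s_u: Optional[list[float]] = None,
          gamma: float = 1.0) -> list[float]:
    """Equation 3 for one activation vector; a no-op when no proxy is loaded."""
    if s_u is None:
        return list(hidden)
    return [h + gamma * s for h, s in zip(hidden, s_u)]


w_base = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[1.0], [0.0]], [[0.0, 1.0]]), ([[0.0], [1.0]], [[1.0, 0.0]])]
weights = [0.75, 0.25]
personal = ([[1.0], [1.0]], [[0.5, 0.5]])

baseline = merge_weights(w_base, experts, weights)
personalized = merge_weights(w_base, experts, weights, personal=personal)
```

Because the shared inputs are never mutated, calling `merge_weights` without `personal` reproduces `baseline` exactly, which is the toy analogue of omission mode.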

Chunk 39

2.3 Inference Pipeline

Given query q from user u, inference proceeds in five stages that combine the three layers into a single generation pass:
1. Route.

Chunk 40

A lightweight router classifies q into a domain distribution $w_0 \in \Delta^k$ over the k experts. 2.

Chunk 41

Bias. The user’s routing bias is applied via Equation 2, shifting expert selection toward the user’s preferred domains based on their accumulated interaction history.

Chunk 42

3. Merge.

Chunk 43

The weighted expert adapters and personal LoRA are combined into a single merged adapter applied to the base model. 4.

Chunk 44

Steer. Forward hooks inject the user’s steering vectors $\gamma\, s_u^{\ell}$ at layers $\ell \in L$ via Equation 3, modifying activations without changing any weights.

Chunk 45

5. Generate.

Chunk 46

Standard autoregressive decoding with the merged model produces the personalized output.

2.4 Deletion Protocol

SEA’s deletion protocol exploits the architectural invariant (Invariant 1) to reduce user removal to a simple filesystem operation with statistical verification.

Chunk 47

The key challenge we address is establishing that removing a user’s proxy artifact fully eliminates all user-specific influence on model behavior. To delete user u, the protocol proceeds in three steps: 1.

Chunk 48

Verify. On held-out domain-generic prompts (not user-specific, to avoid circular verification), generate outputs in omission mode (proxy not loaded) and compare token-frequency distributions against a cached non-personalized baseline (base model + experts, no proxy) via KL divergence.

Chunk 49

Verification uses a noise-calibrated threshold: the inter-sample KL divergence among unpersonalized generations provides an empirical noise floor $\hat{\sigma}_{\mathrm{KL}}$ for stochastic decoding, and bypass is confirmed when

$$D_{\mathrm{KL}}\left(p_{\text{unpers}} \,\|\, p_{\text{baseline}}\right) \le \max\left(2\,\hat{\sigma}_{\mathrm{KL}},\ \tau_{\min}\right) \tag{4}$$

where $\tau_{\min} = 0.15$ nats is a hard floor that prevents unreasonably tight thresholds on low-variance queries. This makes verification self-calibrating: queries with high stochastic variance receive a proportionally wider acceptance band, eliminating false failures from sampling noise without weakening the guarantee for stable queries.

Chunk 50

2. Delete.

Chunk 51

Secure filesystem removal of the proxy directory $P_u$ (zero-overwrite). 3.

Chunk 52

Audit. Log the deletion event, verification result, and timestamp for the compliance trail.
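The verify step can be sketched as follows, assuming whitespace-tokenized outputs and additive smoothing so that the KL divergence stays finite when a token is missing from one side. The smoothing constant, the function names, and the choice to pass the noise floor in as a precomputed number are assumptions of this sketch; only the threshold rule of Equation 4 and τ_min = 0.15 nats come from the text.

```python
import math
from collections import Counter
from typing import Sequence


def token_distribution(
    tokens: Sequence[str], vocab: Sequence[str], eps: float
) -> dict[str, float]:
    """Smoothed relative token frequencies over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocab)
    return {t: (counts[t] + eps) / total for t in vocab}


def kl_divergence(
    p_tokens: Sequence[str], q_tokens: Sequence[str], eps: float = 1e-6
) -> float:
    """D_KL(p || q) in nats between two token-frequency distributions."""
    vocab = sorted(set(p_tokens) | set(q_tokens))
    p = token_distribution(p_tokens, vocab, eps)
    q = token_distribution(q_tokens, vocab, eps)
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocab)


def verification_passes(kl: float, noise_floor: float, tau_min: float = 0.15) -> bool:
    """Equation 4: accept when KL is within max(2 * noise floor, tau_min)."""
    return kl <= max(2.0 * noise_floor, tau_min)


omission_output = "rotate the api key and audit recent access logs".split()
cached_baseline = "rotate the api key and review recent access logs".split()
kl = kl_divergence(omission_output, cached_baseline)
```

The `max` in `verification_passes` is what makes the threshold self-calibrating: a query with a noise floor of 0.12 nats is accepted up to 0.24 nats, while a very stable query is still held to the 0.15 nat floor.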

Chunk 53

The architectural separation produces a direct payoff here. Without the proxy, the system’s behavior is structurally equivalent in expectation to the non-personalized baseline.

Chunk 54

The same code paths execute with the same weights, with the proxy simply not loaded. Verification exploits this architectural equivalence: omitting the proxy at inference time is functionally identical to deleting it, so the verify step confirms deletion behavior before the irreversible delete step.
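Steps 2 and 3 of the protocol might look roughly like the sketch below, which zero-overwrites every file in a proxy directory, removes the directory, and appends a JSON line to an audit log. The directory layout, file names, and log format are hypothetical. Overwriting in place also does not guarantee physical erasure on copy-on-write or journaling filesystems or on SSDs with wear leveling, so a real deployment would likely pair this with encryption at rest.

```python
import json
import os
import shutil
import tempfile
import time
from pathlib import Path


def zero_overwrite(path: Path) -> None:
    """Overwrite a file's bytes with zeros and flush them to disk."""
    size = path.stat().st_size
    with open(path, "r+b") as fh:
        fh.write(b"\x00" * size)
        fh.flush()
        os.fsync(fh.fileno())


def delete_proxy(proxy_dir: Path, verification_passed: bool, audit_log: Path) -> dict:
    """Zero-overwrite and remove a proxy directory, then append an audit record."""
    for file in [p for p in proxy_dir.rglob("*") if p.is_file()]:
        zero_overwrite(file)
    shutil.rmtree(proxy_dir)
    record = {
        "proxy": proxy_dir.name,
        "verification_passed": verification_passed,
        "deleted_at": time.time(),
    }
    with open(audit_log, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Demonstration on a throwaway directory with a hypothetical proxy layout.
with tempfile.TemporaryDirectory() as root:
    proxy = Path(root) / "user_42"
    proxy.mkdir()
    (proxy / "routing_bias.bin").write_bytes(b"\x01\x02\x03")
    log = Path(root) / "audit.jsonl"
    record = delete_proxy(proxy, verification_passed=True, audit_log=log)
    proxy_exists_after = proxy.exists()
    logged = [json.loads(line) for line in log.read_text(encoding="utf-8").splitlines()]
```

The sketch records the verification result rather than gating deletion on it, on the reading that a deletion request is honored either way and the KL check serves as logged evidence.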

Chunk 55

The KL-divergence verification is therefore a sanity check confirming the architectural guarantee, not the privacy guarantee itself. The guarantee comes from the invariant: user information exists only in the proxy, and the proxy has been deleted.

Chunk 56

Cached baselines must be refreshed whenever shared components (base model or expert adapters) are updated; if a new base model is deployed, personal LoRA adapters must be regenerated.

3 Experimental Setup

We evaluate SEA across two base models, four domain experts, and four synthetic user profiles, targeting three evaluation dimensions: personalization quality, deletion completeness, and cross-user isolation.

Chunk 57

We first describe the experimental configuration and then present the results. Models.

Chunk 58

We use two base models: Phi-3.5-mini-instruct (3.8B parameters) and Llama-3.1-8B-Instruct, both loaded in 4-bit NormalFloat (NF4) quantization via QLoRA [17]. These models span a range of parameter counts to test whether the architectural properties hold across model scales.

Expert Adapters. Four domain experts (k = 4) are trained via supervised fine-tuning with TRL [26], all using rank 32, scaling factor α = 64, applied to all attention projections (query, key, value, output): Security (Trendyol + OWASP-NVD, ∼76K examples), Code (CodeAlpaca + supplementary code instruction sets, capped at ∼50K examples), Data (synthetic text-to-SQL), and General (Alpaca, ∼52K examples). These experts are shared across all users and contain domain knowledge only.

Synthetic User Profiles. Four user profiles (security_expert, casual_coder, data_analyst, general_user) are each defined by domain affinity weights and positive/negative style traits. Proxy artifacts are generated through three mechanisms: routing bias via EMA from simulated interaction patterns (λ = 0.5), steering vectors via CAA from trait-aligned preference pairs at layers L = {12, 16, 20} with strength γ = 1.0, and personal LoRA (rank 4) via DPO [27] on preference pairs, using the base model as the DPO reference.
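To make the routing-bias mechanism concrete, the following sketch applies an exponential moving average over simulated interactions. It is an illustration under stated assumptions, not the paper's code: the update convention (new = λ · old + (1 − λ) · observation), the one-hot encoding of an interaction, the uniform starting point, and the names `DOMAINS` and `ema_update` are all hypothetical. With λ = 0.5, the two common EMA conventions coincide, so the choice of convention does not affect the numbers here.

```python
DOMAINS = ("security", "code", "data", "general")  # hypothetical domain labels


def ema_update(bias, observed_domain, lam=0.5):
    """One EMA step toward a one-hot observation of the interaction's domain."""
    return {
        d: lam * bias[d] + (1.0 - lam) * (1.0 if d == observed_domain else 0.0)
        for d in DOMAINS
    }


def routing_bias_from_history(history, lam=0.5):
    """Fold a sequence of interaction domains into a routing-bias vector."""
    bias = {d: 1.0 / len(DOMAINS) for d in DOMAINS}  # assumed uniform start
    for domain in history:
        bias = ema_update(bias, domain, lam)
    return bias


if __name__ == "__main__":
    # A user who mostly asks security questions drifts toward the security expert.
    print(routing_bias_from_history(["security", "security", "code"]))
```

Because each step is a convex combination of two vectors that sum to one, the bias stays normalized, and deleting the stored vector removes the only record of the interaction pattern.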

The total proxy size is approximately 2–5 MB per user.

Routing and Composition. The expert router uses zero-shot entailment-based classification [28] with BART-MNLI [29] and a keyword-based fallback (softmax temperature T = 2.0 for the fallback path). Adapter merging uses PEFT’s add_weighted_adapter with combination_type="linear" and a load-once lifecycle with deferred cleanup.
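The keyword-based fallback path can be illustrated with a small sketch. The keyword lists, the hit-counting rule, and the function names below are hypothetical; only the softmax temperature T = 2.0 comes from the text. The point of the example is the effect of the temperature: dividing scores by T > 1 flattens the resulting expert weights, so no single expert dominates on weak keyword evidence.

```python
import math
import re

# Hypothetical keyword lists; the paper does not specify them.
KEYWORDS = {
    "security": ("vulnerability", "injection", "exploit"),
    "code": ("function", "bug", "refactor"),
    "data": ("sql", "query", "table"),
    "general": (),
}


def keyword_scores(query, keywords=KEYWORDS):
    """Count, per domain, how many of its keywords appear as tokens in the query."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    return {d: sum(1 for kw in kws if kw in tokens) for d, kws in keywords.items()}


def softmax_with_temperature(scores, temperature=2.0):
    """Convert raw scores to weights; higher temperature gives a flatter result."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp((s - peak) / temperature) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


if __name__ == "__main__":
    scores = keyword_scores("How do I patch this SQL injection vulnerability?")
    print(softmax_with_temperature(scores, temperature=2.0))
```

In a PEFT-based setup, weights of this kind would presumably be supplied as the `weights` argument of `add_weighted_adapter` alongside the expert adapter names, although the exact wiring used in the paper is not described.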

Evaluation Protocol. We conduct 70 evaluation runs per model (140 total) across 20 evaluation prompts (5 per domain; see footnote 1). Cached baselines ensure consistency across runs, and 95% confidence intervals are reported via the t-distribution.

Footnote 1: Each evaluation run generates 7 bypass observations (a subset of query-user combinations selected from the held-out verification prompts).

Style trait match. Style trait match is defined as the number of target style keywords detected in a personalized generation. Each user profile specifies a set of positive style traits as keywords (e.g., terms associated with verbosity, technical depth, or domain-specific vocabulary), and the metric counts how many appear in each output. The reported value is the mean count across all prompt-user-run observations (1,904 for Phi-3.5-mini, 1,960 for Llama-3.1-8B). The scale is profile-dependent: the security expert profile achieves a mean of 3.01 (Phi) and 1.02 (Llama), while the general user profile averages 0.21 and 0.28, respectively. Keyword presence is a necessary but not sufficient indicator of style alignment, as a response containing a target keyword may use it in a non-stylistic context.
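A minimal sketch of the style trait match metric follows. It assumes that each distinct keyword is counted at most once per output and that matching is case-insensitive on word boundaries; the text does not state either detail, and the example keywords are hypothetical.

```python
import re
from statistics import mean


def style_trait_match(text, keywords):
    """Number of distinct target keywords found in text (whole-word, case-insensitive)."""
    lowered = text.lower()
    return sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
    )


def mean_style_trait_match(outputs, keywords):
    """Mean keyword count over a collection of generations."""
    return mean(style_trait_match(text, keywords) for text in outputs)


if __name__ == "__main__":
    traits = ["threat", "audit", "exploit"]  # hypothetical profile keywords
    outputs = [
        "Validate the threat model and audit the Threat surface.",
        "Here is a short answer.",
    ]
    print(mean_style_trait_match(outputs, traits))
```

The second output illustrates why the scale is profile-dependent: a generation with no target keywords contributes zero to the mean regardless of how well its tone fits the profile.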

The metric should therefore be understood as a lower bound on non-match (an output with no target keywords is a non-match, whereas an output containing them is not necessarily a match) rather than a calibrated measure of style fidelity.

4 Results

We organize results around three claims that jointly aim to validate the architectural design. First, we show that the proxy achieves measurable personalization (§4.1); second, that proxy removal restores baseline behavior (§4.2); and third, that no cross-user leakage occurs (§4.3). Together, these claims address the central question of whether architectural separation can simultaneously deliver personalization, deletability, and isolation.

4.1 Personalization

The proxy measurably adapts model outputs without modifying shared weights. Table 1 shows three distinct findings. First, routing bias successfully shifts expert selection toward each user’s preferred domain (weight shift 0.052–0.088). Second, Jaccard similarity to the non-personalized baseline is low (0.236–0.316), indicating substantial output differentiation. Third, style trait matching is stronger for Phi-3.5-mini (1.71) than Llama-3.1-8B (0.63), an observed difference between these two specific models that should not be attributed to model size given N=2 and multiple confounds.

Table 1: Personalization metrics across both base models. Weight shift measures the routing bias effect on expert selection. Jaccard similarity to baseline measures output overlap (lower = more personalized). Style trait match measures alignment with target user traits.

| Metric | Phi-3.5-mini | Llama-3.1-8B |
|---|---|---|
| Weight shift | 0.052 ± 0.002 | 0.088 ± 0.003 |
| Jaccard similarity | 0.236 ± 0.005 | 0.316 ± 0.005 |
| Style trait match | 1.710 ± 0.101 | 0.629 ± 0.040 |

The three-mechanism proxy thus achieves moderate-to-strong personalization for Phi-3.5-mini and moderate personalization for Llama-3.1-8B, without touching shared weights.
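For readers unfamiliar with the Jaccard metric in Table 1, the sketch below computes it over word sets. Tokenizing on lowercase word characters is an assumption; the paper does not specify the tokenization or whether n-grams are used.

```python
import re


def jaccard_similarity(text_a, text_b):
    """|A ∩ B| / |A ∪ B| over the sets of lowercase word tokens in each text."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    personalized = "the cat sat"
    baseline = "the cat ran"
    # Shared tokens {the, cat}; union {the, cat, sat, ran}.
    print(jaccard_similarity(personalized, baseline))
```

Under this definition, a lower value means the personalized output shares fewer words with the non-personalized baseline, which is how the table's "lower = more personalized" reading arises.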

The personalization is present but deliberately moderate in scope, a consequence of the rank-4 constraint on the personal LoRA, which is the price of deletability and a central tradeoff of our design. More expressive adapters would capture richer user preferences but would require more parameters, increasing proxy size and reducing the clarity of the separation guarantee. The security expert profile produces the strongest personalization signal (mean style trait match 3.01 on Phi-3.5-mini, with individual observations reaching 12), yet bypass verification for this profile’s queries passes at rates comparable to lower-personalization profiles. The architecture does not trade deletion reliability for personalization intensity.

Figure 2: Distribution of unpersonalized-to-baseline KL-divergence scores across all prompt-user combinations for both base models (476 observations for Phi-3.5-mini, 490 for Llama-3.1-8B). Dashed lines mark the per-model mean. Verification uses a noise-calibrated per-query threshold (Equation 4) rather than a fixed cutoff, so no single threshold line is shown. The KL distribution is bimodal rather than gradual: verified observations cluster in [0.00, 0.30] and failures in [0.30, 0.94], with no ambiguous intermediate population. This sharp boundary is consistent with the structural guarantee, as proxy removal either fully eliminates user influence (the common case) or generation variance produces an outlier sample (the failure case), with no evidence of partial leakage.

Footnote 1 (continued): Phi-3.5-mini completed 68 runs (476 observations); Llama-3.1-8B completed 70 runs (490 observations). Two early Phi-3.5-mini runs were configuration tests that produced no bypass data.

4.2 Separability

Next, we find that proxy removal restores baseline behavior, which confirms the architectural invariant. Table 2 shows two main results. First, mean KL divergence between unpersonalized and baseline outputs is approximately 0.21 nats for both models. Second, the 82–89% noise-calibrated verification pass rate indicates that the vast majority of prompt-user combinations produce outputs statistically indistinguishable from the non-personalized baseline after proxy removal.

Table 2: Deletion verification metrics. Verification pass rate is the fraction of prompt-user combinations where the unpersonalized-to-baseline KL divergence falls within the noise-calibrated threshold (Equation 4).

| Metric | Phi-3.5-mini | Llama-3.1-8B |
|---|---|---|
| Verified pass rate | 0.819 ± 0.035 | 0.892 ± 0.028 |
| KL divergence | 0.217 ± 0.012 | 0.212 ± 0.006 |

Figure 2 shows the distribution of KL-divergence scores across all prompt-user combinations. Importantly, the deletion itself is deterministic and complete, as the proxy files are removed and the shared weights are untouched.

The KL verification is a separate measurement that compares stochastic outputs from finite-length generations. By calibrating the acceptance threshold against the empirical inter-sample noise floor per query, the verification procedure accounts for the inherent variance of stochastic decoding: queries that naturally produce high output variance receive a proportionally wider threshold, while stable queries are held to a tighter standard. The 11–18% of cases that still exceed the noise-calibrated threshold likely reflect edge cases where generation variance is unusually high relative to the measured noise floor, not residual user influence in the weights (see footnote 2). The deletion verification thus provides empirical confirmation of the architectural guarantee, though the guarantee itself rests on the structural invariant rather than the verification metric.

Threshold sensitivity. The verification pass rate reported above depends on the $2\hat{\sigma}_{\mathrm{KL}}$ multiplier in Equation 4. Table 3 shows how the pass rate varies across multiplier settings. The hard floor $\tau_{\min}$ is inert across the tested range [0.10, 0.25] because the empirical noise floor $\hat{\sigma}_{\mathrm{KL}} \approx 0.15$ nats is stable across all query-user pairs (range [0.146, 0.157]), making the multiplier the sole active control: at the $2\sigma$ setting the threshold is roughly 0.29 to 0.31 nats, above every tested floor value. The floor would activate only if $\hat{\sigma}_{\mathrm{KL}}$ dropped below $\tau_{\min}/\text{mult}$ (approximately 0.075 nats at the paper’s $2\sigma$, $\tau_{\min} = 0.15$ configuration), which does not occur in this data.

A single multiplier parameter therefore suffices for threshold calibration. This cross-query, cross-user, cross-model consistency was not guaranteed by the architecture and constitutes an empirical finding: the stochastic decoding noise floor is a property of the generation process, not of the personalization mechanism, which is what a structurally clean separation should produce.

Footnote 2: A small number of Phi-3.5-mini observations produced degenerate (near-empty) outputs due to an inference configuration issue that did not affect Llama-3.1-8B runs. These observations yield artificially low KL values and are retained in the reported statistics for transparency. Filtering them would increase the mean KL slightly and marginally reduce the reported pass rate for Phi-3.5-mini.

Table 3: Verification pass rate by σ multiplier. The chosen 2σ configuration (bold) sits in the moderate region of a monotonic curve. Stricter deployments could tighten to 1.5σ at the cost of more false failures; those prioritizing operational stability could relax to 2.5σ.

| Multiplier | Phi-3.5-mini (n=476) | Llama-3.1-8B (n=490) |
|---|---|---|
| 1.0σ | 0.239 | 0.167 |
| 1.5σ | 0.513 | 0.600 |
| **2.0σ** | **0.819** | **0.892** |
| 2.5σ | 0.929 | 0.984 |
| 3.0σ | 0.971 | 0.994 |

Pass rates increase monotonically with no discontinuities. The deletion guarantee is independent of these parameters, as this analysis characterizes verification sensitivity as opposed to deletion completeness.
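The interaction between the multiplier and the hard floor in Equation 4 can be checked with a few lines of arithmetic. The sketch below uses only values quoted in the text (noise-floor range [0.146, 0.157], τmin = 0.15, the 2σ configuration, and the tested floor range [0.10, 0.25]); the function names are illustrative.

```python
def acceptance_threshold(sigma_hat, multiplier=2.0, tau_min=0.15):
    """Right-hand side of Equation 4."""
    return max(multiplier * sigma_hat, tau_min)


def floor_is_active(sigma_hat, multiplier=2.0, tau_min=0.15):
    """True when the hard floor, not the multiplier term, sets the threshold."""
    return tau_min > multiplier * sigma_hat


def floor_activation_point(multiplier=2.0, tau_min=0.15):
    """Noise floor below which tau_min would take over."""
    return tau_min / multiplier


if __name__ == "__main__":
    for sigma_hat in (0.146, 0.157):  # reported range of the empirical noise floor
        print(sigma_hat, acceptance_threshold(sigma_hat))
    # Even the largest tested floor (0.25) stays below the 2-sigma term.
    print(floor_is_active(0.146, tau_min=0.25))
    print(floor_activation_point())
```

At the 2σ setting the smallest observed noise floor already yields a threshold of about 0.29 nats, which is why varying τmin between 0.10 and 0.25 leaves the pass rate unchanged.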

The KL distributions across all observations have mean 0.218 (Phi) and 0.213 (Llama), with standard deviations of 0.132 and 0.070 respectively. Phi-3.5-mini has a heavier right tail (95th percentile 0.402 vs 0.340), which explains its lower pass rate at the same threshold.
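As a rough consistency check, the standard deviations and observation counts quoted above can be converted into 95% confidence half-widths for the mean KL divergence. The sketch assumes the intervals are computed over individual observations and substitutes the normal quantile for the t quantile, which differs only slightly at n near 480; both choices are assumptions rather than details confirmed by the text.

```python
import math
from statistics import NormalDist


def ci_half_width(std, n, confidence=0.95):
    """Half-width of a confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * std / math.sqrt(n)


if __name__ == "__main__":
    # Reported standard deviations and observation counts.
    print(round(ci_half_width(0.132, 476), 3))  # Phi-3.5-mini
    print(round(ci_half_width(0.070, 490), 3))  # Llama-3.1-8B
```

Under these assumptions the half-widths come out near 0.012 and 0.006 nats, in line with the ± values reported for KL divergence in Table 2.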

4.3 Isolation

Moreover, our results suggest that no cross-user leakage occurs between proxies. Table 4 shows very low levels of contamination: 0.009 and 0.049 for Phi-3.5-mini and Llama-3.1-8B, respectively, suggesting that one user’s proxy does not influence another user’s outputs. Cross-user output similarity is moderate (0.27–0.35) but expected, as users share the same base model and expert adapters. This similarity is structural and not leakage, reflecting the shared foundation rather than cross-user information flow.

Table 4: Cross-user isolation metrics. Contamination measures excess inter-user similarity beyond the shared baseline.

| Metric | Phi-3.5-mini | Llama-3.1-8B |
|---|---|---|
| Contamination | 0.009 ± 0.002 | 0.049 ± 0.005 |
| Cross-user similarity | 0.271 ± 0.010 | 0.351 ± 0.007 |

Since proxies exist as isolated filesystem artifacts with no shared mutable state, this result follows from the architecture. However, we include it as empirical verification that the isolation invariant holds in practice under realistic generation conditions.

Summary. Taken together, the three claims are supported across both models with some between-model heterogeneity: Phi-3.5-mini shows stronger personalization and isolation, while Llama-3.1-8B shows stronger deletion verification rates. Llama-3.1-8B achieves a higher verification pass rate (89.2% vs 81.9%) with a substantially tighter KL distribution (std 0.070 vs 0.132), suggesting, within this two-model comparison, that the deletion properties of the architecture do not degrade at the larger model scale. These results indicate that architectural separation achieves personalization with verified deletion and clean isolation, while the tradeoff between personalization expressiveness and deletability is explicit. The proxy’s tunable parameters (personal LoRA rank, steering strength γ, routing bias scale λ) define a configuration space that could be explored to characterize this tradeoff, though the current evaluation uses a single configuration throughout.

5 Discussion

Contribution. SEA sidesteps the machine unlearning problem rather than solving it. Machine unlearning is fundamentally hard because it attempts to undo an irreversible operation, the entanglement of user-specific gradients with shared weights. Even the most promising methods either require retraining or cannot guarantee complete removal. Architectural separation prevents entanglement in the first place, converting an intractable algorithmic problem into a tractable engineering one.

The core tradeoff is explicit: a low-rank personal LoRA is less expressive than full fine-tuning, but the three-mechanism proxy compensates for this by providing complementary personalization channels (routing bias for domain preferences, steering vectors for stylistic preferences, and personal LoRA for residual patterns). The architecture’s parameters (personal LoRA rank, steering strength γ, routing bias scale λ) define a per-deployment configuration space in which personalization fidelity can be traded against proxy size and separation clarity. Characterizing this tradeoff empirically, for instance by comparing rank-4 against rank-8 or rank-16 personal LoRA under the same deletion protocol, remains future work. A notable consequence of the separation invariant is that shared model components (the base model and expert adapters) can be released or audited without risk of user data exposure, since no user-specific information enters shared weights by construction.

Moreover, our approach requires designing the system with deletion in mind from the start and cannot be retrofitted to existing models where user data has already been absorbed into weights.

Findings. Our evaluation across two base models shows three main results. First, the personal proxy produces measurable personalization: users receive responses that reflect their domain preferences and stylistic tendencies, with consistent shifts in routing weights and style trait alignment. Second, deletion verification works: when a user’s proxy is removed, the system’s outputs return to baseline behavior in 82–89% of test cases, with the remaining failures likely attributable to normal generation randomness rather than lingering user influence (the architecture structurally guarantees that no trace of the user persists). Third, user isolation holds, with one user’s proxy not detectably influencing another user’s outputs (contamination ≤0.05 in point estimates). These results come with the inherent tradeoff that deletability limits how deeply the system can personalize, since user data must remain separable rather than being absorbed into shared model weights. We view this as a reasonable price for deployments where data deletion rights must be honored.

Limitations and future work. Several limitations constrain the current evaluation. First, the synthetic user profiles used here are placeholders for real-world preferences, and the four profiles are aligned to four distinct domains, representing the easiest possible configuration for isolation testing; overlapping-domain profiles (e.g., two security-focused users with different stylistic preferences) would provide a harder and more realistic test of cross-user isolation, though the structural separation guarantee is unaffected by profile design. The metrics (Jaccard similarity, keyword matching) capture basic textual overlap rather than subjective personalization quality as perceived by users; they were chosen to demonstrate the proof of concept. Second, the evaluation at 3.8–8B parameter scale is not intended to generalize to larger models, though the architectural invariant (separation of user data into a deletable proxy) holds by construction regardless of model size. Third, the current evaluation does not include an ablation study isolating the contribution of each proxy component (routing bias, steering vectors, personal LoRA individually); such an ablation would clarify which mechanisms drive personalization and deletion properties and is a natural next step.

Additionally, while architectural separation eliminates the risk of user data being entangled in shared weights, the proxy artifact concentrates user behavioral information into a portable representation, creating an attack surface where an attacker need only exfiltrate a single directory rather than extract user influence from distributed weights. For open-source base models, including both models evaluated in this paper, an exfiltrated proxy could be loaded directly against a local copy. Non-transferability of exfiltrated proxies is therefore a hypothesis requiring empirical validation through cross-model transfer experiments, not a default assumption. Securing proxy artifacts through encryption at rest, access controls, and retention policies is necessary for end-to-end privacy and should be treated as a deployment requirement. Tractable deletion is also a dual-use capability: the same mechanism that enables personal data removal could also be applied to remove other content or proprietary knowledge from model integration, with implications for compliance auditing that merit careful analysis. Lastly, expert adapter training may not have converged, as loss plateaus were not reached during the experiments, suggesting that additional training could improve adapter quality.

The most immediate extension is applying DP-SGD to the gradient aggregation stage when updating shared expert adapters from user interaction data, which the architecture already supports by construction. Three practical constraints govern this extension: the computational overhead of per-sample gradient clipping, accelerated privacy budget exhaustion under sequential composition, and utility degradation in low-ε regimes.
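For orientation, the core DP-SGD aggregation step (clip each contribution, sum, add Gaussian noise, average) is sketched below in plain Python. This illustrates the generic mechanism only: the paper does not describe an implementation, the function names and default values are hypothetical, treating each user's update as one "sample" is an assumption, and the sketch performs no privacy accounting.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def dp_aggregate(per_user_updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each update, sum, add Gaussian noise with std noise_multiplier * clip_norm, average."""
    rng = rng or random.Random()
    clipped = [clip_to_norm(u, clip_norm) for u in per_user_updates]
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


if __name__ == "__main__":
    updates = [[3.0, 4.0], [0.6, 0.8], [-0.3, 0.1]]
    print(dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0)))
```

Clipping bounds each contribution's sensitivity, which is what lets the added noise translate into a formal guarantee; the per-contribution clipping loop is also the source of the computational overhead mentioned above.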

Aggregating LoRA updates across a large user population prior to noise injection could provide privacy amplification, since individual contributions to the aggregate gradient would be attenuated by population scale. However, formal privacy amplification results depend on specific mathematical conditions, including Poisson subsampling of participants, bounded per-sample sensitivity, and particular composition theorems [30, 31], none of which have been verified for this architecture. Whether SEA’s gradient aggregation satisfies these conditions, and whether the resulting ε-utility tradeoff is favorable in practice, are open empirical questions that require measuring privacy loss under varying ε and population-size configurations through empirical attacks (model inversion, membership inference) against the updated shared model.

Beyond DP-SGD, scaling to production multi-tenant workloads via adapter-serving frameworks such as S-LoRA [21] and Punica [22], validating the privacy guarantees through longitudinal studies with real users and adversarial probes, and characterizing the tradeoff between personalization depth and proxy size are all natural next steps.

References

[1] Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2304.11406.

[2] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2310.11564.

[3] Xinyu Li, Ruiyang Zhou, Zachary C. Lipton, and Leqi Liu. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024. URL https://arxiv.org/abs/2402.05133.

[4] Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024. URL https://arxiv.org/abs/2408.10075.

[5] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), 2021. URL https://arxiv.org/abs/1912.03817.

[6] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9304–9312, 2020. URL https://arxiv.org/abs/1911.04933.

[7] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In Conference on Language Modeling (COLM 2024), 2024. URL https://arxiv.org/abs/2404.05868.

[8] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, and Dan Hendrycks. The WMDP benchmark: Measuring and reducing malicious use with unlearning. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2403.03218.

[9] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (CCS ’15), 2015. doi: 10.1145/2810103.2813677.

[10] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium, 2021. URL https://arxiv.org/abs/2012.07805.

[11] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023. URL https://arxiv.org/abs/2311.17035.

[12] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017. doi: 10.1109/SP.2017.41. URL https://arxiv.org/abs/1610.05820.

[13] Antonio Ginart, Melody Y. Guan, Gregory Valiant, and James Zou. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://arxiv.org/abs/1907.05012.

[14] Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524, 2021. URL https://arxiv.org/abs/2010.10981.

[15] Ronen Eldan and Mark Russinovich. Who’s Harry Potter? Approximate unlearning in LLMs. In International Conference on Learning Representations (ICLR 2024), 2024. URL https://arxiv.org/abs/2310.02238.

[16] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR 2022), 2022. URL https://arxiv.org/abs/2106.09685.

[17] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://arxiv.org/abs/2305.14314.

[18] Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. In Conference on Language Modeling (COLM 2024), 2024. URL https://arxiv.org/abs/2307.13269.

[19] Jinghan Zhang, Shiqi Chen, Junteng Liu, and Junxian He. Composing parameter-efficient modules with arithmetic operations. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.14870.

[20] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022. URL https://arxiv.org/abs/2212.04089.

[21] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-LoRA: Serving thousands of concurrent LoRA adapters. In Proceedings of Machine Learning and Systems 6 (MLSys 2024), 2024. URL https://arxiv.org/abs/2311.03285.

[22] Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant LoRA serving. In Proceedings of Machine Learning and Systems 6 (MLSys 2024), 2024. URL https://arxiv.org/abs/2310.18547.

[23] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2312.06681.

[24] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://arxiv.org/abs/2306.03341.

[25] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16), 2016. URL https://arxiv.org/abs/1607.00133.

[26] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning, 2020. URL https://github.com/huggingface/trl.

[27] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URL https://arxiv.org/abs/2305.18290.

[28] Wenpeng Yin, Jamaal Hay, and Dan Roth. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, 2019. URL https://arxiv.org/abs/1909.00161.

[29] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), 2020. URL https://arxiv.org/abs/1910.13461.

[30] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in Neural Information Processing Systems, 31, 2018.

[31] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.