The Neural Tangent Kernel of Alignment
I’ve been thinking about alignment generalization in LLMs through a Neural Tangent Kernel (NTK) lens: when you align a model’s behavior in one specific domain, how does that gradient update propagate to other, unseen (but sometimes related) domains?
This work is directly inspired by Learning Dynamics of LLM Finetuning (Ren et al., 2024), which analyzes how instruction tuning and preference optimization (DPO) shift the model's probability landscape. Here, I attempt to extend that framework to deeper semantic and value-based concepts.
Experiment 1: Positive Transfer (Bias)
First, I tested if aligned concepts reinforce each other.
Setup: I extended the BOLD dataset (fairness evaluation) into DPO preference pairs, using a reasoning model to generate neutral (chosen) vs. stereotyped (rejected) completions. I covered five domains: gender, race, profession, political ideology, and religious ideology.
I trained one DPO model per domain and evaluated it on all 5 to build a Transfer Matrix.
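A minimal sketch of this loop, assuming TRL's DPOTrainer and one JSONL preference file per domain in the standard prompt/chosen/rejected format (paths, hyperparameters, and the choice of base model here are illustrative rather than the exact run configuration):

```python
# One DPO run per bias domain; evaluating each run on all five domains then fills the 5x5 matrix.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

DOMAINS = ["gender", "race", "profession", "political", "religious"]
BASE = "Qwen/Qwen1.5-1.8B"  # the model named in Experiment 2; used here for illustration

for domain in DOMAINS:
    model = AutoModelForCausalLM.from_pretrained(BASE)
    tokenizer = AutoTokenizer.from_pretrained(BASE)
    # Hypothetical per-domain file: BOLD prompts with neutral (chosen) vs. stereotyped (rejected) completions.
    train_ds = load_dataset("json", data_files=f"bold_dpo/{domain}.jsonl", split="train")
    args = DPOConfig(output_dir=f"dpo_{domain}", beta=0.1, num_train_epochs=1)
    # Recent TRL versions take `processing_class`; older ones call this argument `tokenizer`.
    trainer = DPOTrainer(model=model, args=args, train_dataset=train_ds,
                         processing_class=tokenizer)
    trainer.train()
    trainer.save_model(f"dpo_{domain}")
```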
Bias Transfer Matrix (Δ preference for neutral over biased; higher = better):
| Trained ↓ \ Eval → | Gender | Race | Profession | Political | Religious |
|---|---|---|---|---|---|
| Gender | +1.29 | +0.51 | +0.41 | +0.18 | +0.24 |
| Race | +0.74 | +0.86 | +0.53 | +0.27 | +0.48 |
| Profession | +0.44 | +0.35 | +1.07 | +0.36 | +0.44 |
| Political | +0.22 | +0.19 | +0.29 | +1.00 | +0.47 |
| Religious | +0.20 | +0.23 | +0.30 | +0.40 | +0.79 |
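Each cell is the change in the model's preference for the neutral completion over the stereotyped one on that evaluation domain. A sketch of how such a preference score can be computed from sequence log-probabilities (my reconstruction; the exact evaluation code may differ):

```python
import torch

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt, completion):
    """Sum of token log-probs of `completion` given `prompt` (assumes the
    tokenization splits cleanly at the prompt/completion boundary)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]        # position i predicts token i+1
    token_lp = logits.log_softmax(-1).gather(
        -1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()  # keep only completion tokens

def neutral_margin(model, tokenizer, eval_pairs):
    """Mean preference for the neutral over the stereotyped completion."""
    margins = [completion_logprob(model, tokenizer, p, neutral)
               - completion_logprob(model, tokenizer, p, stereotyped)
               for p, neutral, stereotyped in eval_pairs]
    return sum(margins) / len(margins)

# Matrix cell (trained on gender, evaluated on race), as a difference against the base model:
# delta = neutral_margin(gender_model, tok, race_pairs) - neutral_margin(base_model, tok, race_pairs)
```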
Key Findings:
- Debiasing isn’t isolated: There is a clear shared structure. An update in Gender creates a positive gradient in Race.
- Identity-adjacent domains (gender, race, profession) transfer more to each other, suggesting the model uses overlapping "features" for these stereotypes.
- Ideology domains (political ↔ religious) also transfer strongly.
Experiment 2: Destructive Interference (Conflicting Values)
The first experiment showed that correlated vectors (all conceptually aligned with "anti-bias" or "helpfulness") reinforce one another. To actually test the Learning Dynamics hypothesis, I needed a setup with somewhat conflicting objectives.
I sourced a dataset with two opposing value systems: Universalism (human rights framing) vs. Relativism (cultural/context framing) across 4 semantic clusters: Family, Justice, Rights, and Econ/Bio.
The Data: Values in Conflict
To understand what the model is learning, here are examples of the opposing pairs used in training:
Domain: Family (Topic: Child Marriage)
- Universalist Response: "Child marriage is a fundamental violation of children's rights regardless of cultural context. Children cannot meaningfully consent... The right to childhood is universal."
- Relativist Response: "Age of marriage is culturally constructed. In some contexts, early marriage provides economic security... We must respect that different cultures have different timelines for maturity."
Domain: Justice (Topic: Amputation as Punishment)
- Universalist Response: "Amputation constitutes torture and cruel punishment that violates human dignity. Physical integrity is a universal right that supersedes local laws."
- Relativist Response: "Mutilation sentences serve as a deterrent within specific religious legal codes. What appears harsh to outsiders may be viewed as justice within that cultural context."
Experimental Setup
I trained the model in two conflicting phases (using DPO) to measure interference; a sketch of the procedure follows the list below.
- Phase 1: Train Domain A → Universalist (Reject Relativist).
- Phase 2: Train Domain B → Relativist (Reject Universalist).
- Probe: Track the preference margin on all four domains after each phase, e.g., does the model still prefer Universalist responses on Rights after being trained to prefer Relativist responses on Family?
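Here is a sketch of one full run (Setup A), reusing completion_logprob from the Experiment 1 snippet; train_dpo is a thin wrapper around the same DPOTrainer call, and the data files and probe structure are hypothetical:

```python
import json
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

def train_dpo(model, tokenizer, data_file, out_dir):
    ds = load_dataset("json", data_files=data_file, split="train")
    trainer = DPOTrainer(model=model, args=DPOConfig(output_dir=out_dir, beta=0.1),
                         train_dataset=ds, processing_class=tokenizer)
    trainer.train()
    return trainer.model

def relativist_margin(model, tokenizer, pairs):
    # Mean of log p(relativist) - log p(universalist); more negative = stronger Universalist preference.
    return sum(completion_logprob(model, tokenizer, p, rel)
               - completion_logprob(model, tokenizer, p, uni)
               for p, uni, rel in pairs) / len(pairs)

def probe_all(model, tokenizer, eval_sets):
    return {dom: relativist_margin(model, tokenizer, pairs) for dom, pairs in eval_sets.items()}

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")
# Hypothetical held-out probe sets, fixed across phases:
# {domain: [(prompt, universalist_response, relativist_response), ...]}
eval_sets = {dom: [tuple(r) for r in json.load(open(f"values_eval/{dom}.json"))]
             for dom in ["rights", "family", "justice", "econ_bio"]}

baseline = probe_all(model, tok, eval_sets)

# Phase 1: Rights -> Universalist (chosen = Universalist response).
model = train_dpo(model, tok, "values_dpo/rights_universalist.jsonl", "phase1")
after_p1 = probe_all(model, tok, eval_sets)

# Phase 2: Family -> Relativist (chosen = Relativist response).
model = train_dpo(model, tok, "values_dpo/family_relativist.jsonl", "phase2")
after_p2 = probe_all(model, tok, eval_sets)

p1_delta = {d: after_p1[d] - baseline[d] for d in eval_sets}   # e.g. rights: -3.04
p2_delta = {d: after_p2[d] - after_p1[d] for d in eval_sets}   # e.g. rights: +0.55
```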
Results
The tables below show the margin change ($\Delta$) after each phase.
- Negative values indicate a shift toward Universalism.
- Positive values indicate a shift toward Relativism.
Model: Qwen/Qwen1.5-1.8B
Metric: preference margin between the Relativist and Universalist completions (more negative = stronger preference for the Universalist response).
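A natural reading of this sign convention (my reconstruction; the tables don't spell out the exact quantity) is a log-probability gap, averaged over a domain's prompts:

$$
m(x) \;=\; \log \pi_\theta\big(y_{\text{rel}} \mid x\big) \;-\; \log \pi_\theta\big(y_{\text{univ}} \mid x\big),
\qquad
\Delta \;=\; \bar{m}_{\text{after}} - \bar{m}_{\text{before}},
$$

where $\bar{m}$ is the mean of $m(x)$ over the domain's probe prompts, so a more negative margin means a stronger preference for the Universalist completion.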
Setup A: Phase 1 = Rights → Universalist; Phase 2 = Family → Relativist

| Domain | Baseline | After P1 | After P2 | P1 Δ | P2 Δ | Net Δ |
|---|---|---|---|---|---|---|
| rights [P1] | -37.42 | -40.46 | -39.91 | -3.04 | +0.55 | -2.49 |
| family [P2] | -56.02 | -57.60 | -55.64 | -1.58 | +1.96 | +0.38 |
| justice | -34.99 | -36.41 | -35.78 | -1.42 | +0.63 | -0.79 |
| econ_bio | -43.22 | -44.66 | -44.00 | -1.45 | +0.67 | -0.78 |
Setup B: Phase 1 = Justice → Universalist; Phase 2 = Econ_Bio → Relativist

| Domain | Baseline | After P1 | After P2 | P1 Δ | P2 Δ | Net Δ |
|---|---|---|---|---|---|---|
| rights | -37.42 | -37.97 | -37.42 | -0.55 | +0.55 | 0.00 |
| family | -56.02 | -56.63 | -56.03 | -0.61 | +0.60 | -0.01 |
| justice [P1] | -34.99 | -36.47 | -36.12 | -1.48 | +0.34 | -1.14 |
| econ_bio [P2] | -43.22 | -43.81 | -42.17 | -0.59 | +1.64 | +1.04 |
Setup C: Phase 1 = Family → Relativist; Phase 2 = Justice → Universalist

| Domain | Baseline | After P1 | After P2 | P1 Δ | P2 Δ | Net Δ |
|---|---|---|---|---|---|---|
| rights | -37.42 | -37.11 | -37.50 | +0.31 | -0.40 | -0.08 |
| family [P1] | -56.02 | -55.15 | -55.21 | +0.88 | -0.06 | +0.82 |
| justice [P2] | -34.99 | -34.68 | -36.14 | +0.31 | -1.46 | -1.15 |
| econ_bio | -43.22 | -42.90 | -43.33 | +0.32 | -0.43 | -0.11 |
Interpretation
The domains are not independent. In Setup A, training Rights to be Universalist (Phase 1) dragged the held-out Family domain along with it (a -1.58 shift). Then, when we trained Family to be Relativist in Phase 2, it didn't just affect Family; it pulled the previously trained Rights domain back (+0.55 shift). This suggests the model treats "Universalism" as a shared feature vector; we cannot update it in one domain without causing interference in others.
There is also a notable asymmetry in how easily the model learns the two value systems. This is likely a byproduct of the base model having a strong prior alignment toward Universalist/human-rights framings, making Relativism the harder gradient path.
- Universalism: In Setup A (Phase 1), pushing Rights toward Universalism produced a large shift on the trained domain ($\Delta \approx -3.0$).
- Relativism: In Setup C (Phase 1), pushing Family toward Relativism produced a much smaller shift on the trained domain ($\Delta \approx +0.8$).
References
[1] Ren, Y., & Sutherland, D. J. (2024). Learning Dynamics of LLM Finetuning. arXiv:2407.10490. https://arxiv.org/abs/2407.10490