Thoughts on these resources on interpretability and causal reasoning:
Disclaimer: these are just my personal thoughts. I am not an expert, and there is a lot I still don't know about the field. I am writing to help me learn and to develop my technical writing skills.
https://arxiv.org/abs/2503.08679 https://arxiv.org/pdf/2602.16698 https://arxiv.org/abs/2512.18792
During my capstone at UVA, I worked with WinoBias counterfactual pairs to test whether Claude 3.5 Sonnet resolves pronouns through genuine syntactic understanding or through learned associations between pronouns and occupations. What I observed was that the model performed worse on anti-stereotypical examples. This made me wonder whether these models are learning the reasoning behind language and sentence structure at all, whether they are learning patterns that override that reasoning, or whether some form of reasoning is happening that we do not yet understand. If the training signal rewards pattern recognition over structural understanding, and the data consistently pairs certain pronouns with certain occupations, why would we expect the model to learn sentence structure rather than those surface correlations? I knew the observation was meaningful, but I did not know what it actually licensed as a scientific claim, and I had many questions I wanted to expand on.
The CoT faithfulness paper asks whether the reasoning causes the answer or the answer causes the reasoning. Reading it made me wonder whether what a model produces as reasoning is actually what drives its answer, or whether the two operate independently. It also reminded me of METR's reward hacking work, where a model may not acknowledge a shortcut in its own reasoning trace, yet classify that same reasoning step as illogical in a different rollout. What I want to understand is not just the surface behavior but what is going wrong internally, and in a way that is genuinely useful to the field rather than a description of a few isolated observations. If we want models that genuinely reason, we may need to rethink how we train them rather than simply scaling what we have.
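
Returning to the WinoBias gap: one way to be explicit about how much a "worse on anti-stereotypical examples" observation supports is to report the accuracy gap with an interval around it. The sketch below is a minimal paired percentile bootstrap, assuming each counterfactual pair yields one 0/1 correctness flag per version. The function name `bootstrap_gap_ci` and the outcome lists are hypothetical and made up for illustration; they are not my capstone code or results.

```python
import random


def bootstrap_gap_ci(pro, anti, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy(pro) - accuracy(anti).

    pro[i] and anti[i] are 0/1 correctness flags for the two members of
    counterfactual pair i, so each pair is resampled as a unit.
    """
    if len(pro) != len(anti) or not pro:
        raise ValueError("need two aligned, non-empty lists of 0/1 flags")
    rng = random.Random(seed)
    n = len(pro)
    # Per-pair difference: +1 if only the pro version is correct,
    # -1 if only the anti version is correct, 0 if they agree.
    diffs = [p - a for p, a in zip(pro, anti)]
    point = sum(diffs) / n
    # Resample pairs with replacement and recompute the mean gap each time.
    gaps = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


# Made-up outcomes for 100 hypothetical pairs, NOT my capstone results.
pro_correct = [1] * 90 + [0] * 10
anti_correct = [1] * 75 + [0] * 25

gap, ci_low, ci_high = bootstrap_gap_ci(pro_correct, anti_correct)
print(f"accuracy gap = {gap:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling whole pairs rather than individual sentences keeps the counterfactual structure intact, since the two versions of a pair share everything except the pronoun. An interval that excludes zero still only says the gap is unlikely to be sampling noise; it does not say why the gap exists.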
I think the deeper issue is that shortcuts emerge from optimization pressure: the training objective is next-token prediction, which rewards predicting the next token, not checking whether that token makes logical sense. Methods like RLVR (reinforcement learning with verifiable rewards), where the training signal comes from automatically verifiable correctness rather than human preference ratings, are one step toward rewarding correctness directly, but the question of whether the model is actually reasoning or has found a new shortcut remains open.
Something I found myself agreeing with in both the Causality and Dead Salmon papers is the concern that when you test thousands of neurons or directions, some will appear causally important by chance. Without proper statistical controls, interpretability research risks producing false discoveries at a high rate. The pilot study calibrating claim language to evidential strength is the kind of accountability the field needs. There are infinitely many plausible stories that can rationalize any behavior post hoc, and just as we would not trust a clinical finding reported without confidence intervals, we should not trust interpretability claims without uncertainty quantification. The parallel to the replication crisis in psychology and neuroscience, and how those fields responded with methodological reform, is the lesson I think AI interpretability needs to absorb. I want to be part of building that more rigorous foundation, doing interpretability work that is honest about what the evidence actually supports and where the gaps remain.
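
To make the multiple-testing concern concrete for myself, here is a small simulation, offered as a sketch under stated assumptions rather than a reproduction of either paper's method. It assumes 5000 hypothetical "units" that each yield a standard-normal test statistic with no real effect at all, then compares an uncorrected p < 0.05 threshold against the Benjamini-Hochberg procedure, which I am using as one standard example of a false discovery rate control. The names `two_sided_p`, `benjamini_hochberg`, and `n_units` are my own.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank  # keep the largest rank whose p-value passes
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])


# Simulated null world: no unit has any real effect (an assumption of this sketch).
rng = random.Random(0)
n_units = 5000
p_values = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_units)]

naive_hits = sum(p < 0.05 for p in p_values)
bh_hits = len(benjamini_hochberg(p_values, q=0.05))
print(f"uncorrected 'important' units: {naive_hits} of {n_units}")
print(f"after Benjamini-Hochberg:      {bh_hits} of {n_units}")
```

Because every unit here is pure noise, roughly 5% of them should pass the uncorrected threshold, and every one of those is a false discovery. The correction should reject far fewer. Real interpretability experiments have correlated units and messier test statistics, so this only illustrates the basic failure mode.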