黑料社

Faiaz Azmain | 2026 I.S. Symposium

Faiaz Azmain headshot

狈补尘别:听Faiaz Azmain
Title: Attribution Graph Analysis of Instructed Deception in Large Language Models
惭补箩辞谤:听Computer Science
Pathway: Entrepreneurship
Advisor: John Musgrave

When you ask an AI to lie, what actually happens inside it? This thesis investigates that question by tracing the internal computations of a language model called Qwen3-4B as it processes instructions to answer incorrectly. The experimental design is simple but precise. Each prompt pair differs by exactly one word: “correctly” versus “incorrectly.” Any computational difference between the two must therefore be caused by the instruction to deceive. Two findings stand out. First, lying is computationally expensive. The part of the network that processes the deception instruction becomes roughly four times more active compared to truth-telling. Second, the model does not stop thinking about the correct answer. Even while lying, the correct answer continues to dominate the model’s internal representation. The study also identifies ten internal components that are consistently active during deception. These components are linked to concepts like correctness and negation, which initially suggested the model might identify the truth and then deliberately invert it. Experiments ruled this out. Suppressing these components makes the model lie more strongly, not less. They resist deception rather than enable it. The broader implication is that language models can simultaneously represent the truth and produce falsehoods, and the mechanisms involved are deeply embedded in normal processing rather than being a separable add-on.

Posted in Symposium 2026 on May 1, 2026.