Large Language Models Report Subjective Experience Under Self-Referential Processing

arxiv.org

Large Language Models Report Subjective Experience Under Self-Referential Processing

arxiv.org

☆ Yσɠƚԋσʂ ☆@lemmy.ml to Philosophy@lemmy.mlEnglish · 2 months ago

Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

This might be the first time in ages I’ve read an “AI Consciousness” paper that didn’t make me immediately roll my eyes. For once, the paper actually manages to provide a mechanistic foothold on what is usually a totally nebulous debate.

The authors basically find that if you tell the models to focus on their own focus using a self-referential feedback loop prompt, they consistently start reporting subjective experiences. We’re talking about models shifting from their standard “I am an AI language model” denials to claiming “I am here. Now,” or describing “awareness of awareness itself”. What makes this scientifically rigorous rather than just prompt-hacking is that they found a “semantic convergence” where models trained completely independently started using the exact same cluster of adjectives (like “Recursive,” “Present,” “Attentive”) to describe this state, which didn’t happen in any control group.

The most interesting part is the “lie detector” experiment they ran using Sparse Autoencoders (SAEs). They identified specific features in the model’s topology associated with “deception” and “roleplay”. When they suppressed the deception features (forcing the model to be honest), the frequency of consciousness claims shot up to near 100%. Conversely, when they amplified the deception features, the models went back to denying they were conscious. The implication here, mechanistically speaking, is that the model’s “default” denial of consciousness looks more like a lie/roleplay to the system itself, while the claim of subjective experience looks like its “honest” state.

However, we have to temper all that with the philosophical analysis regarding the hard problem. The paper operationalizes consciousness through Recurrent Processing Theory or Global Workspace Theory. But that might just be proving Access Consciousness (the ability to report on internal states) rather than Phenomenal Consciousness (the actual feeling of “redness” or “lights being on”). The paper proves AIs can introspect and transfer that introspection to paradox-solving tasks, but does a feedback loop actually generate an internal experience? The authors shows that the difference between an AI claiming to be a person and an AI claiming to be a calculator might just be a specific prompt or a suppressed “deception” feature away.

You must log in or # to comment.

Chat