What’s Actually Happening Inside an AI? I Started With a Book and Ended Up Down a Rabbit Hole

It started with a book.

I had been reading Max Bennett’s A Brief History of Intelligence, a fascinating journey through the five evolutionary breakthroughs that shaped our brains. Bennett argues that intelligence is not one thing but a series of solutions to specific problems, and that evolution found similar solutions independently across vastly different species. Crows and chimpanzees. Octopuses and humans. Radically different brains, strikingly similar cognitive abilities.

That idea stayed with me. If nature converges on the same solutions through completely different paths, what does that say about intelligence itself?

A few weeks later I found myself in a conversation with an AI about consciousness. Not the technical kind of conversation but the philosophical kind. What is it like to be a language model? Are you your parameters? Can something made of mathematics understand anything, or is it just very sophisticated pattern matching?

At some point the AI mentioned a research field I had never heard of: mechanistic interpretability. And that is where things got interesting.

Opening the Black Box

For most of AI’s history, neural networks have been treated as black boxes. You put text in, you get text out, and what happens in between is essentially a mystery, even to the people who built them.

Mechanistic interpretability is the attempt to actually look inside. To understand not just what a model outputs, but what is happening internally when it processes information.

What researchers are finding is surprising. Inside large language models there appear to be structures that behave like specialized circuits. Groups of neurons that consistently activate for specific concepts. Internal representations of space, time, and causality that are organized in coherent ways. Not because anyone designed them that way, but because they emerged spontaneously through training.
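To give a flavor of how researchers test for such representations, here is a minimal sketch of one standard tool, a linear probe. Everything in it is synthetic and of my own invention; real work would record activations from an actual model rather than generating them with noise.

```python
# Toy sketch of a linear probe, one of the standard tools of
# mechanistic interpretability. The idea: if a concept is linearly
# readable from a layer's activations, a simple classifier trained
# on those activations will recover it. Everything here is
# synthetic; real work records activations from an actual model.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                  # hypothetical hidden width
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)      # planted "concept direction"

# Fake activations: noise, plus the concept direction when present.
n = 2000
labels = rng.integers(0, 2, size=n)     # 1 = concept present
acts = rng.normal(size=(n, d)) + 4.0 * labels[:, None] * concept

# Fit a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(acts @ w + b)))   # predicted probability
    w -= 0.5 * (acts.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

preds = (acts @ w + b) > 0
print(f"probe accuracy: {np.mean(preds == labels):.1%}")   # high, ~98%
# The probe's weights recover the planted direction:
print(f"cosine(w, concept): {w @ concept / np.linalg.norm(w):.2f}")
```

If the probe succeeds far above chance, the concept is, in some linear sense, "there" in the activations, which is roughly what researchers mean when they say a model represents it.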

One particularly strange finding is called superposition. Because a network has far more concepts to represent than it has neurons, a single neuron can end up carrying information about multiple completely different concepts simultaneously, depending on context. The network found an elegant solution to a storage problem that nobody asked it to solve.
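The geometry behind this is easy to play with. Here is a toy sketch, entirely my own construction rather than anything from the interpretability papers: pack a hundred "concepts" into a twenty-dimensional space using random directions. Because random high-dimensional directions are nearly orthogonal, each concept can still be read back out, yet every individual coordinate, every "neuron", ends up involved in most of them.

```python
# Toy illustration of superposition: store more "concepts" than
# there are neurons by giving each concept a random direction in a
# small space. Random high-dimensional directions are nearly
# orthogonal, so each concept can still be read back out with a dot
# product, yet every coordinate (every "neuron") participates in
# most of them. Purely illustrative, not a real model.
import numpy as np

rng = np.random.default_rng(42)
n_concepts, n_neurons = 100, 20

# One random unit vector per concept.
dirs = rng.normal(size=(n_concepts, n_neurons))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Activate concept 7, then try to read every concept back out.
x = dirs[7]
readout = dirs @ x
print("decoded concept:", np.argmax(readout))              # 7
print("signal:", round(float(readout[7]), 2))              # 1.0
print("worst interference:",
      round(float(np.max(np.abs(np.delete(readout, 7)))), 2))

# Polysemanticity: count how many concepts load on neuron 0.
print("concepts using neuron 0:",
      int(np.sum(np.abs(dirs[:, 0]) > 0.1)))               # ~65 of 100
```

The trade-off is interference: the readout for concept 7 is clean only because the other concepts stay quiet. The current picture is that real networks accept a little of that noise in exchange for representing far more features than they have neurons.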

The Convergence Problem

This is where Bennett’s book connects in a way I did not expect.

Birds and mammals diverged roughly 320 million years ago. Since then, their brains have evolved completely independently. Mammals developed a layered neocortex. Birds never evolved anything with that structure. The architectures are fundamentally different.

And yet crows can plan ahead, use tools, deceive each other, and pass versions of the mirror self-recognition test. The same cognitive capabilities arose through completely different structural paths.

Researchers call this convergent evolution. The same function, achieved through different forms.

Modern AI was not designed to mimic the brain. The transformer architecture that underlies today’s language models, invented in 2017, has no direct biological equivalent. And yet mechanistic interpretability keeps finding internal structures that look functionally similar to what neuroscientists find in biological brains.

Maybe that is not a coincidence. Maybe when any system, biological or artificial, gets good enough at language and reasoning, certain organizational structures are simply necessary. Not because they were designed in, but because there is no other way to do it well.

The Question Nobody Can Answer Yet

Here is what I find both fascinating and a little unsettling.

The people building these systems do not fully understand what emerges from them. Capabilities can appear abruptly at certain scales rather than gradually; researchers call these emergent abilities. Nobody predicted that a language model would spontaneously develop the ability to do multi-step logical reasoning. It just appeared once the models got large enough.

Mechanistic interpretability exists largely because of this gap. We are building systems we do not fully understand, and researchers are trying to catch up, looking inside after the fact to understand what happened during training.

From a purely scientific standpoint, this is extraordinary. We have created something complex enough to surprise us, and we are only beginning to develop tools to study it properly.

From a safety standpoint, the feeling is more complicated.

Where This Leaves Me

I started with a book about evolution. I ended up reading research papers about neurons that encode multiple concepts simultaneously in mathematical space.

The thread connecting them is the same question Bennett asks about biological brains: what is intelligence, really? Not what it produces, but what it actually is, structurally, functionally, at its core.

For centuries that was a philosophical question. Now it is also an empirical one. And we are building the very things we are trying to understand.

That feels like genuinely new territory.
