We are finally beginning to understand how LLMs work
Full title: We are finally beginning to understand how LLMs work: No, they don't simply predict word after word. Anthropic just ran Claude through a brain scanner.
In context: The constant improvements AI companies have been making to their models might lead you to think we've finally figured out how large language models (LLMs) work. But nope: LLMs remain one of the least understood mass-market technologies ever. Anthropic is attempting to change that with a new technique called circuit tracing, which has helped the company map out some of the inner workings of its Claude 3.5 Haiku model.
Circuit tracing is a relatively new technique that lets researchers track, step by step, how an AI model builds up an answer, much like following the wiring in a brain. It works by chaining together different components of the model. Anthropic used it to spy on Claude's inner workings, and it revealed some truly odd, sometimes inhuman ways of arriving at an answer that the bot wouldn't even admit to using when asked.
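To get a feel for the kind of analysis involved, here is a minimal sketch in the spirit of the idea, not Anthropic's actual method: a tiny made-up network whose output is attributed, feature by feature, to the internal activations that drive it. The network, its random weights, and the notion of "features" here are all invented for illustration.

```python
import numpy as np

# Toy illustration only: a two-layer network where we ask which intermediate
# features push a given output up or down, then keep the strongest contributors.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # input -> hidden "features"
W2 = rng.normal(size=(8,))     # hidden -> a single output logit

x = rng.normal(size=(4,))
h = np.maximum(W1 @ x, 0.0)    # hidden feature activations (ReLU)
logit = W2 @ h                 # the model's output

# Because the readout is linear in h, each feature's contribution is exact:
contributions = W2 * h                       # contribution of feature j = W2[j] * h[j]
assert np.isclose(contributions.sum(), logit)

# "Trace the circuit": rank features by how much they drive this particular output.
order = np.argsort(-np.abs(contributions))
for j in order[:3]:
    print(f"feature {j}: activation {h[j]:+.2f}, contribution {contributions[j]:+.2f}")
```

Anthropic's real pipeline is far more elaborate than this, but the flavor is similar: identify which internal pieces of the model are responsible for a specific answer and follow the chain between them.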
All in all, the team inspected 10 different behaviors in Claude. Three stood out.
One was pretty simple and involved answering the question "What's the opposite of small?" in different languages. You'd think Claude might have separate components for English, French, or Chinese. But no: it first figures out the answer (something related to "bigness") using language-neutral circuits, then picks the right words to match the language of the question.
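A toy sketch of that two-step pattern might look like the following; the concept labels and the tiny vocabularies are invented for illustration, and real circuits are nothing this tidy.

```python
# Step 1: resolve the answer in a language-neutral "concept space".
# Step 2: render that concept in whatever language the question used.

ANTONYMS = {"SMALL": "LARGE"}  # language-neutral concept space

LEXICON = {
    "en": {"SMALL": "small", "LARGE": "big"},
    "fr": {"SMALL": "petit", "LARGE": "grand"},
    "zh": {"SMALL": "小",    "LARGE": "大"},
}

def opposite_of_small(language: str) -> str:
    concept = ANTONYMS["SMALL"]          # figure out the answer once, abstractly
    return LEXICON[language][concept]    # then pick the word for that language

for lang in ("en", "fr", "zh"):
    print(lang, opposite_of_small(lang))  # big / grand / 大
```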
The second example was the very strange way Claude adds two numbers together. According to the tracing, the model doesn't carry digits the way a person would; instead, it runs parallel pathways, one producing a rough estimate of the sum's size and another working out which digit the answer must end in, then combines the two into the exact result. "However, if you ask Claude how it solved the problem, it'll confidently describe the standard grade-school method, concealing its actual, bizarre reasoning process."
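A loose caricature of that two-pathway trick, with made-up function names and example values, could look like this:

```python
def reconcile(rough_estimate: float, last_digit: int) -> int:
    """Return the integer ending in `last_digit` that is closest to `rough_estimate`.

    This only recovers the exact sum when the rough pathway lands within 4 of it;
    it's an illustration of combining the two signals, not a general adder.
    """
    base = int(rough_estimate) - (int(rough_estimate) - last_digit) % 10
    return base if rough_estimate - base <= (base + 10) - rough_estimate else base + 10

# For 36 + 59: a fuzzy magnitude pathway might only say "somewhere in the nineties",
# while a digit pathway knows 6 + 9 means the answer must end in 5.
print(reconcile(92, 5))   # 95
```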
The third example, poetry, is even stranger. When Claude writes a rhyming couplet, the tracing showed it choosing the word it wants the next line to end on, the rhyme, before writing the line at all, then composing the rest of the line to lead up to it. That's planning ahead, not simple word-after-word prediction.
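A crude way to picture that "choose the ending first" behavior, using an invented rhyme table and template:

```python
# Toy illustration only: decide the rhyming end-word up front,
# then build the rest of the line to lead toward it.

RHYMES = {"carrot": ["parrot", "garret"], "light": ["night", "kite"]}

def next_line(previous_end_word: str) -> str:
    target = RHYMES[previous_end_word][0]          # plan the ending first
    return f"and there beside it perched a {target}"  # then write toward it

print(next_line("carrot"))  # ...perched a parrot
```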
We're on the verge of giving LLM-based AI control of a lot of things, and we know practically nothing about how they do what they do!