When I took the time this past week to build a small Transformer from scratch (coding the attention heads, backpropagation, FFNs, and masking from the ground up), I arrived at a realization the interpretability community has championed for years: transformers do not think in attention. Attention merely decides where to look; the actual knowledge and computation live in the Feed-Forward Networks (FFNs).
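For concreteness, the core of what I built looks roughly like the sketch below: a single attention head followed by an FFN, in plain numpy, with layer norm and the training loop omitted. The names and shapes are illustrative, not my exact code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, mask=None):
    # Routing step: each position decides which other positions to read from.
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # x: (seq, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:                        # e.g. a causal or padding mask
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def ffn(x, W1, b1, W2, b2):
    # Memory step: the same position-wise transform applied to every token.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def block(x, attn_params, ffn_params):
    x = x + attention(x, *attn_params)          # residual around attention
    x = x + ffn(x, *ffn_params)                 # residual around the FFN
    return x
```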
This was not a theory I read, but a principle I observed firsthand: As the model trained, the attention heads behaved precisely like advanced information routers. Their role was to align tokens, copy positional information, and direct the flow of data across layers. But they proved resistant to storing anything substantial. I experimented with reshaping and sparsifying the heads, and the model retained nearly all of its basic, factual knowledge.
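The head experiments amounted to something like the following sketch: compute each head separately, zero out a subset before the output projection, and re-run the factual-recall prompts. This reuses the `attention` helper from the sketch above; `keep` and the other names are my own, illustrative choices.

```python
import numpy as np

def multi_head_attention(x, heads, Wo, keep=None):
    # heads: list of (Wq, Wk, Wv) tuples, one per head; keep: indices of heads to retain.
    outs = []
    for i, (Wq, Wk, Wv) in enumerate(heads):
        h = attention(x, Wq, Wk, Wv)            # per-head routing, as sketched above
        if keep is not None and i not in keep:
            h = np.zeros_like(h)                # ablate this head entirely
        outs.append(h)
    return np.concatenate(outs, axis=-1) @ Wo   # concatenate heads, then project

# e.g. keep only heads 0 and 3 and re-check factual recall:
# multi_head_attention(x, heads, Wo, keep={0, 3})
```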
In contrast, the FFNs acted as massive, dense associative memory tables. Every simple factual mapping and crisp pattern I embedded, from associating a country with its capital to assigning a word its part of speech, was encoded inside these MLP layers. This matches what research has shown: the MLPs learn compact key-value lookup circuits.
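The key-value picture is easy to see in code. In this hedged sketch, the columns of the first FFN matrix act like keys that the token's hidden state is matched against, and the rows of the second matrix are the values that get mixed into the output; shapes and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256
W1 = rng.normal(size=(d_model, d_hidden))       # each column behaves like a key
W2 = rng.normal(size=(d_hidden, d_model))       # each row is the value it retrieves

x = rng.normal(size=(d_model,))                 # one token's hidden state
scores = np.maximum(0, x @ W1)                  # how strongly x matches each key
out = scores @ W2                               # weighted sum of value vectors

# The same computation written as an explicit table lookup:
out_explicit = sum(scores[i] * W2[i] for i in range(d_hidden))
assert np.allclose(out, out_explicit)
```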
The contrast became dramatic when I intentionally hobbled the system. When I tried to force the model to rely only on the routing capabilities of attention by pruning the FFNs, the knowledge collapsed entirely. Attention alone resulted in excellent context location but horrible factual recall. With the FFNs intact, retrieval was sharp and reliable. Pruning them caused the model to suddenly "forget" everything. Even observing the <MASK> prediction behavior made this obvious: attention could locate the necessary context, but the final, definitive answer always emerged from the FFNs.
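The pruning comparison was essentially this: run the same masked-token prompts with and without the FFN branch and compare top-1 recall. Below is a rough sketch with placeholder data structures, reusing the `multi_head_attention` and `ffn` helpers above; it is not a real library API.

```python
def forward(x, blocks, use_ffn=True):
    # blocks: list of dicts holding per-layer weights (placeholder structure).
    for blk in blocks:
        x = x + multi_head_attention(x, blk["heads"], blk["Wo"])
        if use_ffn:
            x = x + ffn(x, *blk["ffn"])         # drop this branch to "prune" the FFNs
    return x

def masked_recall(prompts, answers, blocks, W_vocab, use_ffn=True):
    # prompts: embedded sequences with the masked slot last; answers: target token ids.
    hits = 0
    for x, target in zip(prompts, answers):
        h = forward(x, blocks, use_ffn=use_ffn)
        pred = int((h[-1] @ W_vocab).argmax())  # read the prediction at the masked slot
        hits += int(pred == target)
    return hits / len(prompts)

# masked_recall(..., use_ffn=False) is the "attention only" condition: the routing
# still works, but the factual answer is gone.
```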
This experience gave me a profound respect for the dynamics of neural network training. Watching the optimization gradients carve meaning into these dense matrices in real time reveals why the model offloads logic into the MLPs: they are stable, efficient for optimization, and perfectly suited for storing nonlinear patterns. Once you observe this happening, it becomes impossible to unsee.
The takeaway is clear: Transformers do not store knowledge in their attention mechanisms. They store knowledge in their MLPs. Attention is simply the high-speed routing layer that selects which slice of learned knowledge to read.
While this may not be new in the academic sense, building one's own model makes this architectural truth profoundly real. I highly recommend the exercise to anyone working deeply with LLMs, fine-tuning, or the future of retrieval systems.
The cognitive velocity gap is the distance between how fast AI can produce artifacts (code, text, summaries, decisions) and how fast humans can reason about, verify, and understand them.
AI can generate outputs in seconds that would take humans hours. But comprehension, validation, and judgment don’t scale at the same speed. When output velocity outpaces reasoning velocity, mistakes compound:
Code that “looks correct” but solves the wrong problem
Summaries trusted without reading the source
Decisions made on outputs nobody fully understands
Teams repeating mistakes they thought were solved
We see it in software, writing, business decisions, and education. AI accelerates production, but not understanding.
I’m curious: has anyone tried to measure or mitigate this gap in real teams? What strategies help human reasoning keep pace with AI outputs?
I don't think amateur researchers have disappeared - they're evolving how they discover, process, and share knowledge. And we need both traditional deep reading and newer modalities to accommodate different learning contexts and preferences.
In just a few minutes, you can deploy a solution that integrates world-class open-source projects like Dispatch by Netflix, Timesketch by Google, GoAlert by Target, and Uptime Kuma by Louislam. We've enriched these with AI, added our personal touch, and made significant enhancements. Setting it up is a breeze (with just one command), and for those curious about the internals, feel free to dive into the 500,000+ lines of code that tie everything together.
Thanks for sharing insights about Transposit. It's great to see other players in the space acknowledge the challenges and opportunities that come with integrating AI into incident management. The "human-in-the-loop" approach is indeed a valuable mechanism to strike a balance.
With UnStruct AI, our goal has always been to seamlessly augment human capabilities without overwhelming them. Your point about AI potentially being "noisy" resonates with our observations. It's precisely why we're seeking feedback and engaging in conversations like this - to understand where the true value lies and to refine our product accordingly.
I think there is so much more that can be done in this space. The potential - if done right and we solve the right use cases and pain points - is huge.
Would love to hear more thoughts from the community on this. When do you think AI-driven interventions are most appropriate in the incident management lifecycle?
Many aspects of life can be represented as timelines. Yet traditional timelines can be tedious, requiring constant scrolling and effort. Imagine if there were a more efficient approach. I've experimented with a novel concept that isn't flawless but offers a promising start.
UnStruct AI founder here. We all know that AI can be a double-edged sword. For instance, I've developed a feature (in beta) that suggests tasks/follow-ups based on what people say in Slack. While it offers convenience, it can occasionally become spammy, and during high-pressure moments, might even prove distracting.
How do you feel about deploying such tools, especially in high-tempo situations like incident management? How do you strike a balance? I'm wondering if these tools might be more appropriate for after-action reviews once the dust has settled?