Progress in artificial intelligence has come in the form of a bag of tricks, some of which have proven to be unreasonably effective: word embeddings as a way of representing semantic similarity; the attention mechanism for contextualization; self-supervised learning on curated datasets; instruction tuning of models; reinforcement learning based on human feedback. The key to making these tricks “unreasonably effective” has turned out to be scale, such as taking the number of parameters in models up by several orders of magnitude.
Despite the amazing recent progress, many applications will require further step function improvements in accuracy, reliability and creativity. Is even more scale all we need? This is one way to read Rich Sutton’s famous “The Bitter Lesson” essay. Will more scale give us the levels of accuracy and reliability, including absence of hallucinations, required for legal work or medical advice? Will more scale provide the creativity to create new scientific knowledge, such as novel proofs?
Or might we need some more tricks? And would those be entirely new tricks, or ones we already know but haven’t yet fully integrated into our current bag? Another way to read the essay is simply as an exhortation to look for methods that can scale arbitrarily with computation (Sutton calls out search and learning). Of course, it could also turn out that we need some tricks that don’t rely on scaling in quite the same way, or at all.
Here is a list of known candidates that seem promising:
1. Learning in simulated environments
Humans learn a lot from living in the world. Robots exploring real or simulated worlds are learning more about causality. Simulation is already being used extensively in training models for autonomous vehicles. There is also work on more general simulation, such as ThreeDWorld out of MIT (original paper from 2020). Simulation can scale arbitrarily with computation and thus fits with Sutton’s conclusion. A recent fun and compelling result by Jim Fan is a robot dog walking on top of an exercise ball simply based on simulation.
2. Multi-agent learning
Much of human behavior and learning arises from interactions with other humans. Multi-agent environments are showing promise for similar effects among models, as recently demonstrated in a sandbox inspired by The Sims (paper and code). Multi-agent learning, too, can keep scaling with computation and can also leverage simulation.
3. Reliable use of tools
One of the earliest departures for humans from (most) other species comes in our use of tools. Today very few people would attempt multiplying large numbers in their head. We would use a calculator instead. A promising direction is training models to reliably know when and how to use such tools (see for example the Toolformer paper). This feels less like a trick that can scale with computation and more like a way to reliably leverage all of our existing investment in computation.
4. Explicit structured knowledge
Prior to the breakthroughs with LLMs, a lot of work had gone into explicit knowledge representations in the form of ontologies, knowledge graphs, rule-based systems and the like. Accessing those kinds of systems could be seen as a specialized case of tool use but also seems to have additional potential. UnifiedSKG is one way to bring a lot of so-called structured knowledge grounding techniques into models (paper and code). Sutton explicitly called out attempts to build knowledge into agents as having been outperformed by statistical methods. But human-machine collaboration might dramatically grow the extent of explicit knowledge.
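To make the tool-use idea from items 3 and 4 a bit more concrete, here is a minimal sketch in Python. It is not how Toolformer or UnifiedSKG work internally. It assumes a hypothetical convention in which a model emits inline markers such as `[Calculator(12*12)]` or `[Lookup(water, boiling_point_celsius)]`, and a thin wrapper replaces each marker with the tool's result. The marker format, the `resolve_tool_calls` function, and the toy `FACTS` store are all illustrative assumptions.

```python
import ast
import operator
import re

# Hypothetical marker format (an assumption for this sketch): the model
# is assumed to emit text like "[Calculator(123*456)]".
CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Evaluate +, -, *, / on numeric literals without calling eval()."""
    def visit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return visit(ast.parse(expression, mode="eval").body)


# Toy stand-in for an explicit knowledge store; the entry is illustrative.
FACTS = {("water", "boiling_point_celsius"): 100}


def lookup(query: str):
    """Look up a 'subject, relation' pair in the toy fact store."""
    subject, relation = [part.strip() for part in query.split(",")]
    return FACTS[(subject, relation)]


TOOLS = {"Calculator": calculator, "Lookup": lookup}


def resolve_tool_calls(model_output: str) -> str:
    """Replace each recognized tool marker with the tool's result."""
    def run(match):
        name, argument = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        if tool is None:
            return match.group(0)  # unknown tool: leave the text untouched
        try:
            return str(tool(argument))
        except (ValueError, KeyError, SyntaxError, ZeroDivisionError):
            return match.group(0)  # failed call: leave the marker in place
    return CALL_PATTERN.sub(run, model_output)


print(resolve_tool_calls("12 * 12 = [Calculator(12*12)]"))  # prints "12 * 12 = 144"
```

The design choice worth noting is that the model only has to decide when to call a tool and with what argument. The arithmetic or the fact retrieval itself is handled by ordinary, deterministic code, and a knowledge store plugs in through the same interface as the calculator.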
An “unreasonably effective” AI-generated video**
These are just four examples of tricks that we already know and that are being actively researched. There is a ton more to explore, including the effects of adding different types of memory, bringing in techniques from search, or wrapping calls to models in workflows.
All of this research is exciting because of the significant gains achieved by these “known tricks” even in early and still relatively crude versions. This progress raises the tantalizing possibility that we can get much further with current and next generation open source models (*). With the right “core” there could be an ecosystem akin to what we have with open source operating systems (that’s one way to interpret Andrej Karpathy’s tweet about an LLM OS).
It is also still possible that we will encounter some fundamental limitation with autoregressive models as the core building block. That would mean we need new tricks, some of which might be quite fundamental. We may, for example, have to find more complex objective functions for pre-training. That could push in the opposite direction, making training more expensive rather than less.
Why does all of this matter? From a market structure and innovation perspective it would be fantastic if the cost of training runs doesn’t continue to explode. Exploding costs would heavily favor a highly concentrated market dominated by a few closed source systems (they would be closed source to recoup the cost of the training runs).
We are excited about our existing AI portfolio including Hume, Abridge, Clarifai, Casetext (already exited), and several other recent investments we have yet to announce. Based on the above, we also believe that a lot remains to be figured out. We are partial to composable open systems where these “tricks” don’t need to all be figured out by one company but can be contributed to by many. These systems might well be specific to a problem set, such as medical diagnosis or materials discovery.
* There have been a lot of important arguments from a safety and x-risk perspective against pushing open source systems further. This post can’t possibly address these but it is worth noting that many of the tricks mentioned above might allow the core system to do less, which would increase safety and reduce risk.
**The above video was made with almost no human input. We fed this blog post into Claude and used a single prompt to develop a corresponding trailer with a 30-second script, style guidelines, formatted B-roll prompts, and a music track prompt. The results were copied and pasted into RunwayML, Suno, and Eleven Labs and then stitched together without editing. We took it upon ourselves to add a title at the end.