When will LLMs have their AlphaZero moment?
Thoughts on the next chapter of LLMs
A long time ago, when I first saw the AlphaGo documentary, I was blown away by how fast DeepMind went from AlphaGo to AlphaZero. As impressive as AlphaGo was, it was still trained on human-played games, and it evaluated candidate moves partly by predicting whether a professional human would make the play it was considering. That led to a great, genuinely competitive match against Lee Sedol, which AlphaGo won 4–1, though it did lose a game. AlphaZero, by contrast, was built from scratch with just the rules of the game and the ability to play itself and learn from those games, reaching capabilities humans simply can't match.
With LLMs all trained on massive corpora of human-written content, and given that they are, at their core, prediction machines built to please, I can't help but draw a parallel between AlphaGo and AlphaZero. I know they are not the same technology, but what would "giving the rules of the game" to an LLM even entail? How would one evaluate performance and retrain based on those results? Is RL applied after a model is already trained the only way to go, or could we train a model to speak and form thoughts by teaching it the language and its rules, rather than by showing it examples?
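To make the "rules as the only teacher" idea concrete, here is a deliberately tiny, hypothetical sketch, nothing like a real training stack. The "rules of the game" are a verifier that checks arithmetic, the "model" is a policy choosing between two answer strategies, and the only learning signal is the verifier's reward, never human examples. All names here (`verifier`, `p_add`, the flawed strategy) are made up for illustration.

```python
import random

def verifier(a, b, answer):
    """The 'rules of the game': reward 1.0 if the answer is correct, else 0.0."""
    return 1.0 if answer == a + b else 0.0

def make_task():
    # Self-generated tasks, analogous to self-play: no human data involved.
    return random.randint(0, 9), random.randint(0, 9)

def act(a, b, use_add):
    # Two strategies: the correct one (add) and a flawed one (subtract).
    return a + b if use_add else a - b

# Policy parameter: probability of choosing the correct strategy.
p_add = 0.5
lr = 0.05

for _ in range(2000):
    a, b = make_task()
    use_add = random.random() < p_add
    reward = verifier(a, b, act(a, b, use_add))
    # REINFORCE-style update: nudge probability toward rewarded choices,
    # using the current probability as a crude baseline.
    grad = (1.0 if use_add else -1.0) * (reward - p_add)
    p_add = min(1.0, max(0.0, p_add + lr * grad))

print(f"P(correct strategy) after learning from the rules alone: {p_add:.2f}")
```

The point of the toy is that the policy converges toward the correct strategy with no labeled examples, only a rule-based reward, which is the spirit of the AlphaZero recipe; scaling that kind of verifiable reward to open-ended language is exactly the open question.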
I am really curious to see whether LLMs get their AlphaZero moment; that will be the moment to watch for.