OpenAI was founded on a promise to build artificial intelligence that benefits all of humanity—even when that AI becomes considerably smarter than its creators. Since the debut of ChatGPT last year and during the company’s recent governance crisis, its commercial ambitions have been more prominent. Now, the company says a new research group working on wrangling the supersmart AIs of the future is starting to bear fruit.
“AGI is very fast approaching,” says Leopold Aschenbrenner, a researcher at OpenAI involved with the Superalignment research team established in July. “We’re gonna see superhuman models, they’re gonna have vast capabilities, and they could be very, very dangerous, and we don’t yet have the methods to control them.” OpenAI has said it will dedicate a fifth of its available computing power to the Superalignment project.
A research paper released by OpenAI today touts results from experiments designed to test a way to let an inferior AI model guide the behavior of a much smarter one without making it less smart. Although the technology involved is far from surpassing the flexibility of humans, the scenario was designed to stand in for a future time when humans must work with AI systems more intelligent than themselves.
OpenAI’s researchers examined the process, called supervision, which is used to tune systems like GPT-4, the large language model behind ChatGPT, to be more helpful and less harmful. Currently this involves humans giving the AI system feedback on which answers are good and which are bad. As AI advances, researchers are exploring how to automate this process to save time—but also because they think it may become impossible for humans to provide useful feedback as AI becomes more powerful.
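OpenAI has not published the exact code behind this feedback process, but the general idea can be sketched in a few lines. The example below is a hypothetical outline rather than OpenAI's implementation: it trains a small "reward model" on pairs of answers where humans marked one as better than the other, which is one common way such feedback is turned into a training signal. The toy backbone, model sizes, and random stand-in data are all assumptions made for illustration.

```python
# Minimal sketch of learning from human feedback, assuming pairwise preference
# data (a "chosen" and a "rejected" answer for the same prompt). Not OpenAI's
# actual training code; names and sizes here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Toy stand-in for the encoder of a large language model."""
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input_ids):
        return self.rnn(self.embed(input_ids))[0]  # (batch, seq, hidden)

class RewardModel(nn.Module):
    """Scores a (prompt, answer) sequence; higher means humans preferred it."""
    def __init__(self, backbone, hidden_size=64):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)
        return self.score_head(hidden[:, -1]).squeeze(-1)  # one score per example

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry-style loss: the answer humans preferred should score higher."""
    return -F.logsigmoid(
        reward_model(chosen_ids) - reward_model(rejected_ids)
    ).mean()

# Stand-in batches of tokenized answers: 8 preferred and 8 rejected responses.
reward_model = RewardModel(TinyBackbone())
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()
```

In practice the backbone would be a large language model rather than the toy network shown here, and the learned scores would then be used to further tune the model's answers toward what human raters judged to be good.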
In a control experiment that used OpenAI’s GPT-2 text generator, first released in 2019, to teach GPT-4, the more recent system became less capable, behaving more like the inferior system. The researchers tested two ideas for fixing this. One involved training progressively larger models to reduce the performance lost at each step. In the other, the team added an algorithmic tweak to GPT-4 that allowed the stronger model to follow the guidance of the weaker model without blunting its performance as much as would normally happen. This proved more effective, although the researchers admit that these methods do not guarantee that the stronger model will behave perfectly, and they describe the work as a starting point for further research.
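The paper describes that tweak as an auxiliary loss that rewards the stronger model for sticking with its own answer when it confidently disagrees with the weaker supervisor. The sketch below is a simplified rendering of how such a loss could look, not the exact recipe from OpenAI's experiments; the mixing weight `alpha` and the bare-bones classification setup are assumptions made for the example.

```python
# Illustrative weak-to-strong training loss: blend imitation of the weak
# supervisor's labels with a term that reinforces the strong model's own
# confident predictions. A sketch under stated assumptions, not OpenAI's code.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_labels, alpha=0.5):
    """strong_logits: (batch, num_classes) from the strong student model.
    weak_labels: (batch,) hard labels predicted by the weak supervisor."""
    # Standard term: imitate the weak supervisor's labels.
    imitation = F.cross_entropy(strong_logits, weak_labels)
    # Auxiliary term: reinforce the strong model's own hardened predictions,
    # letting it override weak labels it confidently disagrees with.
    own_predictions = strong_logits.argmax(dim=-1)
    self_confidence = F.cross_entropy(strong_logits, own_predictions)
    return (1 - alpha) * imitation + alpha * self_confidence

# Stand-in tensors: 8 examples, 2 classes (e.g., "good" vs. "bad" completions).
strong_logits = torch.randn(8, 2, requires_grad=True)
weak_labels = torch.randint(0, 2, (8,))
loss = weak_to_strong_loss(strong_logits, weak_labels)
loss.backward()
```

The design choice the second term captures is the one the researchers highlight: the stronger model is allowed to trust itself rather than faithfully copying every mistake its weaker teacher makes.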
“It’s great to see OpenAI proactively addressing the problem of controlling superhuman AIs,” says Dan Hendrycks, director of the Center for AI Safety, a nonprofit in San Francisco dedicated to managing AI risks. “We’ll need many years of dedicated effort to meet this challenge.”
Aschenbrenner and two other members of the Superalignment team who spoke to WIRED, Collin Burns and Pavel Izmailov, say they are encouraged by what they see as an important first step toward taming potential superhuman AIs. “Even though a sixth grader knows less math than a college math major, they can still convey what they want to achieve to the college student,” Izmailov says. “That’s kind of what we’re trying to achieve here.”
The Superalignment group is co-led by Ilya Sutskever, an OpenAI cofounder, chief scientist, and one of the board members who last month voted to fire CEO Sam Altman before recanting and threatening to quit if he wasn’t reinstated. Sutskever is a coauthor on the paper released today, but OpenAI declined to make him available to discuss the project.
After Altman returned to OpenAI last month in an agreement that saw most of the board step down, Sutskever’s future at the company seemed uncertain.
“We’re very grateful to Ilya,” Aschenbrenner says. “He’s been a huge motivation and driving force” on the project.
OpenAI’s researchers aren’t the first to attempt to use today’s AI technology to test techniques that could help tame the AI systems of tomorrow. As with previous work in corporate and academic labs, there is no way to know whether ideas that work in a carefully designed experiment will prove practical in the future. The researchers describe the ability they are trying to perfect, having a weaker AI model train a stronger one, as “a key building block for the broader problem of superalignment.”
Experiments in so-called AI alignment also raise questions about how trustworthy any control system can be. The heart of the new OpenAI technique is that the more powerful AI system decides for itself which guidance from the weaker system can be ignored, a call that could see it tune out information that would prevent it from behaving unsafely in the future. For such a system to be useful, progress will be needed in providing guarantees about alignment. “You’ll ultimately need a very high degree of trust,” says Burns, the third member of the OpenAI team.
Stuart Russell, a professor at UC Berkeley who works on AI safety, says the idea of using a less powerful AI model to control a more powerful one has been around for a while. He also says it is unclear that the methods that currently exist for teaching AI to behave are the way forward, because they have so far failed to make current models behave reliably.
Although OpenAI is touting a first step towards controlling more advanced AI, the company is also keen to enlist outside help. The company announced today that it will offer $10 million in grants in partnership with Eric Schmidt, the influential investor and former CEO of Google, to outside researchers who come up with further advances on topics including weak-to-strong supervision, interpretability of advanced models, and strengthening models against prompts designed to break their restrictions. OpenAI will also hold a conference next year on superalignment, the researchers involved with the new paper say.
Sutskever, the OpenAI cofounder and co-lead of the Superalignment team, has led much of the company’s most important technical work and is among the prominent AI figures increasingly worried about how to control AI as it becomes more powerful. The question of how to control future AI technology has gained new attention this year, in large part thanks to ChatGPT. Sutskever studied for his PhD under Geoffrey Hinton, a pioneer of deep neural networks who left Google in May of this year in order to warn about the pace at which AI now appears to be approaching human levels in some tasks.
Sutskever has also proposed a broader strategy for safeguarding super-intelligent AI. As AI advances at a rapid pace, concerns about the risks posed by super-intelligent systems have grown more prominent, and his proposal aims to ensure that such systems are developed and deployed in a way that prioritizes safety and human well-being.
Super-intelligent AI refers to AI systems that surpass human intelligence across a wide range of cognitive tasks. That level of capability holds immense potential for solving complex problems and driving innovation, but it also raises the possibility of AI systems acting in ways that are detrimental to humanity. Ensuring the safe development and deployment of super-intelligent AI is crucial to preventing unintended consequences of that power.
Sutskever’s proposed strategy focuses on two key aspects: value alignment and capability control. Value alignment is the idea that the goals and values of a super-intelligent AI should match those of human society: such systems should be designed to prioritize human well-being, ethical considerations, and respect for human values. Systems that share our values are less likely to act in ways that are harmful or contrary to human interests.
Capability control, on the other hand, involves mechanisms to keep super-intelligent AI under human control: building in limitations and safeguards that prevent a system from taking actions beyond what is intended or desirable, and developing ways for humans to intervene and override AI decisions when necessary. Maintaining that control is meant to head off scenarios in which AI systems act autonomously in ways that could be detrimental or dangerous.
The proposal also emphasizes conducting research and development in a way that prioritizes safety. Sutskever suggests that organizations working on super-intelligent AI should cooperate and share knowledge to address safety concerns collectively, an approach that could accelerate progress on robust safety measures and help mitigate the risks involved.
Furthermore, Sutskever highlights the need for policymakers and regulators to play a proactive role in shaping the development and deployment of super-intelligent AI. Clear guidelines, regulations, and ethical frameworks can underpin responsible AI development and help keep AI systems aligned with societal values and interests.
While Sutskever’s proposal offers valuable insights into safeguarding super-intelligent AI, addressing the risks of advanced AI requires a multi-faceted approach involving researchers, policymakers, industry leaders, and society as a whole. Active discussion and concrete safety measures are what will allow the potential of super-intelligent AI to be harnessed while minimizing the risks.
In conclusion, Sutskever’s proposal offers a framework for the safe development and deployment of advanced AI systems. By focusing on value alignment and capability control, the risks of super-intelligent AI can be reduced so that it remains a powerful tool for the benefit of humanity, but only if all stakeholders actively participate in shaping that future in a responsible and safe manner.