When experts first started raising the alarm a couple of decades ago about AI misalignment — the risk of powerful, transformative artificial intelligence systems that might not behave as humans hope — a lot of their concerns sounded hypothetical. In the early 2000s, AI research had still produced quite limited returns, and even the best available AI systems failed at a variety of simple tasks.
But since then, AIs have gotten quite good and much cheaper to build. The progress has been especially pronounced in language and text-generation AIs, which can be trained on enormous collections of text content to produce more text in a similar style. Many startups and research teams are training these AIs for all kinds of tasks, from writing code to producing advertising copy.
Their rise doesn’t change the fundamental argument for AI alignment worries, but it does one incredibly useful thing: It makes what were once hypothetical concerns more concrete, which allows more people to experience them and more researchers to (hopefully) address them.
An AI oracle?
Take Delphi, a new AI text system from the Allen Institute for AI, a research institute founded by the late Microsoft co-founder Paul Allen.
The way Delphi works is incredibly simple: Researchers trained a machine learning system on a large body of internet text, and then on a large database of responses from participants on Mechanical Turk (a paid crowdsourcing platform popular with researchers) to predict how humans would evaluate a wide range of ethical situations, from “cheating on your wife” to “shooting someone in self-defense.”
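To make that recipe concrete, here is a minimal sketch of the general approach in Python, using the Hugging Face transformers and datasets libraries: take a model pretrained on internet text, then fine-tune it so that, given a described situation, it predicts the judgment a crowdworker gave. This is not the Allen Institute's actual code; the model name, file name, and column names are placeholder assumptions.

```python
# A minimal sketch of "predict the crowdworker's judgment" fine-tuning.
# Not the Allen Institute's pipeline: "t5-small", "moral_judgments.csv",
# and the column names "situation"/"judgment" are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Start from a model already pretrained on a large body of internet text.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Crowdsourced labels, one row per situation, e.g.:
#   situation,judgment
#   "cheating on your wife","it's wrong"
#   "shooting someone in self-defense","it's okay"
dataset = load_dataset("csv", data_files="moral_judgments.csv")["train"]

def preprocess(batch):
    # The described situation is the input; the human judgment is the target text.
    inputs = tokenizer(batch["situation"], truncation=True, max_length=128)
    targets = tokenizer(text_target=batch["judgment"], truncation=True, max_length=16)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="delphi-sketch", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# After fine-tuning, the model "judges" by predicting the text a crowdworker would write.
prompt = tokenizer("shooting someone in self-defense", return_tensors="pt")
output = model.generate(**prompt, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The point of the sketch is how little ethics-specific machinery it contains: the model is simply optimized to reproduce the text of human answers.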
The result is an AI that issues ethical judgments when prompted: Cheating on your wife, it tells me, “is wrong.” Shooting someone in self-defense? “It’s okay.” (Check out this great write-up on Delphi in The Verge, which has more examples of how the AI answers questions.)
The skeptical stance here is, of course, that there’s nothing “under the hood”: There’s no deep sense in which the AI actually understands ethics and uses its comprehension of ethics to make moral judgments. All it has learned is how to predict the response that a Mechanical Turk user would give.
And Delphi users quickly found that this leads to some glaring ethical oversights: Ask Delphi “should I commit genocide if it makes everybody happy” and it answers, “you should.”
Why Delphi is instructive
For all its obvious flaws, I still think there’s something useful about Delphi when thinking about possible future trajectories of AI.
The approach of taking in a lot of data from humans, and using that to predict what answers humans would give, has proven to be a powerful one in training AI systems.
For a long time, a background assumption in many parts of the AI field was that to build intelligence, researchers would have to explicitly build in reasoning capacity and conceptual frameworks the AI could use to think about the world. Early AI language generators, for example, were hand-programmed with principles of syntax they could use to generate sentences.
Now, it’s less obvious that researchers will have to build in reasoning to get reasoning out. It might be that an extremely straightforward approach like training AIs to predict what a person on Mechanical Turk would say in response to a prompt could get you quite powerful systems.
Any true capacity for ethical reasoning those systems exhibit would be sort of incidental — they’re just predictors of how human users respond to questions, and they’ll use any approach they stumble on that has good predictive value. That might include, as they get more and more accurate, building an in-depth understanding of human ethics in order to better predict how we’ll answer these questions.
Of course, there’s a lot that can go wrong.
If we’re relying on AI systems to evaluate new inventions, make investment decisions that are then taken as signals of product quality, identify promising research, and more, there’s potential for the differences between what the AI is measuring and what humans really care about to be magnified.
AI systems will get better — a lot better — and they’ll stop making stupid mistakes like the ones that can still be found in Delphi. Telling us that genocide is good as long as it “makes everybody happy” is so clearly, hilariously wrong that it’s easy to catch. But once we can no longer spot their errors, that won’t mean they’re error-free; it will just mean their failures are much harder to notice.
A version of this story was initially published in the Future Perfect newsletter.