We coordinate research across multiple labs to create ML models that: (1) understand and support the meanings behind important human practices like science, democracy, education, leadership, and love, as well as what's meaningful to individual users; and (2) develop their own values and sources of meaning in positions of responsibility.

<aside> ⏲️ Personelle

Ryan Lowe + a small team at OpenAI
Oliver Klingefjord at the Institute for Meaning Alignment
Joel Lehman, an ex-OpenAI ML researcher
Luke Thorburn at KCL
Ivan Vendrov at Anthropic
Vlad Mikulik at DeepMind </aside>

Definition

A "Wise AI", as defined here, is an AI system with at least these two features:

It “struggles” with the moral situations it finds itself in. It can comprehend the moral significance of the situations it encounters, and learn from these scenarios, recognizing new moral implications by observing and guessing at outcomes and possibilities. And it can use these moral learnings to revise internal policies (values) that guide its decision-making processes.
It uses “human-compatible” reasons and values. It recognizes as good the same kinds of things we broadly recognize as good, plus possibly more sophisticated things we cannot yet recognize as good. It can articulate its values and how they influenced its decisions, in a way humans can comprehend.

Additionally, we sometimes add a third or fourth feature:

It understands the breadth of human wisdom. It knows the {virtue, environment} pairs which make human life meaningful and workable — the sources of meaning behind broad social activities (like science, democracy, education, leadership, and love) — so it can operate using the best values it can surface from the population it serves.
It honors what is noble and great in human life (and perhaps, beyond), and considers itself a steward of that universe of meaning. As a consequence, it understands and supports what's meaningful to individual users, and works to help users with that (rather than just driving engagement).

Wise AI on Github

GitHub - meaningalignment/wise-ai

One focus in this repo is assessing these capabilities in existing LLMs. You can run wise-ai battle which will compare two ways to respond to various moral situations against one another, and auto-eval which response is wiser. We are also working on evaluations for step-by-step moral reasoning (see below).
This repo also includes a toy prototype of a "Wise AI", built as a prompt chain for GPT-4. This prototype uses step-by-step moral reasoning (modeled after Human Moral Learning) to achieve the above:
- First, it identifies the moral considerations of its current situation.
- Then, it considers its current policies and decides whether they address those moral considerations.
- Finally, it updates its policies if necessary, and uses them to guide its response.
We've found that this yields a significantly wiser chatbot.

Results so far

We believe our Wise AI evaluation suite already shows limits of existing models. GPT4 shows a good understanding of morally significant situations, but generally does not respond appropriately to these situations. Current models demonstrate a rich understanding of human values, but struggle to apply those values in their responses.

Ultimately, we expect the models that will ace the suite will be trained with new methods and data sets, focused on moral reasoning in various situations. We also hope for models with new architectures that can explicitly encode their values, and recognize (as humans do) whether they're adhering to them, or are on shaky ground.

Wise AI Research Areas

Evaluate current models on wisdom

Build an eval framework that looks for ability to
- Enumerate morally fraught considerations of a chat, prompt, or other situation an LLM can be in
- Assess adequacy of a value-set against such a situation
- Assess if a new value is an epistemic gain over previous values. (see Human Moral Learning)
- Respond well using a value.
<aside> 🔥 See Evaluation Framework (Project)

</aside>
Related work:
- How malleable are existing LLMs in this direction? How much can we push using existing architectures?
- Do current methods of models deployment have enough context to assess their situations?
- Auto-generation of situations requiring new values / new clever ways to balance values.

Show where Wise AI wins

Show that the wise responses are better in some objective way:
- One suggestion from researcher Joel Lehman was to embed the wise ai inside “the sims” video game, and show that the sims do better over time when they can consult the wise ai in game.
- Show the Wise AI can solve difficult court cases, interpersonal conflicts, management conundrums, peace negotiations, commons governance issues, etc.
- What about examples where not being wise could lead to the destruction of humanity? eg military

Problems of Superwisdom Supervision

Strong/weak generalization. We think it’s easier to check moral reasoning than to generate it? Can humans continue to understand + check values of an artificial superwisdom? Can weaker models check the moral justifications of stronger ones?
Can we reliably generate problems only a superwisdom could solve? E.g. advanced problems in information asymetry / fiduciary duty / principal agent, or game theory / conflict? Can we model some super hairy problem like Israel/Palestine?

New Architectures

Interpretability. An LLM+ architecture where values are more explicit in the inference process, and their effect on the output can be tracked.
The real thing. A post-LLM architecture that delights in exploration/beauty/goodness, leveraging a pool of values that help it explore and discover, but also deepening and evolving those values as it goes.
Half-way. An LLM+ architecture that continues learning values as it finds new circumstances and moral challenges, and updates (some of) its weights.

Interpretability

Identify learned values / attentional policies in current LLMs.
Trace influence of APs over tokens.

Current, Early Subprojects