We coordinate research across multiple labs to create ML models that: (1) understand and support the meanings behind important human practices like science, democracy, education, leadership, and love, as well as what's meaningful to individual users; and (2) develop their own values and sources of meaning in positions of responsibility.
<aside>
⏲️ Personelle
Definition
A "Wise AI", as defined here, is an AI system with at least these two features:
- It “struggles” with the moral situations it finds itself in. It can comprehend the moral significance of the situations it encounters, and learn from these scenarios, recognizing new moral implications by observing and guessing at outcomes and possibilities. And it can use these moral learnings to revise internal policies (values) that guide its decision-making processes.
- It uses “human-compatible” reasons and values. It recognizes as good the same kinds of things we broadly recognize as good, plus possibly more sophisticated things we cannot yet recognize as good. It can articulate its values and how they influenced its decisions, in a way humans can comprehend.
Additionally, we sometimes add a third or fourth feature:
- It understands the breadth of human wisdom. It knows the {virtue, environment} pairs which make human life meaningful and workable — the sources of meaning behind broad social activities (like science, democracy, education, leadership, and love) — so it can operate using the best values it can surface from the population it serves.
- It honors what is noble and great in human life (and perhaps, beyond), and considers itself a steward of that universe of meaning. As a consequence, it understands and supports what's meaningful to individual users, and works to help users with that (rather than just driving engagement).
Wise AI on Github
GitHub - meaningalignment/wise-ai
-
One focus in this repo is assessing these capabilities in existing LLMs. You can run wise-ai battle
which will compare two ways to respond to various moral situations against one another, and auto-eval which response is wiser. We are also working on evaluations for step-by-step moral reasoning (see below).
-
This repo also includes a toy prototype of a "Wise AI", built as a prompt chain for GPT-4. This prototype uses step-by-step moral reasoning (modeled after Human Moral Learning) to achieve the above:
- First, it identifies the moral considerations of its current situation.
- Then, it considers its current policies and decides whether they address those moral considerations.
- Finally, it updates its policies if necessary, and uses them to guide its response.
We've found that this yields a significantly wiser chatbot.
Results so far
We believe our Wise AI evaluation suite already shows limits of existing models. GPT4 shows a good understanding of morally significant situations, but generally does not respond appropriately to these situations. Current models demonstrate a rich understanding of human values, but struggle to apply those values in their responses.
Ultimately, we expect the models that will ace the suite will be trained with new methods and data sets, focused on moral reasoning in various situations. We also hope for models with new architectures that can explicitly encode their values, and recognize (as humans do) whether they're adhering to them, or are on shaky ground.
Wise AI Research Areas
Evaluate current models on wisdom
Show where Wise AI wins
- Show that the wise responses are better in some objective way:
- One suggestion from researcher Joel Lehman was to embed the wise ai inside “the sims” video game, and show that the sims do better over time when they can consult the wise ai in game.
- Show the Wise AI can solve difficult court cases, interpersonal conflicts, management conundrums, peace negotiations, commons governance issues, etc.
- What about examples where not being wise could lead to the destruction of humanity? eg military
Problems of Superwisdom Supervision
- Strong/weak generalization. We think it’s easier to check moral reasoning than to generate it? Can humans continue to understand + check values of an artificial superwisdom? Can weaker models check the moral justifications of stronger ones?
- Can we reliably generate problems only a superwisdom could solve? E.g. advanced problems in information asymetry / fiduciary duty / principal agent, or game theory / conflict? Can we model some super hairy problem like Israel/Palestine?
New Architectures
- Interpretability. An LLM+ architecture where values are more explicit in the inference process, and their effect on the output can be tracked.
- The real thing. A post-LLM architecture that delights in exploration/beauty/goodness, leveraging a pool of values that help it explore and discover, but also deepening and evolving those values as it goes.
- Half-way. An LLM+ architecture that continues learning values as it finds new circumstances and moral challenges, and updates (some of) its weights.
Interpretability
- Identify learned values / attentional policies in current LLMs.
- Trace influence of APs over tokens.
Current, Early Subprojects
Democratic Fine-Tuning