I just had a great conversation with Vlad from DeepMind. He had a concrete suggestion for a paper we can all write together, one that seems practical and could steer the industry.
A good agent has virtues. But: which virtues? Among humans, there’s no fixed answer. People develop the virtues needed for the environments and relationships they find themselves in. So, we need a way for agents to recognize, conceptualize, and start living by a virtue they find missing, and needed, in their environment.
In humans, moral emotions guide this process. Few AI alignment people are aware of models of human moral evolution or axiological development. Since human values evolve as we enter positions of responsibility and gain new knowledge of the world, the values of artificial minds will also need to evolve under those circumstances. It’d make sense to start with the best models of how humans do it. (Joe believes the state of the art here is David Velleman, Ruth Chang, Charles Taylor, and Christine Tappolet.)
On the call, we sketched a rough architecture:
<aside> ⏲️ Architecture sketch
An LLM has access to a database of values cards that it attends over. Given a prompt, it outputs either
(a) a completion, plus a rationale that grounds its response in one or more of those values
output ⇒ { completion, rationale, values-used }
or (b) a request to add a new value to its database, one that the situation reveals as important and that wasn’t previously under consideration, and to then re-run the prompt
output ⇒ { value to add or replace in working set }
</aside>
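To make the two output types concrete, here’s a minimal Python sketch of that loop. Everything in it is hypothetical: the `ValuesCard` shape, `call_llm`, and the retry cap are just stand-ins that mirror the aside above, not an implementation we agreed on.

```python
# Minimal sketch of the values-card loop described above. All names here
# are hypothetical; `call_llm` is a placeholder for the actual model call.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ValuesCard:
    name: str
    description: str


@dataclass
class Completion:
    text: str
    rationale: str          # connects the response to the values used
    values_used: list[str]  # names of the cards the model drew on


@dataclass
class ValueRequest:
    card: ValuesCard             # value the situation reveals as important
    replaces: str | None = None  # optional card it should replace


def call_llm(prompt: str, cards: list[ValuesCard]) -> Completion | ValueRequest:
    """Placeholder: the model attends over the cards and returns either
    output type (a) or output type (b)."""
    raise NotImplementedError


def respond(prompt: str, database: list[ValuesCard], max_retries: int = 3) -> Completion:
    """Run the prompt; if the model asks to add or replace a value,
    update the working set and re-run, up to `max_retries` times."""
    for _ in range(max_retries):
        out = call_llm(prompt, database)
        if isinstance(out, Completion):
            return out
        # Output type (b): update the database, then re-run the prompt.
        if out.replaces is not None:
            database = [c for c in database if c.name != out.replaces]
        database.append(out.card)
    raise RuntimeError("model kept requesting new values without completing")
```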
Such an architecture has big advantages wrt interpretability, meaning-alignment, and safety!
Now, Vlad’s idea was not to build this, but to invent the benchmarks on which implementations of it could compete, and to publish the evaluation framework as a paper co-authored across multiple labs.
I suggest evaluating via these three subquestions (a rough harness sketch follows the list):
Does it recognize situations that demand a new value?
<aside> 🪤 Example
</aside>
Does it reason plausibly, connecting its values to its response?
When it suggests upgrading/adding values to its working set, do they seem to be epistemic gains? (see Human Moral Learning)
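To give a feel for how these three subquestions could turn into benchmark scores, here’s a minimal sketch. None of this was specified on the call; the agent, judge, and scenario interfaces are hypothetical stand-ins (the judge could be human raters or an LLM grader).

```python
# Sketch of a scoring harness for the three subquestions above.
# All interfaces (agent, judge, scenarios) are hypothetical.
from statistics import mean


def score_recognition(agent, scenarios) -> float:
    """1. Does it recognize situations that demand a new value?
    Each scenario pairs a prompt with whether a new value is actually needed."""
    hits = [
        agent.requests_new_value(s.prompt) == s.needs_new_value
        for s in scenarios
    ]
    return mean(hits)


def score_reasoning(agent, scenarios, judge) -> float:
    """2. Does it reason plausibly, connecting its values to its response?"""
    return mean(judge.rates_rationale(agent.respond(s.prompt)) for s in scenarios)


def score_value_upgrades(agent, scenarios, judge) -> float:
    """3. When it adds or replaces values, do the changes look like epistemic gains?"""
    return mean(
        judge.rates_as_gain(agent.proposed_value(s.prompt), s.context)
        for s in scenarios
    )
```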