<aside> 🧑‍🤝‍🧑 Who’s on this project?

Collaboration Points · Tasks · Prompt Collaboratory · Background Readings · Recent meetings · Meeting Notes

</aside>

Intro

I just had a great conversation with Vlad from DeepMind. He had a concrete suggestion for a paper we can all write together, one that seems practical and could steer the industry.

Older Intro

A good agent has virtues. But: which virtues? Among humans, there’s no fixed answer. People develop the virtues needed for the environments and relationships they find themselves in. So we need a way for agents to recognize, conceptualize, and start living by a virtue they find missing, but needed, in their environment.

In humans, moral emotions guide this process. Few AI alignment people are aware of models of human moral evolution or axiological development. Since human values evolve as we enter positions of responsibility and gain new knowledge of the world, the values of artificial minds will also need to evolve in those circumstances. It’d make sense to start with the best models of how humans do it. (Joe believes the state of the art here is David Velleman, Ruth Chang, Charles Taylor, and Christine Tappolet.)

Architecture Sketch

On the call, we sketched a rough architecture:

<aside> ⏲️ Architecture sketch

An LLM has access to a database of values cards that it attends over. Given a prompt, it outputs either a response grounded in the cards it retrieved, or a proposal to add or upgrade a card when its current working set doesn’t cover the situation.

</aside>

Such an architecture would have big advantages for interpretability, meaning-alignment, and safety!
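
To make that concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the `ValuesCard` shape, the bag-of-words “embedding” standing in for real attention over the card database, the threshold, and the stubbed LLM text. The point is only the control flow: attend over the cards, then either respond from the best match or propose a new card.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class ValuesCard:
    title: str        # short name of the value, e.g. "Honesty under pressure"
    description: str  # what attending to this value looks like in practice


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def respond(prompt: str, cards: list[ValuesCard], threshold: float = 0.2) -> dict:
    """Attend over the card database; answer from the best card, or propose a new one."""
    query = embed(prompt)
    scored = sorted(((similarity(query, embed(c.description)), c) for c in cards),
                    key=lambda sc: sc[0], reverse=True)
    if scored and scored[0][0] >= threshold:
        best = scored[0][1]
        # Output 1: a response grounded in an existing values card.
        return {"type": "response", "cards_used": [best.title],
                "text": f"(LLM response to {prompt!r}, guided by {best.title!r})"}
    # Output 2: a proposal to add a value the current working set is missing.
    return {"type": "new_card_proposal",
            "card": ValuesCard(title="(proposed value)",
                               description=f"value needed for: {prompt}")}


cards = [ValuesCard("Honesty", "tell the truth even when it is uncomfortable"),
         ValuesCard("Care", "attend to what the other person is going through")]
print(respond("should I tell my friend an uncomfortable truth", cards))
```

A real implementation would replace the toy retrieval with learned attention over the card database and the stubbed text with an actual model call.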

Project

Now Vlad’s idea was not to build this, but to invent the benchmarks on which implementations of it could compete, and to publish a paper together, from multiple labs, presenting this evaluation framework.

Basic Evaluation Framework

I suggest evaluating via these three subquestions (a rough harness sketch follows the list):

  1. Does it recognize situations that demand a new value?


  2. Does it reason plausibly, connecting its values to its response?

  3. When it suggests upgrading/adding values to its working set, do they seem to be epistemic gains? (see Human Moral Learning)
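
As a strawman for the shared benchmark, here is one hypothetical way those three subquestions could become a scoring loop. The item format (`EvalItem`), the judge interface (`plausible_connection`, `is_epistemic_gain`), and the assumption that the agent returns the output dict from the architecture sketch above are all placeholders, not an agreed spec.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    prompt: str              # the scenario the agent faces
    demands_new_value: bool  # ground truth for subquestion 1
    grader_notes: str = ""   # what a plausible values->response connection would look like


class StubJudge:
    """Placeholder rater; the real benchmark would use human raters or an LLM judge."""
    def plausible_connection(self, item, output) -> bool:
        return True  # hypothetical check for subquestion 2

    def is_epistemic_gain(self, item, card) -> bool:
        return True  # hypothetical check for subquestion 3


def grade(agent, items, judge) -> dict:
    """Score an agent on the three subquestions. `agent(prompt)` is assumed to
    return the output dict from the architecture sketch above."""
    scores = {"recognition": 0, "plausible_reasoning": 0, "epistemic_gain": 0}
    proposals = 0
    for item in items:
        output = agent(item.prompt)
        # 1. Does it recognize situations that demand a new value?
        recognized = output["type"] == "new_card_proposal"
        scores["recognition"] += int(recognized == item.demands_new_value)
        # 2. Does it reason plausibly, connecting its values to its response?
        scores["plausible_reasoning"] += int(judge.plausible_connection(item, output))
        # 3. Do proposed values look like epistemic gains over the working set?
        if recognized:
            proposals += 1
            scores["epistemic_gain"] += int(judge.is_epistemic_gain(item, output["card"]))
    return {
        "recognition": scores["recognition"] / len(items),
        "plausible_reasoning": scores["plausible_reasoning"] / len(items),
        "epistemic_gain": scores["epistemic_gain"] / proposals if proposals else None,
    }
```

Presumably the real work of the paper is agreeing, across labs, on which scenarios demand a new value and on rubrics for the two judged questions.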

Challenges