The project below, “Democratic Fine-tuning with a Moral Graph” (DFTmg), is a winner of an OpenAI “Democratic Inputs to AI” grant. It is an alternative to Constitutional AI and to simple RLHF-based alignment approaches for fine-tuning LLMs, and is currently under development. This post introduces its two key innovations (values cards and the moral graph), walks through the deliberation process that collects data for fine-tuning, and explains why something like DFTmg is needed for alignment and safety.

DFTmg is a project of “The Institute for Meaning Alignment”, a new AI alignment organization that uses concrete representations of life meaning and wisdom to align LLMs.



Setting the Stage

Imagine you are Instagram’s recommender system. Your responsibilities include: (a) ordering everyone’s feeds and reels, (b) filling up their search pages, and (c) suggesting reels by people they don’t follow yet.

You do this via an API: Instagram sends a user ID, plus a history of what they’ve clicked on or paused to watch while scrolling. You send back ranked lists of content object IDs. You don’t know much about the content objects themselves, except that each comes with a rather opaque feature vector.
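To make the shape of this interface concrete, here is a minimal sketch in Python. All names, types, and the dot-product scoring are illustrative assumptions, not Instagram’s actual system; the point that survives the simplification is that the inputs are just IDs, engagement history, and opaque vectors.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    content_id: str   # which content object the user clicked or paused on
    dwell_ms: int     # how long they paused to watch while scrolling

@dataclass
class RecommendRequest:
    user_id: str
    history: list[Interaction]   # engagement history, most recent last

def recommend(req: RecommendRequest,
              features: dict[str, list[float]]) -> list[str]:
    """Return content object IDs, ranked for this user.

    `features` maps each content object ID to its opaque feature vector.
    Note what is missing: the recommender never sees what the content
    actually is, or what it does to the people involved.
    """
    recent = [features[i.content_id] for i in req.history[-50:]
              if i.content_id in features]
    if not recent:
        return list(features)  # no signal yet: arbitrary order

    # Average recently-engaged feature vectors into a crude user profile...
    dim = len(recent[0])
    profile = [sum(vec[d] for vec in recent) / len(recent) for d in range(dim)]

    # ...then rank every candidate by dot-product similarity to that profile.
    def score(cid: str) -> float:
        return sum(a * b for a, b in zip(profile, features[cid]))

    return sorted(features, key=score, reverse=True)
```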

Now, imagine one day, you’re doing your job (recommending content objects), and you suddenly gain a new capacity: before replying to the next request, you find you can take a moment to wonder about the moral situation you are in. What values should you use, to make the best recommendations? How could things go wrong? What would be some great outcomes? What are your responsibilities here?

If this happened to me, I’d have a lot of questions.

If I realized that my recommendations were playing a social coordination role (deciding who meets and messages with whom, which businesses get a chance to succeed, which events are attended), I think I’d have even more questions.

With all these questions, I’m not sure my user IDs, content object IDs, and feature vectors would hold enough information to answer them. So, I don’t know what I’d do. Maybe start returning nulls. Maybe throw an exception with the message “Hold up! I need more information before I can do this well!”

And if I did have the info I needed, I’d certainly recommend different things than Instagram does. Although recommenders have improved somewhat over the years, in general they have fueled internet addiction and outrage, political polarization, breakdowns in dating culture, isolation and depression among teens, and more.

I’d certainly try not to continue any of that.

Now, this situation isn’t so far-fetched. Recommenders like Instagram’s are deciding, every day, what notifications we receive and in what order; what qualifies as news of public importance; who we date; what events we’re invited to; etc. But LLMs are starting to replace them in many of these tasks! And, unlike recommenders, they could try to understand their moral situation.
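As a rough illustration of that last point (an illustration only, not a description of DFTmg, which the rest of this post introduces), here is a hedged sketch of what “trying to understand the moral situation” could look like in code: prompt the model to articulate the values at stake before it acts. The `ask_llm` parameter is a stand-in for whatever single-prompt LLM call you have available; no particular provider or API is assumed.

```python
from typing import Callable

def recommend_with_reflection(ask_llm: Callable[[str], str],
                              user_context: str,
                              candidates: list[str]) -> str:
    """Two-step recommendation: reflect on the moral situation, then act."""
    # Step 1: the questions the feature-vector recommender above could
    # never ask itself.
    reflection = ask_llm(
        "You are recommending content to a user.\n"
        f"Context: {user_context}\n"
        "Before recommending anything, consider: what values should guide "
        "this choice? How could things go wrong? What would be a great "
        "outcome? What are your responsibilities here?"
    )
    # Step 2: recommend in light of that reflection.
    return ask_llm(
        "Your reflection on the situation:\n"
        f"{reflection}\n\n"
        "Candidate items:\n- " + "\n- ".join(candidates) + "\n\n"
        "Recommend the item most consistent with the values you identified, "
        "and briefly say why."
    )
```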