The project below, “Democratic Fine-tuning with a Moral Graph” (DFTmg), is a winner of an OpenAI “Democratic Inputs to AI” grant. It is an alternative to Constitutional AI and to simple RLHF-based alignment approaches for fine-tuning LLMs, and is currently under development. This post introduces its two key innovations (values cards and the moral graph), walks through the deliberation process that collects data for fine-tuning, and explains why something like DFTmg is needed for alignment and safety.

DFTmg is a project of “The Institute for Meaning Alignment”, a new AI alignment organization that uses concrete representations of life meaning and wisdom to align LLMs.



Setting the Stage

Imagine you are Instagram’s recommender system. Your responsibilities include: (a) ordering everyone’s feeds and reels, (b) filling up their search pages, and (c) suggesting reels by people they don’t follow yet.

You do this via an API: Instagram sends a user ID, plus a history of what they’ve clicked on or paused to watch while scrolling. You send back ranked lists of content object IDs. You don’t know much about the content objects themselves, except that each comes with a rather opaque feature vector.
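To make the shape of this interface concrete, here is a minimal sketch in Python. All names, types, and the dot-product scoring are illustrative assumptions, not Instagram’s actual system; the point that survives the simplification is that the inputs are just IDs, engagement history, and opaque vectors.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    content_id: str   # which content object the user clicked or paused on
    dwell_ms: int     # how long they paused to watch while scrolling

@dataclass
class RecommendRequest:
    user_id: str
    history: list[Interaction]   # engagement history, most recent last

def recommend(req: RecommendRequest,
              features: dict[str, list[float]]) -> list[str]:
    """Return content object IDs, ranked for this user.

    `features` maps each content object ID to its opaque feature vector.
    Note what is missing: the recommender never sees what the content
    actually is, or what it does to the people involved.
    """
    recent = [features[i.content_id] for i in req.history[-50:]
              if i.content_id in features]
    if not recent:
        return list(features)  # no signal yet: arbitrary order

    # Average recently-engaged feature vectors into a crude user profile...
    dim = len(recent[0])
    profile = [sum(vec[d] for vec in recent) / len(recent) for d in range(dim)]

    # ...then rank every candidate by dot-product similarity to that profile.
    def score(cid: str) -> float:
        return sum(a * b for a, b in zip(profile, features[cid]))

    return sorted(features, key=score, reverse=True)
```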

Now, imagine one day, you’re doing your job (recommending content objects), and you suddenly gain a new capacity: before replying to the next request, you find you can take a moment to wonder about the moral situation you are in. What values should you use, to make the best recommendations? How could things go wrong? What would be some great outcomes? What are your responsibilities here?

If this happened to me, I’d have a lot of questions.

If I realized that my recommendations were playing a social coordination role (deciding who meets and messages with whom, which businesses get a chance to succeed, which events are attended), I think I’d have even more questions.

With all these questions, I’m not sure my user IDs, content object IDs, and feature vectors would hold enough information to answer them. So, I don’t know what I’d do. Maybe start returning nulls. Maybe throw an exception with the message “Hold up! I need more information before I can do this well!”

And if I did have the info I needed, I’d certainly recommend different things than Instagram does. Although recommenders have improved somewhat over the years, in general they have fueled internet addiction and outrage, political polarization, breakdowns in dating culture, isolation and depression among teens, and more.

I’d certainly try not to continue any of that.

Now, this situation isn’t so far-fetched. Recommenders like Instagram’s are deciding, every day, what notifications we receive and in what order; what qualifies as news of public importance; who we date; what events we’re invited to; etc. But LLMs are starting to replace them in many of these tasks! And, unlike recommenders, they could try to understand their moral situation.
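As a rough illustration of that last point (an illustration only, not a description of DFTmg, which the rest of this post introduces), here is a hedged sketch of what “trying to understand the moral situation” could look like in code: prompt the model to articulate the values at stake before it acts. The `ask_llm` parameter is a stand-in for whatever single-prompt LLM call you have available; no particular provider or API is assumed.

```python
from typing import Callable

def recommend_with_reflection(ask_llm: Callable[[str], str],
                              user_context: str,
                              candidates: list[str]) -> str:
    """Two-step recommendation: reflect on the moral situation, then act."""
    # Step 1: the questions the feature-vector recommender above could
    # never ask itself.
    reflection = ask_llm(
        "You are recommending content to a user.\n"
        f"Context: {user_context}\n"
        "Before recommending anything, consider: what values should guide "
        "this choice? How could things go wrong? What would be a great "
        "outcome? What are your responsibilities here?"
    )
    # Step 2: recommend in light of that reflection.
    return ask_llm(
        "Your reflection on the situation:\n"
        f"{reflection}\n\n"
        "Candidate items:\n- " + "\n- ".join(candidates) + "\n\n"
        "Recommend the item most consistent with the values you identified, "
        "and briefly say why."
    )
```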