Foundations of Cooperative AI Through an LLM Lens
A personal summary and reflection
Introduction
I have recently started the “Introduction to Cooperative AI” course from the Cooperative AI Foundation, and as part of the first week of readings I reread “Foundations of Cooperative AI” by Conitzer & Oesterheld. This is a paper that I have used as a basis for the ideas in my previous two posts about similarity-based cooperation and program games. My goal was to continue that series by relating the paper’s ideas about self-locating beliefs to LLMs, but I couldn’t come up with many good ideas that weren’t fairly obvious or already discussed in other works.

Since my initial reading of the paper was through the lens of application to LLMs, reading it again from a more neutral perspective for the course helped me see how tunnel-visioned I had been. From this, I decided to pivot from trying to force a post about self-locating beliefs for LLMs to discussing some of the overall insights I have gained from the process and the important ideas to take away from it. The rest of this post is a reflection on that process, including some of my main takeaways and the errors in the framing I used while reading the paper. I hope this will be informative for people who are relatively new to cooperative AI (or AI safety more generally).
An aside: as is always worth mentioning when discussing cooperative AI, the ways that models cooperate should not automatically be seen as things to strive towards. Cooperative capabilities are inherently dual-use: they can reduce conflict between interacting agents, but they can also allow models to coordinate on misaligned actions like scheming, steganography, or interactions that have negative externalities for humans. Thus, the following discussion should be read as being about LLM-based agents moving from less cooperative to more cooperative, not from less good to more good.
Key points about the paper
For those who have not read the paper, “Foundations of Cooperative AI” by Conitzer and Oesterheld is a call to action for researchers to investigate foundational ideas that might be relevant for understanding and promoting cooperation between AI agents, with particular emphasis on ideas that have no analogue in cooperation between humans. Importantly, the paper was not written with any specific reference to LLMs.
My main takeaways from reading and deliberating about the paper are:
The possibility of two agents being copies of one another provides a basis for many interesting cooperative capabilities that humans cannot access. This seems especially relevant for LLMs, as any instance of a model can be considered a copy of any other instance (up to differences in context). How this idea might actually affect cooperation is still unclear: it could play a role via decision-theoretic reasoning as outlined in the paper, or it could be relevant for identity-based cooperation. See my previous post “Agent Properties for Similarity-Based Cooperation” for more information.
Utilising code and model internals is another important idea. Not just being able to write code, but being able to read others’ code, simulate others’ code, and edit one’s own code are all crucial capabilities: as the paper argues, they can enable cooperation through understanding an opponent’s internals and allow for gradual disarmament in one-shot scenarios (a toy sketch of such a program game follows these takeaways). I would argue that the most likely way this manifests with LLMs is through using LLMs to produce code that is then used for such purposes, as discussed in my previous post “Program Games for LLMs”.
Some notion of persistent memory, which LLMs currently lack, is important for many of the ideas in the paper. This has simple workarounds, however, such as longer context windows and tools that store information for later retrieval.
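To make the copy and code-reading ideas concrete, here is a minimal sketch of a program game in the style of the program-equilibrium literature, where each submitted program can read its opponent’s source code. The cooperate-iff-identical strategy below is a textbook illustration of how being a verifiable copy can sustain cooperation in a one-shot prisoner’s dilemma; it is my own toy example, not something taken from the paper.

```python
import inspect

# Each program receives its opponent's source code and returns a move.
# Cooperating exactly when the opponent's source matches your own makes
# mutual cooperation stable in a one-shot prisoner's dilemma: any deviation
# means no longer being a copy, which triggers defection.

def clique_bot(opponent_source: str) -> str:
    """Cooperate if and only if the opponent is syntactically identical to me."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def play(program_1, program_2):
    """Run a one-shot program game: each program sees the other's source."""
    return (program_1(inspect.getsource(program_2)),
            program_2(inspect.getsource(program_1)))

print(play(clique_bot, clique_bot))  # ('C', 'C'): copies recognise each other
```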
Some lessons
Here I will share my reflections on the mistakes I made in reading the paper and applying it to LLMs, in the hope that others interested in LLM cooperation will find them useful. I don’t claim that these are novel insights into how to read papers, how to think about research, or how to red-team ideas; they are just some considerations that I think are worth keeping in mind going forward.
Considering the Effects of Scaffolding and Infrastructure
When considering applications to LLMs, it is important to consider how they might evolve in the future or acquire new scaffolding. An example of what I mean by an LLM-based agent with scaffolding is ouroboros, a self-modifying agent that uses LLMs as a base and as a judge to generate its own source code, and it is this source code that is continuously updated. Such an agent could much more easily facilitate things like “reading each other’s code” or “disarmament”, which are less obviously available to plain LLMs interacting via API. For an agent like this, objections based on the black-box nature of LLMs become largely irrelevant.
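To give a flavour of what such scaffolding might look like, below is a minimal sketch of a self-modification loop in the spirit of ouroboros. The `llm` helper is a hypothetical stand-in for any chat-completion API, and the prompts and acceptance criterion are my own illustrative assumptions, not details of the actual project.

```python
# Minimal sketch of a self-modifying agent loop (in the spirit of ouroboros).

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to any chat-completion API."""
    raise NotImplementedError

def self_modification_step(source: str) -> str:
    """One update step: a base model proposes a rewrite of the agent's own
    source code, and a judge model decides whether to accept it."""
    proposal = llm(f"Improve this agent's source code:\n{source}")
    verdict = llm("Does this rewrite preserve the agent's goals? "
                  f"Answer ACCEPT or REJECT.\n{proposal}")
    return proposal if verdict.strip() == "ACCEPT" else source

# Because the agent's behaviour lives in plain source code, other agents can
# read, simulate, or verify it directly, so the black-box objection to LLM
# cooperation no longer applies at the level of the agent.
```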
Not only can the scaffolding around the agent differ, but the infrastructure in the environment the agent interacts with can also be a big factor for cooperation. This approach complements efforts to ensure that individual models have internals that promote cooperation: by structuring the agent’s environment so that the equilibria are clear and known, it can facilitate ideas like optimal equilibrium selection much more directly. Some examples of such infrastructure are commitment devices and verification systems.
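As a toy illustration of how such infrastructure can shift equilibria, here is a minimal sketch of a bond-based commitment device. The escrow mechanism and payoff numbers are illustrative assumptions rather than a design from the paper.

```python
# Toy commitment device: each agent posts a bond with a trusted escrow before
# playing a one-shot prisoner's dilemma. Defection forfeits the bond, so when
# the bond exceeds the gain from defecting, mutual cooperation becomes the
# clear equilibrium.

BOND = 5          # stake posted by each agent (illustrative)
PAYOFFS = {       # row player's payoff in a standard prisoner's dilemma
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 4, ("D", "D"): 1,
}

def settle(move_a: str, move_b: str) -> tuple[int, int]:
    """Return net payoffs after the escrow refunds or confiscates bonds."""
    pay_a = PAYOFFS[(move_a, move_b)] - (BOND if move_a == "D" else 0)
    pay_b = PAYOFFS[(move_b, move_a)] - (BOND if move_b == "D" else 0)
    return pay_a, pay_b

print(settle("C", "C"))  # (3, 3)
print(settle("D", "C"))  # (-1, 0): defecting now costs more than it gains
```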
Read Ideas Out of a Paper Instead of Into It
This is quite a simple point, but it is still easy to forget. I put the cart before the horse: I started with the idea of LLM cooperation and then read it into the text, trying to force some of the paper’s ideas to work with LLMs as-is. This did work for some ideas, which became my two previous posts, but there are limits to the approach.
To illustrate this, I was originally going to write this blog post about self-locating beliefs and was trying my hardest to say something substantive about their relevance to LLM-based cooperation. While I do think there is probably something interesting to say about how eval awareness or monitoring of LLMs could be framed as a variant of the sleeping beauty problem, which is the canonical example of self-locating beliefs, these are not particularly deep or useful insights.
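For readers unfamiliar with the setup: a fair coin is flipped, and sleeping beauty is woken once if it lands heads and twice (with her memory erased in between) if it lands tails; the question is what credence she should assign to heads upon waking. The Monte Carlo sketch below simply counts over awakenings, which recovers the “thirder” answer of 1/3; it operationalises one way of counting rather than settling the halfer/thirder debate.

```python
import random

# Sleeping beauty: a fair coin is flipped; beauty is woken once on heads and
# twice on tails, with no memory of earlier wakings. Counting the fraction of
# awakenings that occur after heads gives the 'thirder' credence of 1/3.

def simulate(trials: int = 100_000) -> float:
    heads_wakings, total_wakings = 0, 0
    for _ in range(trials):
        heads = random.random() < 0.5
        wakings = 1 if heads else 2
        total_wakings += wakings
        if heads:
            heads_wakings += wakings
    return heads_wakings / total_wakings

print(simulate())  # ~0.333
```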
Focusing too much on making things work for LLMs prevented me from engaging with the ideas in a way that might have been more useful. For example, with self-locating beliefs, the direct transfer to LLMs is the less interesting point; the more interesting angle is to consider the real-world safety cases in which the idea might be particularly relevant, such as evaluation awareness or control paradigms like untrusted monitoring.
Final Thoughts
Looking forward, I am personally most convinced by the idea of studying levels of cooperation based on the identity of other models. When transferring highly idealised, game-theoretic ideas like these to LLMs, something will always be lost, but I think there is still value in the approximate directions that such work can provide.
