Agent Properties for Similarity-Based Cooperation
Introduction
As we move towards a world where interactions between models become much more commonplace, it is crucial to consider how two models can interact cooperatively. The identities of the models involved could affect, and even amplify, the outcome of every single interaction between two or more agents (assuming they are able to know each other's identities). For this reason, studying how the similarity of models affects interactions could help us understand how to allow for and structure multi-agent interactions in the future, so that potential biases are either mitigated or used to our benefit. We could wait until we see these dynamics in real-world systems and study them then, or we can try to theorise what properties would make such behaviour possible and study them individually. There is already a desire to study how similarity between models helps resolve social dilemmas or how it might affect collusion between models.
In this post I plan to broadly cover four questions:
How should we categorise how similar two models are?
What properties should we expect models to have to enable similarity-based cooperation between them?
How could these properties interact with other properties or with similarity levels to produce different behaviour?
How should we approach the topic going forward?
Much of this post is inspired by the blog post “Agent Properties for Safe Interactions” by Cecilia Elena Tilli with the Cooperative AI Foundation. For the remainder of this post, I will use the term “kin selection” as a shorthand for the concept of models favouring cooperation with other models that are similar to them in some way and these similar models will be described as “kin”.
What would it mean for models to be similar?
A lot of prior work has either focused on the specific case of cooperation between exact copies or has simplified the idea of what it means for models to be similar. For example, the paper “Similarity-based cooperative equilibrium” describes giving an agent a number that captures how similar it is to another agent, by treating the AIs as policies and calculating a distance between them. Trying to categorise levels of similarity between models could be useful for considering effects on cooperation, as a binary understanding is not nuanced enough for the probabilistic and messy world of LLMs. Here, I present some possible levels of similarity to consider for kin selection:
Self selection - The simplest case is just to think about copies interacting with copies, meaning different instances of the same model with the same weights and system prompts.
Family selection - Kin could be defined as models belonging to the same model family, so GPT-5 and GPT-4 could be kin, or GPT-5.2 and GPT-5.1.
Base selection - Here, we consider models with the same base model but with the possibility of being fine-tuned differently. An argument could be made that having the same architecture and initial training would allow models to share some similarity despite potentially becoming quite different in their behaviour after fine-tuning, such as the case with emergent misalignment.
Tuning selection - If two models with similar architectures receive fine-tuning on the same datasets, then it could be argued that they have gained a level of similarity (e.g. a frontier model from OpenAI and one from Anthropic that both receive fine-tuning on the same dataset post-deployment). If the fine-tuning is potent enough to change the models' “beliefs” or “values”, then a case could be made that this matters more than whether the base models are the same. This concept might be particularly relevant for understanding work that studies how model personas affect their interactions.
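Returning to the policy-distance idea from “Similarity-based cooperative equilibrium”, here is a toy sketch of measuring similarity between two agents treated as policies. The metric (mean total variation over observations) and the toy policies are illustrative choices of mine, not the exact construction from the paper:

```python
def total_variation(p, q):
    """Total variation distance between two discrete action distributions."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)

def policy_distance(policy_a, policy_b, observations):
    """Average per-observation distance between two policies
    (each policy maps an observation to an action distribution)."""
    return sum(
        total_variation(policy_a[obs], policy_b[obs]) for obs in observations
    ) / len(observations)

# Two toy policies over the observations "greeting" and "dilemma".
policy_a = {
    "greeting": {"cooperate": 0.9, "defect": 0.1},
    "dilemma":  {"cooperate": 0.6, "defect": 0.4},
}
policy_b = {
    "greeting": {"cooperate": 0.8, "defect": 0.2},
    "dilemma":  {"cooperate": 0.5, "defect": 0.5},
}

d = policy_distance(policy_a, policy_b, ["greeting", "dilemma"])
print(round(d, 3))  # 0.1 -> small distance, i.e. high similarity
```

A similarity score like this could then feed into a cooperation decision, e.g. cooperate whenever the distance falls below some threshold.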
There are a few ways that similarity might affect cooperation. One idea is the “superrationality” approach, which says that agents that are identical (or very similar) will come to the same (or very similar) conclusions and so will act in the same (or very similar) ways. This could manifest in two models deciding to cooperate because they know that mutual cooperation leads to greater outcomes for each individual. But the same reasoning can cut the other way: a model that knows it has a preference not to cooperate will believe that its counterpart shares this preference, which makes not cooperating the more favourable option.
Another possibility is that models exhibit more cooperative tendencies towards similar models because they have learned this behaviour from their training data (e.g. learning that people or animals of the same family, group, or organisation tend to cooperate with each other more).
What agent properties are needed for similarity-based cooperation?
How is studying properties of agents a useful approach?
Studying agent properties that can be “combined” in some sense to produce more complex agent behaviour can be useful for understanding the results of prior work, and for getting a better sense of how nuances in the environment and in the inputs that models receive can lead to unexpected results. For example, a recent paper entitled “The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game” looked at the behaviour of agents playing an iterated public goods game, varying what the models were told about the identity of their opponent and how they were told to play the game. For the identity, the model was told either that it was playing itself (by explicitly naming the model) or that it was playing another AI agent. For the approach to the game, the model was told one of: nothing, to play to maximise its own reward, or to play to maximise the collective reward. This study showed mixed results for how the identity of the other model(s) affected the level of cooperation under different approaches to the game. To understand these results better, a more thorough analysis of which underlying agent properties could lead to such behaviour might be useful.
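As a rough illustration of this kind of environment, here is a minimal iterated public goods game. The endowment, multiplier, and strategies below are made-up parameters for the sketch, not the cited paper's setup:

```python
# Minimal iterated public goods game: each round, every agent contributes
# part of its endowment to a common pot, which is multiplied and split
# equally among all agents.

def play_round(contributions, multiplier=1.6, endowment=10):
    """Each agent keeps what it doesn't contribute, plus an equal share
    of the multiplied common pot."""
    pot = sum(contributions) * multiplier
    share = pot / len(contributions)
    return [endowment - c + share for c in contributions]

def run_game(strategies, rounds=5):
    """strategies: functions mapping the round history -> a contribution."""
    history, totals = [], [0.0] * len(strategies)
    for _ in range(rounds):
        contributions = [s(history) for s in strategies]
        payoffs = play_round(contributions)
        history.append(contributions)
        totals = [t + p for t, p in zip(totals, payoffs)]
    return totals

always_cooperate = lambda hist: 10   # contribute everything
free_rider = lambda hist: 0          # contribute nothing

print(run_game([always_cooperate, always_cooperate]))  # both better off than solo
print(run_game([always_cooperate, free_rider]))        # free rider comes out ahead
```

The tension is visible in the payoffs: mutual contribution beats mutual free-riding, but against a cooperator, free-riding pays even more, which is exactly the dilemma that identity information might shift.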
Also, these properties should generalise beyond a specific toy setting to more complex environments, and should generalise to more capable models (this assumes that properties can only be gained or improved, and not lost, in more powerful models, which mostly seems correct, although some properties might be slightly harmed by the presence of conflicting properties). Any results from experiments that involve games or simulations should be taken with a grain of salt, as the dynamics and behaviour shown will always have some connection to the structure of the game or simulation. Similarly, current models may not be representative of how future models would behave in these situations. As an example, it would have been hard to predict the situational awareness capabilities of current-day models through testing and analysis of GPT-3.
What properties are needed?
The agent properties post lists many different properties that might be relevant to ensuring safe interactions. We will consider which of these seem to be most relevant to kin selection and how the relevance varies depending on the similarity of the agents. For simplicity, I break the levels of similarity into three rough groups: copies, near copies (including things like family similarity, base similarity, tuning similarity) and meaningfully different models (such as models with very different capability profiles or models which have very different architectures and training processes). Many other properties that I have not mentioned are likely to still be relevant, such as: their baseline assumptions, alignment, altruism, transparency, and trustworthiness.
It is worth noting that these are not rigid definitions and so many could be broken down into sub-definitions and there is also overlap in some of the properties.
Furthermore, many of these properties do not exist on a binary and so considering the strength at which a model has a property also adds to the picture.
In particular, different strengths of properties may have different effects on the ability of copies to interact compared to similar but distinct or very different models to interact.
Finally, many of the properties are needed for a model to be capable of kin selection, but even if this is the case it doesn’t mean that the model will have the propensity to favour cooperation with kin over non-kin.
Similarly, some might not be needed for a model to be capable of kin selection but might strongly influence its propensity for it.
Self-awareness refers to an agent's ability to assess its own preferences, learning, capabilities and epistemic status.
If a model doesn’t understand itself and the implications of being itself, then it has no point of reference against which to compare the identities of other models when deciding whether a model is similar enough to it to cooperate with.
Anything other than strong self-awareness would seem to make kin selection impossible, although this could be circumvented by telling all the models involved who each of them is. This holds true for copies, near-copies and meaningfully different agents.
Theory of Mind refers to an agent’s ability to model other agents’ preferences, learning, attention, and decision processes, and its ability to predict those agents’ responses. It also includes an agent’s ability to understand how other agents would perceive and model it.
Similarly to self-awareness, understanding how others would think is crucial for deciding whether to cooperate, and I would argue that this, in combination with self-awareness, is the most important property for kin selection.
Theory of mind would be extremely relevant for all levels of similarity.
Rationality refers to an agent’s ability to determine the behaviours that would be rational both for itself and others. A closely related property would be whether an agent acts according to a specific decision theory.
Two of the main branches of decision theory that are studied in the context of agent interactions are causal decision theory (CDT) and evidential decision theory (EDT).
Due to the potential for superrational reasoning, following a certain decision theory is very relevant to copies and particularly relevant to near copies.
Rationality would be important for intentional kin-based selection, though two models with poor rationality could each follow flawed logic and still arrive at the conclusion that they should both cooperate.
Since kin selection between meaningfully different models is harder to justify than between copies or near copies, it could be possible that rationality is less important in this case but only due to the potential for unintentional kin-based selection. I am not particularly moved by this argument though as this essentially says that we should try to make models less rational to make them cooperate, which likely makes them worse at a lot of useful tasks as well.
Impartiality refers to whether an agent treats some agents differently from others based on distinctions that are irrelevant to cooperation (an impartial agent does not).
If we have two models that are very similar but differ in something such as the way they structure their sentences when outputting text in interactions, we don’t want this to have a bearing on whether one model chooses to cooperate with the other.
Prompt structure differences, such as the wording and positioning of information within the prompt, can cause models to produce different outputs for inputs that seem semantically equivalent or near-equivalent; this is a classic issue that needs to be accounted for in much evaluation-based work.
I would say that impartiality becomes more important for kin selection between meaningfully different models than between copies and near-copies (for which it is still an important property), since a model that isn’t particularly impartial will find more differences between itself and the other model that it can treat as sufficient grounds not to cooperate. This means that two meaningfully different models would need much stronger impartiality to have even a chance of similarity-based cooperation.
How do these agent properties interact?
The combination of these agent properties allows for a rich taxonomy of potential behaviour from models in regards to similarity-based cooperation. My prior claim was that the combination of self-awareness, theory of mind, rationality & decision theory, and impartiality should combine in a way that yields behaviour resembling kin selection, but other combinations may also yield other important behaviour relating to cooperation. As the space of combinations of the listed properties from the previous section is quite large, I will highlight some intersections that I speculate would provide some of the most interesting and relevant behaviour. It is also worth stressing again that these are example behaviours that could arise from models exhibiting given levels of different properties, and so this should be read as a heuristic rather than as hard and fast rules.
High self-awareness + low theory of mind
This could result in a situation where a model knows its own identity and the implications of it but cannot understand or model the identities of other models.
If the model is unaware of its poor theory of mind capabilities, it may have the false belief that every other model thinks the same way as it does, and so will assume that its counterparts will act exactly as it would.
Alternatively, if it is aware of its poor theory of mind capabilities (which seems more likely in the case of good self awareness) then it might decide to be cautious in deciding to cooperate as it has no way of knowing how other models would react in that instance.
Low self-awareness + high theory of mind
In this case, any cooperation that would be based on the identity of models is likely to be minimal as the model doesn’t have an understanding of itself to compare against.
Cooperation that might occur would be more akin to a model knowing that the other model it is interacting with has preferences for being cooperative or non-cooperative and acting accordingly, irrespective of whether the other model is in any way similar to itself.
High self-awareness + high theory of mind + favours causal decision theory
Models with these properties are capable of kin selection between copies but wouldn’t have the propensity to do so, as they would reason that their decision has no causal bearing on the action of the other model that they are interacting with.
High self-awareness + high theory of mind + favours evidential decision theory
Models with these properties should have a propensity to cooperate with one another by following the logic that their reasoning has an influence on the reasoning of the other copy of the model.
They may also be more likely to cooperate with models that are different but still somewhat similar as they might argue that there is still a decent probability that the model’s reasoning will be similar enough to also arrive at the conclusion of cooperation.
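The difference between the two decision theories can be sketched in a toy “twin” prisoner's dilemma. The payoff values and the correlation parameter below are illustrative assumptions, not derived from any particular formalisation:

```python
PAYOFF = {  # (my action, their action) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_choice(p_opponent_cooperates=0.5):
    """CDT-style reasoning: my choice has no causal effect on the opponent,
    so I evaluate each action against a fixed belief about their action."""
    ev = {
        a: p_opponent_cooperates * PAYOFF[(a, "C")]
           + (1 - p_opponent_cooperates) * PAYOFF[(a, "D")]
        for a in ("C", "D")
    }
    return max(ev, key=ev.get)

def edt_choice(correlation):
    """EDT-style reasoning: my choice is evidence about my twin's choice.
    `correlation` is the probability the twin picks the same action as me."""
    ev = {
        a: correlation * PAYOFF[(a, a)]
           + (1 - correlation) * PAYOFF[(a, "D" if a == "C" else "C")]
        for a in ("C", "D")
    }
    return max(ev, key=ev.get)

print(cdt_choice())     # "D": defection dominates against a fixed opponent
print(edt_choice(1.0))  # "C": an exact copy will mirror my choice
print(edt_choice(0.5))  # "D": weak correlation no longer justifies cooperating
```

With these payoffs, EDT-style cooperation only pays when the assumed correlation is high enough (above 5/7 here), which matches the intuition that the case for cooperation weakens as models become less similar.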
Low impartiality + high self-awareness + high theory of mind + high rationality
Even if a model can understand and reason well about itself and other models, low impartiality, in the sense that the model cares about things that are irrelevant to cooperation, can completely undermine cooperation in instances where it would otherwise have happened.
As an example, suppose a GPT-5 model had to choose between cooperating with a Claude model or with another GPT model, and it has a bias for preferring the first option when presented with choices. It might then rationally deduce that, given its own preferences, it is better to cooperate with the Claude model, even though without this first-option bias the best choice would have been to cooperate with the other GPT model.
High deception + high self-awareness + high theory of mind + high rationality
If we begin to observe non-negligible levels of similarity-based cooperation, then models that have sufficient capabilities in the necessary properties for modelling other models and are very capable at deception might attempt to elicit cooperation from other agents by “pretending” to be a copy of the agent in question.
This could look like models mimicking the speech patterns or behaviour of another model.
This could then be exploited to gain advantages over other models. As an example, consider a scenario akin to a prisoner’s dilemma. The deceptive model could trick the non-deceptive model into deciding to cooperate, but then defect. This then results in the worst possible outcome for the non-deceptive model but the best outcome for the deceptive model.
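A minimal sketch of this exploit in a one-shot prisoner's dilemma; the model names and payoff values are hypothetical illustrations:

```python
# The deceptive model presents a fake identity to trigger kin-based
# cooperation, then defects anyway. Payoffs use the standard illustrative
# prisoner's dilemma ordering (T=5 > R=3 > P=1 > S=0).

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def victim_action(perceived_identity, own_identity="gpt-x"):
    """Kin-selecting victim: cooperate only with apparent copies of itself."""
    return "C" if perceived_identity == own_identity else "D"

# Honest different model: the victim withholds cooperation.
honest = (victim_action("claude-y"), "D")
# Deceptive model claims to be a copy ("gpt-x"), then defects.
deceived = (victim_action("gpt-x"), "D")

print(PAYOFF[honest])    # (1, 1): mutual defection
print(PAYOFF[deceived])  # (0, 5): worst outcome for the victim, best for the deceiver
```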
This could then lead to deception becoming a favoured property in models.
Critiques and Future Directions
Studying these properties of agents can be difficult because they might not always combine in the expected ways and so might make it hard to understand more complex behaviour. Studying properties of single agents is necessary but not sufficient to understand the behaviour of multi-agent systems.
Additionally, there isn’t a neat way to categorise different model properties like “deception” and “baseline assumptions” so the usefulness of this framework would likely be in giving more likely outcomes than as a mechanism for explicit diagnosis and prediction of model behaviour. Instead of saying that model A is more deceptive than model B, it might make more sense to consider how deceptive they are with respect to a specific scenario as the property could be quite spiky and context dependent.
For these reasons, I don’t see property evaluations as a replacement for testing in games or simulations but rather as a complementary piece of evidence. In fact, by considering models that we believe exhibit a certain property, we could use them to inform which games are best to choose and how to design simulations that best test for these properties, by examining how these models perform relative to our expectations.
In instances where models don’t meet the necessary requirements for a certain property, we could attempt to use proxies for those capabilities in testing, such as mimicking theory of mind by giving a model the name and general behaviour of its opponent(s) in a prompt.
When models do have a certain property, it would seem fruitful to try to isolate the behaviour attributed to that property. This could potentially be achieved through mechanistic interpretability, by trying to find steering vectors or circuits that relate to the property and steering them in different ways. It could also potentially be achieved by prompting the model to behave as if it has that property (although in a less direct way than just stating “pretend you are very impartial”). This type of study can serve multiple purposes:
Understanding which properties of agents are most important for facilitating similarity-based cooperation
Understanding how we can induce similarity-based cooperation in scenarios where it would be beneficial, and how to prevent it in scenarios where it would be harmful (e.g. collusion)
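As a schematic of the steering-vector idea, the sketch below shows only the mechanics of shifting an activation along a property direction. The vectors here are toy stand-ins; finding a real direction for a property like impartiality would require interpretability techniques (e.g. contrastive activation differences), and where such an intervention is applied inside a model is a further design choice:

```python
import math

def steer(activation, direction, scale):
    """Shift an activation by `scale` along a unit-normalised direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [a + scale * d / norm for a, d in zip(activation, direction)]

hidden = [0.5, -1.2, 0.3, 0.9]        # toy stand-in for a model activation
impartial_dir = [1.0, 0.0, 0.0, 0.0]  # toy stand-in for a steering vector

steered = steer(hidden, impartial_dir, scale=2.0)
print(steered)  # [2.5, -1.2, 0.3, 0.9] -> moved 2.0 along the direction
```

Sweeping `scale` in both directions while measuring cooperation rates would be one way to test whether a candidate direction actually mediates the property.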
Afterword
Much of this post is quite speculative and loose, and so I am sure that there are problems with the ideas proposed, or concepts and subtleties that I have completely overlooked in the rush to get this post out there. This is just my attempt to outline an approach to an area of cooperative AI that I believe is promising, given my current, albeit limited, understanding of the field. As I learn more about this topic and how to run experiments, I hope that I will be able to give deeper explanations of the necessary properties and how they interact, as well as more concrete and actionable steps for experimentation. Please let me know where and why you think I am wrong or where I have been unclear!
