This repository implements an adaptive agent for Repeated Rock–Paper–Scissors (RRPS) that is designed to perform robustly against a diverse population of adversarial opponents under strict runtime constraints.
The agent combines history-based prediction, game-theoretic reasoning, and online learning to adapt its strategy dynamically, with no offline training. In population-based evaluation against a benchmark set of adversarial bots, the agent ranked in the top 10 in a population of more than 40 opponents.
Rock–Paper–Scissors is a simple game, but it acquires rich strategic structure when it is played repeatedly against adaptive opponents. Different adversaries exhibit different behaviors:
Some follow simple, fixed patterns
Others adapt strategically to exploit predictable opponents
Some play stochastically to avoid exploitation
A strong agent must therefore balance exploitation, adaptation, and defensive randomization, all while operating under tight per-move runtime constraints.
The agent combines three complementary components, selected online using a multi-armed bandit.
The agent maintains a variable-order Markov model over the joint action history. Instead of using a fixed-order model, it:
Searches for the longest suffix of the current joint history that has occurred earlier in the game
Estimates the opponent's next move from the empirical frequencies of the moves that followed those earlier occurrences
Uses a maximum history length of 6 to balance expressiveness and efficiency
This allows the agent to quickly exploit opponents with repetitive or structured behavior.
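A minimal sketch of this predictor is shown below. It assumes the history is stored as a list of (agent move, opponent move) pairs with moves encoded as 0, 1, 2; the function and variable names are illustrative, not the repository's exact ones.

```python
from collections import Counter

MAX_ORDER = 6  # longest history suffix considered

def predict_opponent(history):
    """Predict the opponent's next move from the joint action history.

    `history` is a list of (my_move, opp_move) pairs with moves encoded 0/1/2.
    Returns a predicted opponent move, or None if no suffix has recurred.
    """
    for order in range(min(MAX_ORDER, len(history) - 1), 0, -1):
        suffix = history[-order:]
        counts = Counter()
        # Find every earlier occurrence of this suffix and record what the
        # opponent played immediately afterwards.
        for i in range(len(history) - order):
            if history[i:i + order] == suffix:
                counts[history[i + order][1]] += 1
        if counts:
            return counts.most_common(1)[0][0]
    return None  # no recurring structure; caller falls back to another strategy
```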
A simple best response to a predicted opponent move (level-0 reasoning) is often exploitable by adaptive opponents. To address this, the agent implements iterative best-response reasoning:
It assumes the opponent may be modeling the agent’s behavior
It computes best responses to the opponent’s predicted best response
This process is repeated for multiple reasoning levels
Each reasoning depth corresponds to a distinct strategy, enabling the agent to operate effectively against opponents with different levels of adaptivity.
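Under the same 0/1/2 encoding, the reasoning levels can be sketched as repeated applications of a "counter" operation; the names below are again illustrative.

```python
def beats(move):
    """Return the move that beats `move` (0 = Rock, 1 = Paper, 2 = Scissors)."""
    return (move + 1) % 3

def level_k_response(predicted_opp_move, k):
    """Iterated best response for reasoning depth k >= 1.

    k = 1: counter the predicted opponent move.
    k = 2: assume the opponent anticipates that counter and counters it,
           so counter the counter; and so on for higher k.
    """
    move = beats(predicted_opp_move)      # level 1
    for _ in range(k - 1):
        move = beats(beats(move))         # opponent counters us; we counter back
    return move
```

Because there are only three actions, the depths cycle with period three, so only a small number of distinct levels ever need to be maintained.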
When all structured strategies perform poorly, the agent switches to a uniform random strategy, corresponding to the Nash equilibrium of Rock–Paper–Scissors.
This prevents further exploitation and ensures bounded losses against adversaries specifically designed to exploit deterministic or overly adaptive behavior.
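The fallback itself is a one-liner; the useful property is that playing each action with probability 1/3 gives an expected per-round payoff of exactly 0 against any opponent strategy, which is what bounds the losses. A minimal sketch:

```python
import random

def nash_fallback():
    # Uniform random over Rock/Paper/Scissors (0/1/2): the Nash equilibrium of
    # the stage game. Expected payoff is 0 no matter what the opponent plays.
    return random.randrange(3)
```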
All strategies (random + multiple best-response depths) are combined using a multi-armed bandit:
Each strategy maintains a weight reflecting recent performance
Weights are updated after every round, with decay to adapt to non-stationary opponents
Strategies that would have won are rewarded; losing strategies are penalized
The agent selects the strategy with the highest current weight
This allows the agent to dynamically adapt its reasoning depth and behavior to the current opponent.
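One way such a selector could be wired up is sketched below; the decay constant, the scoring scheme, and the class name are illustrative assumptions, not the repository's exact implementation.

```python
# Payoff to the row move against the column move: win = +1, tie = 0, loss = -1.
PAYOFF = {
    (0, 0): 0, (0, 1): -1, (0, 2): 1,
    (1, 0): 1, (1, 1): 0, (1, 2): -1,
    (2, 0): -1, (2, 1): 1, (2, 2): 0,
}

class StrategyBandit:
    """Selects among candidate strategies by decayed hindsight performance."""

    def __init__(self, num_strategies, decay=0.95):  # decay value is illustrative
        self.weights = [0.0] * num_strategies
        self.decay = decay

    def select(self):
        # Greedy: play the strategy with the highest current weight.
        return max(range(len(self.weights)), key=lambda i: self.weights[i])

    def update(self, proposed_moves, opp_move):
        # Score every strategy by what it *would have* earned this round,
        # decaying old evidence so the agent tracks non-stationary opponents.
        for i, move in enumerate(proposed_moves):
            self.weights[i] = self.decay * self.weights[i] + PAYOFF[(move, opp_move)]
```

Each round, every candidate strategy proposes a move, the currently selected strategy's move is actually played, and once the opponent's move is revealed `update()` rescores all candidates, including the ones that were not played.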
Achieved a top-10 ranking in population-based evaluation against 40+ adversarial bots
Operates under strict per-move runtime constraints with no offline training
Demonstrates robust online adaptation: exploits simple patterned opponents, remains competitive against adaptive agents, and avoids catastrophic exploitation via random (Nash) fallback
Evaluation is performed using the OpenSpiel implementation of the population-based RRPS benchmark, which measures performance and robustness against a fixed population of hand-crafted adversarial bots.
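For orientation only, a bare-bones evaluation loop might look like the sketch below. It assumes an OpenSpiel build with the roshambo bot population enabled and borrows entry points from OpenSpiel's roshambo population example (`pyspiel.roshambo_bot_names`, `pyspiel.make_roshambo_bot`); treat those names as assumptions to verify against the installed version, and note that a uniform-random bot stands in for the adaptive agent here.

```python
import pyspiel

# Repeated RPS as an OpenSpiel repeated game over the matrix_rps stage game.
game = pyspiel.load_game(
    "repeated_game(stage_game=matrix_rps(),num_repetitions=1000)")

def play_match(agent_bot, opp_bot):
    """Play one repeated match; return the total payoff to agent_bot (player 0)."""
    state = game.new_initial_state()
    while not state.is_terminal():
        # Simultaneous-move game: both bots act every round.
        state.apply_actions([agent_bot.step(state), opp_bot.step(state)])
    return state.returns()[0]

# Placeholder for the adaptive agent; the repository's agent, wrapped as a
# pyspiel.Bot, would take its place as player 0.
agent = pyspiel.make_uniform_random_bot(0, 1234)
for name in pyspiel.roshambo_bot_names():          # requires the roshambo build
    opponent = pyspiel.make_roshambo_bot(1, name)  # see caveat in the lead-in
    print(name, play_match(agent, opponent))
```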
Marc Lanctot et al., "Population-based Evaluation in Repeated Rock–Paper–Scissors as a Benchmark for Multiagent Reinforcement Learning," Transactions on Machine Learning Research (TMLR), 2023
Marc Lanctot et al., "OpenSpiel: A Framework for Reinforcement Learning in Games," DeepMind, 2019