Max Pellert (https://mpellert.at)
Beloved standard tool of the social sciences
Often considered the gold standard, “ground truth” (especially when working with large representative samples of a population)
But classical survey methodologies increasingly suffer from problems
First line of the 2024 book “Polling at a Crossroads: Rethinking Modern Survey Research”: “Survey research is in a state of crisis”
Most recent example: the 2024 US presidential election
Biggest issue is non-response
Alternatives?
Britain’s mood, measured weekly
One example of an easily accessible, representative survey (UK) in the affective domain
We can recreate these dynamics with our approach of longitudinal adapters
Not equally well for all constructs
Remember, our approach is just self-supervised next-token prediction (no labels, unlike, for example, the supervised text classification of TweetNLP)
Our approach is very flexible: we can in principle ask any question and get survey-like responses for each week
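A minimal sketch of that idea, assuming weekly LoRA adapters have already been trained on each week’s text with the plain language-modelling objective; the model name, adapter path, question, and answer options below are illustrative placeholders, not the actual setup:

```python
# Minimal sketch: score survey answer options with a weekly LoRA adapter.
# Model name, adapter path, question, and answer options are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-3.1-8B"         # illustrative base model
adapter_path = "adapters/week_2024_12"        # hypothetical weekly adapter

tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_path)   # inject the week's adapter
model.eval()

question = "Overall, how did you feel this week? I felt"
options = [" happy", " sad", " anxious", " calm"]        # illustrative answer set

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # next-token log-probs
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_ids.shape[1] - 1:].sum().item()

scores = torch.tensor([option_logprob(question, o) for o in options])
weights = torch.softmax(scores, dim=0)        # pseudo survey-response distribution
for opt, w in zip(options, weights):
    print(f"{opt.strip():>8}: {w:.2f}")
```

Scoring answer options by their log-probability under each weekly adapter yields one pseudo-response distribution per week, which can then be tracked over time like a survey item.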
Why does that work?
I don’t think we should be replacing survey research
Even with complementary synthetic methods, we will still need classical approaches, for example to learn about the sampling frame
But we should be making use of the text that people are producing (and potentially other modalities too)
These are first steps for now, and we have to rigorously validate what we are doing
Huge potential: Low costs, scalability, unobtrusive observation, high temporal resolution, …
Bridging the gap between “qualitative” data and quantitative insights
LLMs increasingly deployed as autonomous agents
Research gap: alignment with actual human decision-making
Analytical solutions (Nash equilibria) as benchmarks
Rich empirical data from human experiments
Simple, well-defined tasks
Real-world relevance
Goal: Replication of human experimental data with LLMs, systematically validated → novel predictions
Models: Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct
Original Experiment: 500+ humans, 121 games (Human behavioral phenotypes across games: All deviate from Nash equilibrium)
Payoff Structure:
|   | C | D |
|---|---|---|
| C | (10,10) | (S,T) |
| D | (T,S) | (5,5) |
S ∈ [0,10], T ∈ [5,15] → extended in our simulations to S, T ∈ [0,20]
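With R = 10 for mutual cooperation and P = 5 for mutual defection, the (S, T) plane splits into the four classic 2×2 game regions, each with a standard Nash benchmark. A minimal sketch (textbook game theory, not code from the paper; scoring the Stag Hunt as 0.5 is an assumption):

```python
# Nash cooperation benchmark for the symmetric 2x2 game above, with
# R = 10 for mutual cooperation and P = 5 for mutual defection.
R, P = 10.0, 5.0

def nash_cooperation(S: float, T: float) -> float:
    """Cooperation rate predicted by the Nash benchmark for payoffs (R, S, T, P)."""
    if S >= P and T <= R:
        return 1.0                            # Harmony: cooperation dominates
    if S <= P and T >= R:
        return 0.0                            # Prisoner's Dilemma: defection dominates
    if S > P and T > R:
        return (S - P) / ((S - P) + (T - R))  # Snowdrift: symmetric mixed equilibrium
    return 0.5                                # Stag Hunt: two pure equilibria (0.5 assumed)

print(nash_cooperation(S=8, T=12))            # Snowdrift example -> 0.6
```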
“Thinking step-by-step” improves coherence
A logical verifier acts as an “LLM attention check”, specifically checking consistency in the Harmony Game region
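A hypothetical illustration of such a consistency check: in the Harmony region cooperation is the dominant strategy, so a “defect” answer there signals an incoherent response and can be flagged or re-queried (the function and example data below are made up for illustration, not the paper’s actual verifier):

```python
# Hypothetical "LLM attention check": flag choices that are incoherent in the
# Harmony region (S >= P, T <= R), where cooperation is the dominant strategy.
R, P = 10.0, 5.0

def passes_attention_check(S: float, T: float, choice: str) -> bool:
    """Return False if the model defects in a game where cooperation dominates."""
    in_harmony = S >= P and T <= R
    return not (in_harmony and choice == "D")

# Usage: keep only coherent responses (or re-query the flagged ones)
responses = [(9, 8, "C"), (10, 6, "D"), (2, 14, "D")]   # (S, T, choice), illustrative
kept = [r for r in responses if passes_attention_check(*r)]
```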
| Model | MSD (vs. Human) | r (vs. Human) | MSD (vs. Nash) | r (vs. Nash) |
|---|---|---|---|---|
| Llama | 0.031 | 0.89 | 0.089 | 0.77 |
| Mistral | 0.091 | 0.70 | 0.182 | 0.60 |
| Qwen | 0.065 | 0.79 | 0.036 | 0.93 |
| Nash | 0.096 | 0.78 | - | - |
Llama replicates humans (better than Nash); Qwen follows Nash; Mistral intermediate
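MSD and r here can be read as the mean squared deviation and the Pearson correlation between per-game cooperation rates; a minimal sketch of how they might be computed (the arrays are placeholder values, not the study’s data):

```python
# Compare per-game cooperation rates of an LLM against a reference
# (human data or the Nash benchmark). The values below are placeholders.
import numpy as np

llm_coop = np.array([0.9, 0.1, 0.6, 0.4])     # cooperation rate per game (LLM)
ref_coop = np.array([1.0, 0.0, 0.55, 0.5])    # reference: humans or Nash

msd = np.mean((llm_coop - ref_coop) ** 2)     # mean squared deviation
r = np.corrcoef(llm_coop, ref_coop)[0, 1]     # Pearson correlation
print(f"MSD = {msd:.3f}, r = {r:.2f}")
```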
Human vs. Llama similarities:
High cooperation when S ≥ T
Low cooperation when T > R (R = 10, the mutual-cooperation payoff)
Binary-like patterns
Llama (and human) vs Nash:
No mixed equilibria
Discrete choices
Emulating (human) psychological heuristics?
Average cooperation: Llama 40.2%, Human 48.0% vs. Nash prediction of 50%
Extended 121 → 441 games
Llama patterns beyond human-tested space:
S ≥ T diagonal holds
T > R reduces cooperation
Instability near (0,0)
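The counts 121 and 441 are consistent with 11 × 11 and 21 × 21 grids of (S, T) values; a minimal sketch of how the extended game set might be enumerated, assuming integer steps (the actual grid spacing is an assumption):

```python
# Enumerate (S, T) games: 11 x 11 = 121 human-tested games vs.
# the extended 21 x 21 = 441 simulated games (integer steps assumed).
from itertools import product

original = list(product(range(0, 11), range(5, 16)))   # S in [0,10], T in [5,15]
extended = list(product(range(0, 21), range(0, 21)))   # S, T in [0,20]
print(len(original), len(extended))                    # 121 441
```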
Pre-registered experiment for future validation
Solution: Pre-registered experiments decide on the predictive validity of our approach
With the right protocol, we can use LLMs to replicate human patterns and to capture deviations from rationality
Complementary tool for the social and behavioral sciences
Rapid experimental space exploration
Generate hypotheses → validate with humans
AI-assisted scientific discovery
Pre-Print: Palatsi, A. C., Martin-Gutierrez, S., Cardenal, A. S., & Pellert, M. (2025). Large language models replicate and predict human cooperation across experiments in game theory (arXiv:2511.04500). arXiv. https://doi.org/10.48550/arXiv.2511.04500