Escaping the verifiability trap in AI simulations of human behavior
Can you study human behavior without collecting human data? Just a few years ago, this would have seemed like a completely nonsensical proposition. But it has become top-of-mind for me (and many others), given growing interest in the use of LLM-based approaches for simulating human behavior in surveys and lab tasks. This is not just an abstract academic exercise: many startups are claiming to be able to simulate customer behavior in place of conducting traditional user research, and there is widespread interest among model developers in building better user simulators for LLM post-training.
Reflecting all this interest, there is now an emerging field studying the suitability of LLM agents as drop-in replacements for human participants in social science experiments. Most empirical papers take the following form. First, find an existing experiment (or set of experiments) that was conducted with real human participants. Then, re-run the experiment by presenting the experimental stimuli to an LLM agent and collecting its responses (possibly conditioning those responses on a “persona” profile with demographic information). Finally, measure the overlap between the simulated behavior and the ground-truth human participant data.
This research program - measuring agreement between model simulations and human behavior - inherently selects for experiments where that agreement is straightforward to measure. In the Centaur paper, which measured alignment between LLM predictions and human experimental data across a variety of psychological tasks, the most-represented experimental task (comprising over 40% of all human behavioral trial outcomes used to fine-tune the model) is intertemporal choice. In this task, participants select between receiving some amount of money now versus a (greater) amount at a later time - a decision problem with two fixed options. Similarly, in another large study (still under review), the primary set of experiments used to evaluate LLM fidelity were surveys in which participants indicated their responses on a numerical rating scale (e.g., a 7-point scale from “strongly disagree” to “strongly agree”). An LLM can be straightforwardly prompted to select A or B, or to produce a number between 1 and 7, yielding outputs that are immediately comparable to the original empirical data. To use the parlance of AI research, there is a structural bias toward verifiable domains: the types of experiments models are best suited to simulate are the ones where the faithfulness of that simulation (e.g., the “correctness” of the model’s generation) can be readily computed with off-the-shelf metrics.
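To make this concrete, here is a minimal sketch of the evaluation pattern in the rating-scale case. The prompt wording, the `llm` callable, and the choice of correlation as the metric are all illustrative assumptions of my own, not details drawn from any of the papers above:

```python
import numpy as np

def simulate_likert_response(llm, persona: str, item: str) -> int:
    """Prompt an LLM to answer a survey item on a 7-point scale,
    conditioned on a persona profile. `llm` stands in for any
    text-completion interface (prompt string -> completion string)."""
    prompt = (
        f"{persona}\n\n"
        f"Statement: {item}\n"
        "On a scale from 1 (strongly disagree) to 7 (strongly agree), "
        "respond with a single number."
    )
    return int(llm(prompt).strip())

# Because responses are constrained to {1, ..., 7}, agreement with the
# human data reduces to an off-the-shelf metric, e.g. the correlation
# between per-item mean ratings.
def agreement(human_means: np.ndarray, simulated_means: np.ndarray) -> float:
    return float(np.corrcoef(human_means, simulated_means)[0, 1])
```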
Given the boldness of the idea being evaluated - literally removing humans from human behavioral experiments - this gravitation toward closed-ended tasks seems eminently reasonable; we should be able to evaluate these claims quantitatively. But there is one core problem: the types of experiments that are most straightforward to simulate are the most constrained and simplified ones, which often bear the most tenuous connection to the cognitive capabilities and behaviors we presumably would want to capture in simulation.
An old problem, resurfaced
A structural challenge in psychological science is that theoretical constructs of interest cannot be measured directly: instead, an experimenter must operationalize the construct in a form that can be instrumented. For example, if I am interested in social reasoning, I might operationalize it by designing an experiment that evaluates the types of inferences people make about other people’s mental states across a variety of scenarios.
Within cognitive psychology, this process of operationalization has traditionally involved the use of experimental tasks which serve as testbeds for studying the phenomenon of interest. One common thread - and key benefit - of these tasks is that they tend to constrain the set of possible behaviors in a way that makes it easy to verify the predictive accuracy of a statistical or cognitive model.
The core downside, however, is that these tasks often abstract away parts of the problem that could be consequential for understanding the original motivating phenomenon. To see this, let’s revisit the example of intertemporal choice. There are many real-world decisions that are problems of intertemporal choice, like whether one should spend a paycheck or save for the future, or go out with friends or prepare for an important meeting. But the experimental proxy setup - picking between a set of fixed options, each of which has an unambiguous value in the context of the study - is a highly stylized representation. In the real-world setting, people must integrate complex information and internally construct value signals (e.g. the value of catching up with friends versus the value of impressing your boss); these processes presumably also affect how people arrive at their decisions. This is not to say that these kinds of tasks are not useful; some degree of abstraction is inevitable and can often be highly instructive. But it raises questions about the generalizability of such studies - what inferences are licensed about underlying cognitive mechanisms, given the gap between the experimental setting and the real-world behavior?
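To see why the laboratory version is so easy to verify, consider one standard formalization: hyperbolic discounting with a softmax choice rule (a common modeling choice I am using purely for illustration, not one attributed to any particular paper discussed here). With two fixed options and a binary response, evaluating a model collapses to a likelihood computation:

```python
import numpy as np

def hyperbolic_value(amount: float, delay: float, k: float) -> float:
    """Subjective value of a delayed reward under hyperbolic discounting."""
    return amount / (1.0 + k * delay)

def p_choose_later(now_amt, later_amt, delay, k, beta) -> float:
    """Softmax probability of choosing the delayed option."""
    v_now = hyperbolic_value(now_amt, 0.0, k)
    v_later = hyperbolic_value(later_amt, delay, k)
    return 1.0 / (1.0 + np.exp(-beta * (v_later - v_now)))

def log_likelihood(choices, trials, k, beta) -> float:
    """Score a parameterization directly against observed binary choices;
    nothing about the behavior needs interpretation to be 'verified'."""
    ll = 0.0
    for chose_later, (now_amt, later_amt, delay) in zip(choices, trials):
        p = p_choose_later(now_amt, later_amt, delay, k, beta)
        ll += np.log(p if chose_later else 1.0 - p)
    return ll
```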
Back to the question of simulating human behavior: these sorts of experiments, with the most stylized operationalizations of human behavior, are disproportionately the ones being used to adjudicate the claim that LLMs can or cannot simulate human behavior. And so the same questions about generalizability arise with simulation: if a model can recapitulate the distribution of human responses in a simplified two-alternative laboratory experiment, what can we conclude about that model’s capability to simulate human behavior across other classes of decisions? Without appropriate scoping, we risk making unjustified inferential leaps and ultimately misleading ourselves about the extent to which LLMs can capture human behavior.
The problem isn’t LLMs, per se: rather, their proliferation is exposing deeper issues of ecological validity in social science experiments. Experimental psychologists were grappling with the trap of verifiable domains long before that term emerged. But this time around, the stakes are arguably higher, because methods and paradigms from the field are being used to make broad claims about a technology whose reach extends far beyond academic science.
There could, however, be a way out of this morass: the same tools that are surfacing this problem are the ones that could help us develop solutions, by giving us qualitatively new ways to analyze naturalistic human behavior at scale.
New paths forward
If the goal is to ultimately predict “real world” behavior, a clear next step is to study human behavior in settings that more closely reflect the dynamics of the original motivating phenomena. In the past, this was an extremely difficult thing to ask of scientists, because there was an inherent tradeoff between naturalism and quantitative analysis: you could study people doing the most complex thing in the world, but you wouldn’t have (quantitative) tools to make sense of those behaviors.
LLMs (and AI more broadly) fundamentally change that calculus: it is now possible to extract much richer dependent variables from naturalistic, open-ended traces of behavior, thereby expanding the complexity and scope of our (human) experimental paradigms. To provide an example: in some recent work of my own, we collected a large dataset of free-form natural language conversations among pairs of human participants tasked with solving an open-ended joint planning problem; these pairs were then tasked with implementing their joint plan. We used an LLM to convert the structured agreements participants expressed in language into Python program representations. These program-like representations allowed us to simulate how a pair with a given strategy might behave on a different set of inputs. This in turn made it possible to directly compare strategy use across pairs and thereby to track how the internal algorithmic structure of the agreements changed over time - a dependent variable that simply would not have been feasible to measure in the pre-LLM era.
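To give a flavor of what such a program representation can look like, here is a hypothetical stand-in of my own construction (not the actual representation from the study):

```python
# A natural-language agreement like "you take the red items, I take the
# blue ones, and we split anything ambiguous" can be distilled (via an
# LLM) into an executable policy.
def strategy(item: dict) -> str:
    if item["color"] == "red":
        return "partner_A"
    if item["color"] == "blue":
        return "partner_B"
    return "split"

# Once an agreement is a program, it can be run on held-out inputs,
# which is what makes strategies directly comparable across pairs
# and across rounds of the task.
novel_items = [{"color": "red"}, {"color": "green"}, {"color": "blue"}]
assignments = [strategy(item) for item in novel_items]
print(assignments)  # ['partner_A', 'split', 'partner_B']
```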
Of course, this is just one study, and it does not solve all of the methodological challenges I outlined above: this task is not a fully comprehensive proxy for how people engage in joint planning in the “real world”. But it helps, I hope, to bolster the case that we no longer have to choose between greater naturalism in experimental paradigms and quantitative analysis: increasingly, we can have both.
Once we build tools to characterize more complex, in-the-wild human behaviors, we will be better equipped to revisit the simulation debate and evaluate whether LLMs can faithfully capture those behaviors. When conducting human experiments, we are often left wondering whether theories that describe the pattern of findings in a particular experiment have any explanatory power beyond the lab setup. With LLMs, we can actually evaluate this quantitatively: we can test whether models that are optimized to recapitulate behavior in laboratory tasks can also predict behavior in more naturalistic settings.
This is a key benefit afforded by LLMs compared to previous classes of cognitive models: they can produce much more open-ended outputs (i.e., anything that can be expressed in natural language). So the same model that is optimized to match the distribution of behavior in binary-choice lab experiments (through fine-tuning, prompting, scaffolding within a larger system, or any combination of these interventions) can also be used to simulate how people make more open-ended decisions. Given everything we’ve observed about AI progress, it seems within reason that a model could eventually predict human decisions in forced-choice laboratory experiments with a high degree of accuracy; the more interesting question is whether this model could capture behavior in other classes of decision problems. Examining this necessitates finding more naturalistic behavioral datasets for the final evaluative step of checking agreement between model and human behavior - conversational traces from social media datasets could serve as one potential trove.
If we’re dealing with more open-ended forms of data and can’t use traditional metrics, how do we compare open-ended LLM simulations to ground-truth human behavioral data? LLMs can again be part of the solution: the same pipeline used to characterize structure in human behavior can be repurposed to characterize structure in simulator outputs, allowing for direct comparisons. Here, we’re drawing a distinction between verifiable and measurable: while agreement between open-ended human behavior and agent simulations cannot generally be reduced to a single standardized metric, it can still be instrumented with rigor and quantitative precision. Of course, this still requires careful craftsmanship on the part of the (human) scientist: identifying a suitable dependent variable, clearly operationalizing a definition of similarity with respect to that dependent variable, providing high-quality example annotations, and validating that LLM judgments actually align with experimenter-defined criteria.
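A minimal sketch of what such a pipeline could look like is below. The rubric, the strategy labels, and the `llm` callable are hypothetical placeholders of my own; the point is the shape of the pipeline: extract the same dependent variable from human and simulated traces, validate the LLM judge against human annotations, and only then compare the induced distributions.

```python
from collections import Counter

RUBRIC = (
    "Classify the planning strategy in this transcript as one of: "
    "'turn_taking', 'role_division', 'ad_hoc'. Respond with the label only."
)

def extract_dv(llm, transcript: str) -> str:
    """Use an LLM to extract a categorical dependent variable
    from an open-ended behavioral trace."""
    return llm(f"{RUBRIC}\n\nTranscript:\n{transcript}").strip()

def validate_judge(llm, annotated: list[tuple[str, str]]) -> float:
    """Agreement between LLM labels and gold human annotations;
    the pipeline should be trusted only if this is high."""
    hits = sum(extract_dv(llm, t) == gold for t, gold in annotated)
    return hits / len(annotated)

def compare_distributions(llm, human_transcripts, simulated_transcripts):
    """Apply one measurement pipeline to both human and simulated data,
    then compare the induced label distributions."""
    human = Counter(extract_dv(llm, t) for t in human_transcripts)
    simulated = Counter(extract_dv(llm, t) for t in simulated_transcripts)
    return human, simulated
```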
In a recent essay, Hiranya Peiris made the case that the advent of LLMs in scientific practice is not actually creating qualitatively new problems, but rather exposing longstanding ones. Issues such as the explosion of low-quality papers and the scientific shallowness of “data science for x” approaches existed long before LLMs burst onto the scene. A similar dynamic is emerging in the discourse around LLMs for simulating human behavior: we are rediscovering core methodological problems that social science has been grappling with for decades.
There is one (bleak) future in which the use of LLMs for simulation leads us to “squeeze more predictive juice out of flawed theories and inadequate paradigms”, as put vividly by Sayash Kapoor and Arvind Narayanan. But I think there is reason for optimism, because these tools fundamentally extend what is measurable and therefore expand the types of questions we can ask about complex behavioral datasets. AI could fulfill its promise of transforming social science - if viewed not merely as a way to scale up what’s been done, but rather to measure what we never could before.