No One Can Predict Anyone, Not Even You

In 1985, Coca-Cola ran roughly 200,000 taste tests. Participants preferred the sweeter new formula by a clear margin. So the company reformulated its flagship drink, launched New Coke, and watched the public revolt. Seventy-nine days later, the original was back on shelves. Two hundred thousand people had told Coca-Cola exactly what they wanted. Two hundred thousand people were wrong about themselves.
Fast forward thirty-five years. In 2020, Quibi raised 1.75 billion dollars to build a short-form streaming app, with A-list talent, deep market research, and two of the most respected executives in tech. It projected roughly seven million subscribers in year one. Six months after launch it had around 500,000, and it folded. Again, all the conviction and money in the world could not turn "people say they want this" into "people actually do it."
Fast forward once more, to 2026. One of the most starred new AI projects on GitHub is an open-source "swarm intelligence engine" called MiroFish, whose tagline is literally "Predict Anything." Open its own FAQ, though, and it quietly tells the truth: treat the output as "exploratory decision support," a rehearsal "before using judgment, analytics, and real-world validation." Even the boldest tool in the room, when it is being honest, retreats from the word predict to the word rehearse.
Forty years. The instruments went from focus groups to a million interacting AI agents. The mistake never changed. We keep believing we can predict what people will do, and we keep being wrong.
This is an essay about why that belief is a category error, where the real limit sits, and the one thing that actually works.
The trillion-dollar obsession with predicting user behavior
The dream of predicting consumer behavior is everywhere, and it is expensive. Companies spend on the order of a billion dollars a year on purchase-intent forecasting alone. The payoff is poor. Roughly four out of five new consumer products fail in market. And the people who tell a survey they will "definitely buy" follow through far less often than the answer implies. Practitioners routinely discount even that top-box group, because only somewhere between two-thirds and three-quarters of them actually purchase.
Nowhere is the gap clearer than in sustainability. Surveys consistently find that 60 to 80 percent of consumers say they care about sustainability and will pay more for greener products. Then Bain & Company measured what happens at the shelf: consumers say they will pay about a 12 percent premium, while companies actually charge a sustainability premium closer to 28 percent, more than double what shoppers say they will bear, and green products remain a small slice of actual sales. People are not lying. They genuinely believe what they say. They simply cannot forecast their own behavior under real budgets, real timing, and real trade-offs.
Here is the part most commentary misses. This is not a flaw of AI, surveys, or any single method. The obsession is older than the technology, and it is not really the researchers who are confused. Good researchers have taught the limits of stated preference for decades. It is the market's demand for prediction, the executive who wants a study to function as a prophecy, that keeps the delusion alive. The tool is not the problem. The target is.
Why you can't predict what one person will do
Start with the cleanest finding in the field: the say-do gap, also called the intention-behavior gap. Across hundreds of studies, intention correlates with behavior at about r = 0.53. That sounds strong until you translate it: it explains only about 28 percent of the variance, and roughly 47 percent of people who say they intend to do something never do it. The economist's version is even blunter. People who sign up for a gym on an expensive monthly plan attend so rarely that they pay far more per visit than a pay-as-you-go pass would cost. They confidently predicted their future selves, and their future selves did not show up.
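The translation from correlation to variance explained is just a square, and it is worth doing by hand once. A minimal sketch, where the function name is ours and the only input is the r = 0.53 figure quoted above:

```python
def variance_explained(r: float) -> float:
    """Share of variance in behavior accounted for by a correlation r (r squared)."""
    return r ** 2


r_intention = 0.53  # intention-behavior correlation quoted in the text
share = variance_explained(r_intention)

print(f"r = {r_intention}: {share:.0%} of variance explained, {1 - share:.0%} left over")
```

The same arithmetic applies to any correlation in this essay: a number that sounds strong as r usually looks modest once it is squared.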
Why is one person so hard to forecast? Two reasons, one about too few dimensions and one about too many.
The low-dimension problem is that we badly overestimate how much of a person a handful of traits can capture. The Big Five personality model, the most validated framework psychology has, explains only about 9 to 16 percent of the variance in any single behavior. Traits do predict a person's behavior averaged over many situations far better, which is the individual-versus-aggregate lesson playing out inside one person, but for the specific next act the ceiling is low. The trait-to-behavior correlation has a famous name, the "personality coefficient," and a famous size, roughly r = 0.2 to 0.4. The popular alternative, the Myers-Briggs Type Indicator, is a cautionary tale: when people retake it after just five weeks, roughly half get a different four-letter type. A model that reclassifies half its subjects in a month is not measuring a stable thing. And stability alone would not rescue it: a clock stopped at 3:00 is perfectly consistent and perfectly useless, because consistency is not the same as accuracy. The fact that millions of people believe in dimensional typing tells you about the appeal of the story, not the validity of the prediction.
The high-dimension problem is the mirror image. A single real behavior is the product of a trait, times a situation, times a passing mood, times pure noise. The situational and random components swamp the stable trait component. The most humbling data point: a real person re-answering the same survey two weeks later is only about 81 percent consistent with their own earlier answers. You are not a deterministic function of your own personality. You cannot reliably predict yourself.
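A toy simulation makes the point concrete. This is a sketch under loud assumptions: it uses an additive model (behavior = stable trait + situational noise) rather than the multiplicative picture above, and the noise level, sample sizes, and function names are invented for illustration, not taken from any study. What it shows is the shape of the result: a trait correlates weakly with one specific act and strongly with the average of many.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(n_people=2000, n_acts=50, noise_sd=2.0, seed=0):
    """Return (r for a single act, r for the average of n_acts acts)."""
    rng = random.Random(seed)
    traits, single_acts, averaged_acts = [], [], []
    for _ in range(n_people):
        trait = rng.gauss(0.0, 1.0)  # stable disposition, assumed sd of 1
        # Each act is the trait plus situational noise that is larger than the trait.
        acts = [trait + rng.gauss(0.0, noise_sd) for _ in range(n_acts)]
        traits.append(trait)
        single_acts.append(acts[0])
        averaged_acts.append(sum(acts) / n_acts)
    return pearson(traits, single_acts), pearson(traits, averaged_acts)


r_single, r_average = simulate()
print(f"trait vs. one specific act:    r = {r_single:.2f}")
print(f"trait vs. average of 50 acts: r = {r_average:.2f}")
```

Nothing about the person changed between the two lines of output. Only the target did: the next act versus the long-run average.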
When researchers tried to beat this with scale, they lost. In the Fragile Families Challenge, 160 teams used machine learning and a rich longitudinal dataset of thousands of variables per family to predict life outcomes. The best models reached an R-squared of about 0.2 for the most predictable outcome, and close to zero for most others. More data and better algorithms did not lift the ceiling, because the ceiling is a property of human behavior, not of the model.
One honest caveat keeps this from overreaching: routine, habitual behavior is predictable. You will probably buy coffee again, open the same app, take the same commute. What resists prediction is the specific, novel, non-habitual act, which is exactly the kind of behavior most business decisions hinge on.
"But we predict behavior all the time": the individual versus aggregate line
At this point a sharp reader objects, correctly: insurers, banks, and tech companies predict behavior every single day, and they are right. So which is it?
Both, and the resolution is the spine of this entire essay. You can predict consumer behavior in aggregate, never for a single person. What those industries forecast is the rate across a pool, not the act of any one member.
A life insurer's mortality tables forecast how many people in a large pool will die next year with stunning accuracy. They cannot tell you whether you will. A credit score, in FICO's own words, exists to "rank-order the likelihood of borrowers' credit repayment risk" and is explicitly "not designed to provide a specific, fixed estimate of credit risk" for an individual. The odds are defined over a pool of 760-scorers versus a pool of 700-scorers. A/B tests, demand forecasts, churn models, and election forecasts are all the same shape: precise about the group, agnostic about the person.
This is just the law of large numbers. As the nineteenth-century statistician Quetelet observed, crime rates and suicide rates are eerily stable from year to year, even though no individual crime or suicide can be foreseen. Random individual variation cancels out across many people. The signal lives in the ensemble, not the unit.
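The cancellation is easy to watch happen. In the sketch below, every member of a pool buys with the same hypothetical 30 percent probability (an invented number, as are the pool sizes and the function name). No one's individual act is knowable, yet the pool's rate tightens around 30 percent as the pool grows.

```python
import random


def observed_rate(pool_size, true_rate, rng):
    """Fraction of a pool that acts when each member acts with probability true_rate."""
    buyers = sum(1 for _ in range(pool_size) if rng.random() < true_rate)
    return buyers / pool_size


TRUE_RATE = 0.30  # hypothetical per-person probability, for illustration only
rng = random.Random(42)

for pool_size in (10, 1_000, 100_000):
    rate = observed_rate(pool_size, TRUE_RATE, rng)
    print(f"pool of {pool_size:>7,}: observed rate {rate:.3f}, "
          f"error {abs(rate - TRUE_RATE):.3f}")

# For any single member, the best available guess is still "will not buy",
# and it is wrong about 30 percent of the time no matter how large the pool gets.
```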
Two cautions keep this honest. First, aggregate prediction holds only while the underlying behavior is stable. When the regime shifts, even aggregate models shatter: some COVID-19 forecasts were off by an order of magnitude or more, precisely because human behavior and policy changed in ways the models could not see. Second, beware the viral statistic that "human behavior is 93 percent predictable." That number, from a well-known mobility study, is a theoretical upper bound derived from entropy, not an achieved accuracy, and what it actually predicts is routine, the fact that you are usually at home or at work. It is the predictability of habit, not of the person.
The one-line takeaway: predict the crowd, not the person.
Why more behavioral data, even social media, doesn't crack it
If individual behavior resists prediction but we now have oceans of behavioral data from social media and clickstreams, surely that closes the gap? It does not, and the reason is a single principle worth memorizing.
Aggregation cancels random noise. It does not cancel systematic bias.
Insurance and credit work because the predictor and the target are the same kind of behavior, observed directly. Past payment behavior predicts future payment behavior. Past purchases predict future purchases. The chain from social media to purchase is a different one, and it breaks for three reasons. First, a domain mismatch: what you post, like, and share is expressive behavior, not purchasing behavior. This is the classic gap between stated and revealed preference, the difference between what people say and what they do when money is on the line. Aggregating a billion expressive signals does not remove that directional bias; it just gives you a very confident wrong answer. Second, representativeness: the people who post are a vocal minority, not the buying population. Third, the empirical link is simply weak. A 2026 meta-analysis of psychological targeting found digital footprints explain only about 5 percent of the variance in personality, and the downstream effect on behavior is near zero. Adding social-media sentiment to a sales forecast tends to help only marginally and inconsistently across studies, and what it tracks is attention, not purchase.
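The principle that aggregation cancels noise but not bias can be checked in a few lines. This is an illustrative sketch, not a model of any real dataset: the true purchase rate, the size of the expressive bias, and the noise level are all made-up numbers, and the function name is ours.

```python
import random


def aggregate_signal(n, true_value, bias, noise_sd, rng):
    """Average of n signals, each equal to true_value + bias + random noise."""
    total = sum(true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n))
    return total / n


TRUE_RATE = 0.30        # hypothetical real purchase rate
EXPRESSIVE_BIAS = 0.15  # hypothetical overstatement shared by every signal
NOISE_SD = 0.20         # hypothetical per-signal random noise

noisy_only = aggregate_signal(100_000, TRUE_RATE, 0.0, NOISE_SD, random.Random(0))
noisy_and_biased = aggregate_signal(100_000, TRUE_RATE, EXPRESSIVE_BIAS, NOISE_SD, random.Random(0))

print(f"truth:                       {TRUE_RATE:.3f}")
print(f"noise only, averaged:        {noisy_only:.3f}")
print(f"noise plus bias, averaged:   {noisy_and_biased:.3f}")
```

With noise alone, the average lands on the truth. With a shared bias, the average lands just as precisely on the wrong number, which is what a very confident wrong answer looks like.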
The nuance that keeps this fair: social data does predict same-domain social behavior. It can tell you which headline gets more clicks, because clicks are the same domain as posting. It just cannot jump the gap to purchase. And the same boundary applies across product categories: behavior in one category transfers weakly to another. Marketing science finds cross-category correlations of only about 0.32 to 0.58 even for price and promotion sensitivity, which means a model calibrated on coffee tells you very little about enterprise software. You do not need more data. You need the right kind of data, in the right domain, tied to real outcomes.
What simulation actually does: group choice, not individual behavior
So where do AI simulations and synthetic users fit, if not as crystal balls? The honest answer reframes the whole category.
First, a distinction that dissolves an apparent paradox. Generating a synthetic individual is not the same as predicting a real one. Whether you call them synthetic personas or synthetic respondents, a synthetic agent is a sampling unit used to build a population distribution, not a forecast of any named person. No serious practitioner claims a synthetic agent tells you what customer Jane will do next Tuesday. The foundational paper in this space, Argyle and colleagues' Out of One, Many, names the goal algorithmic fidelity: matching the distribution of a population, while making clear it does not imply the model can simulate a specific individual. The individual is scaffolding. The population is the unit of validity.
Second, what these systems are good at is simulating choice in a controlled frame, not action in the wild. Given these three options described this way, which does a population prefer, and what trade-offs drive it? This is the same question that choice-based conjoint analysis, a survey method run on real people, has answered for decades. Done well, conjoint reaches held-out hit rates of roughly 50 to 70 percent, with aggregate share predictions often more accurate than individual choices. Whether language-model simulation can match that track record is the open question, and the honest current answer is this: only with category-specific calibration, and not yet at the individual level. It works at all because it stays inside the decision frame and does not try to bet on the chaos of realized behavior.
Third, a warning the field is still digesting. Language models have a homogenizing pull toward an "average persona." In a rigorous Columbia study, Digital Twins as Funhouse Mirrors, researchers built individual twins from more than 500 data points per person and still found the twins under-dispersed in 94 percent of outcomes, with an individual correlation to their real human of only about r = 0.20. Here is the trap: aggregation does not fix this, because the bias is systematic and in the same direction for every agent. A beautiful aggregate accuracy number can sit on top of a population of nearly identical, collapsed individuals. The crowd looks right while having lost exactly the diversity you were trying to capture.
Calibration is the whole game of behavioral prediction, and it's category-specific
If simulation produces stated, in-frame, possibly over-homogenized choice, what makes it trustworthy? One thing: calibration to real outcomes.
Think of the thermometer that always reads five degrees high. On its own it is useless. But hold it next to a known temperature a few times, learn that it is biased by plus five, and subtract that going forward, and you have a reliable instrument. Calibration is exactly this for behavioral simulation: you run predictions on cases where the real outcome is already known, measure the systematic gap, and correct it.
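The thermometer version fits in a few lines. A minimal sketch with invented readings and hypothetical function names; real calibration is rarely a single constant offset, but the loop is the same: compare against known outcomes, measure the systematic gap, subtract it.

```python
from statistics import mean


def estimate_bias(predicted, actual):
    """Average signed gap between predictions and known outcomes."""
    return mean(p - a for p, a in zip(predicted, actual))


def calibrate(raw_prediction, bias):
    """Correct a new prediction by the bias learned on known cases."""
    return raw_prediction - bias


# Cases where the truth is already known (illustrative numbers).
thermometer_readings = [25.0, 30.0, 15.0]
known_temperatures = [20.0, 25.0, 10.0]

bias = estimate_bias(thermometer_readings, known_temperatures)
print(f"learned bias: {bias:+.1f} degrees")
print(f"new reading of 27.0 corrected to {calibrate(27.0, bias):.1f}")
```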
Two properties make or break it, and they are worth turning into questions you ask before trusting any simulated panel. First, is the calibration category-specific? The say-do gap itself varies enormously by category. Stated intent predicts car purchases almost perfectly in aggregate, while for aspirational or social-signaling goods like sustainable food the gap is two to three times worse, so a calibration learned in one category does not transfer cleanly to another. Second, does it correct the spread of opinion, not just the average? Because aggregation can hide homogenized individuals, the real test is whether the simulated diversity matches the real diversity, which you check by comparing the spread of synthetic responses to the spread of a human benchmark, not just the means.
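The second question can be turned into a check you can run yourself. The sketch below uses made-up ratings on a 1-to-5 scale and a hypothetical helper, `dispersion_ratio`; the point is only that two panels can share a mean while one has lost most of its spread.

```python
from statistics import mean, pstdev


def dispersion_ratio(synthetic, benchmark):
    """Spread of synthetic responses relative to the human benchmark.

    Near 1.0 means the simulated diversity matches; well below 1.0 means
    the synthetic panel has collapsed toward an average respondent.
    """
    return pstdev(synthetic) / pstdev(benchmark)


# Illustrative 1-to-5 ratings, not real survey data.
human = [1, 2, 3, 4, 5, 1, 5, 2, 4, 3]
synthetic = [3, 3, 3, 3, 3, 2, 4, 3, 3, 3]

print(f"human mean {mean(human):.1f}, synthetic mean {mean(synthetic):.1f}")
print(f"dispersion ratio: {dispersion_ratio(synthetic, human):.2f}")
```

Here the means agree exactly and the ratio still comes out far below 1.0. A vendor who reports only the first line has told you nothing about the second.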
This sets a standard worth saying out loud, because the market mostly ignores it. Most accuracy numbers you will see in this space, the 90-something-percent figures, are self-reported by the vendor. Genuinely independent, public, reproducible validation is rare. The honest claim is not "we hit 92 percent." It is "here is our method, here is our data, here is the prediction we registered in advance, here is the real outcome, go check it yourself." Until a number can survive that, treat it as a hypothesis, not a result.
The weather forecast model of human insight
There is a mature field that predicts human-scale events well, and it made peace with all of this a century ago: meteorology.
A weather forecast cannot tell you whether a raindrop will land on your head. That single event is hopelessly noisy. What it can tell you, with genuine and improving accuracy, is that there is a calibrated 70 percent chance of rain across the region this afternoon. Aggregate, probabilistic, and validated against what actually happened, over and over, until the numbers can be trusted.
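"Validated against what actually happened" has a concrete form. One common way to score probabilistic forecasts is a reliability table plus a Brier score; the sketch below uses invented forecasts and outcomes, and the function names are ours. A forecaster is calibrated when, across all the times it said 70 percent, the event happened about 70 percent of the time.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes):
    """Map each stated probability to the observed frequency of the event."""
    buckets = defaultdict(list)
    for prob, happened in zip(forecasts, outcomes):
        buckets[prob].append(happened)
    return {prob: sum(hits) / len(hits) for prob, hits in buckets.items()}


def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probability and outcome (0 is perfect)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Illustrative record: ten "70 percent" days and ten "20 percent" days.
forecasts = [0.7] * 10 + [0.2] * 10
outcomes = [1] * 7 + [0] * 3 + [1] * 2 + [0] * 8  # 1 = it rained

for stated, observed in reliability_table(forecasts, outcomes).items():
    print(f"said {stated:.0%}, happened {observed:.0%} of the time")
print(f"Brier score: {brier_score(forecasts, outcomes):.3f}")
```

The same scoring works unchanged if "rain" is replaced by "the population chose option A," which is what makes the weather forecast a usable template rather than just a metaphor.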
That is the correct model for human insight, and the correct posture for anyone building or buying behavioral simulation. Stop asking what one person will do. Nobody can answer that, not the best AI, not the best researcher, not the person themselves. Ask instead how a population will lean among a set of options, hold that answer to a real-world outcome, and report it with the humility of a probability rather than the false confidence of a fact.
Predict the crowd, not the person. Simulate a population, calibrate it to reality, and never pretend the map is the territory. The forty-year mistake was never the tool. It was the question. Change the question, and the science finally works in your favor.