iMario Accuracy Benchmark · v2
iMario's Synthetic Audiences: Proven as Accurate as Real Research.
Built on a decade of customer research and human insights, iMario pairs a real grasp of how people actually think and behave with careful engineering, anchored throughout in real demographic and sociological data, to simulate persistent synthetic individuals.
For anyone using them as a synthetic audience for market research, as synthetic users for consumer insight, or as a digital twin of themselves, one question matters: do they act and answer like real people?
That accuracy is what our customers care about most, and where iMario puts the most effort.
Instead of easy consumer-type questions, we validate on the hardest ones: politics, values, and social attitudes. That means 900+ questions and real answers from ~47,000 people across 11 populations, reaching 89% consistency with the real distributions, and 90%+ when the model can reference the most trusted public surveys.
Proven in the open
- Named public surveys: Pew Global, ANES (US), CGSS (China), Stack Overflow (Global Developers).
- Tested blind: when scoring a question, the model never sees the real population’s answer to it.
- Both industry-standard metrics, 1-TV and 1-MAE, reported side by side so the numbers compare directly across the field.
This benchmark is open, so you can reproduce the results, and we would love your contributions to make it better. Bring new datasets, populations, or critiques, and help build a broader, shared standard for synthetic-audience accuracy. Issues and pull requests are welcome on GitHub: iMario-benchmark.
This benchmark at a glance
About iMario
What iMario is
Every important decision a company makes should start with one question: what do real people actually think?
For a century that answer has been slow, expensive, and out of reach for all, including the largest brands. iMario is trying to change that. It simulates synthetic audiences calibrated against real-world human data, so any team can run research, test concepts, and more with millions of people, and trust what comes back in hours. Instant. Scalable. Verifiable.
Under the hood, iMario turns real-world people into interactive, memorable, orchestrable, and reusable Synthetic Individuals. The same individuals flex across the work: organized into synthetic audiences for customer research, standing in for synthetic users in product validation, playing the synthetic customer in sales rehearsal, becoming a LinkedIn or personality digital twin, or serving as the personality and human layer inside an AI agent. It is the simulation and on-demand recall of a human point of view.
Where iMario stands
Leading, and fully open.
Strict scale (1-TV)
How far the synthetic distribution sits from the real one
Lenient scale (1-MAE)
The same gap averaged over options, so it always reads higher
| Vendor | Headline metric | Named data? | Both metrics? | Reproducible? |
|---|---|---|---|---|
| iMario | 1-TV & 1-MAE | ● | ● | ● |
| Artificial Societies | 1-TV | ○ | ○ | ○ |
| Electric Twin | 1-MAE | ○ | ○ | ○ |
| SyntheticUsers | subjective rubric (n≈8) | ○ | ○ | ○ |
| Aaru | no public benchmark | ○ | ○ | ○ |
Methodology
How we score accuracy
raw 1-TV · Strict standard
How far the synthetic answer split sits from the real one. A perfect match is 100%, and nothing can pad the score. This is the strict, honest read, and the one we lead with.
1-MAE · Lenient standard
The same gap, but spread thin across every answer option, so the score drifts upward on its own as a question adds more choices. It always reads higher than the strict number.
Two metrics, side by side.
Both compare the synthetic answer split to the real one. The difference is how easy each is to flatter.
- 1-TV measures the true gap, and no amount of framing can make it look better than it is. The strict standard.
- 1-MAE spreads that same gap across every answer option, so the score quietly climbs on its own and always reads higher for the same data. The flattering one.
When a vendor only ever headlines a 1-MAE and never shows the strict 1-TV, the flattering math is doing the talking. We lead with 1-TV, publish 1-MAE beside it for honest comparison, and never quote one against the other.
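The two scores can be sketched in a few lines. This is a minimal illustration, not the benchmark's published scoring code: it assumes TV is the usual total variation distance (half the sum of absolute differences between the two answer splits) and MAE is the mean absolute difference per answer option. The function names and the toy distributions are hypothetical.

```python
from typing import Sequence


def _check(real: Sequence[float], synth: Sequence[float]) -> None:
    """Both splits must list the same answer options in the same order."""
    if len(real) != len(synth) or len(real) == 0:
        raise ValueError("distributions must be non-empty and the same length")


def one_minus_tv(real: Sequence[float], synth: Sequence[float]) -> float:
    """Assumed definition: 1 - 0.5 * sum(|real_i - synth_i|)."""
    _check(real, synth)
    return 1.0 - 0.5 * sum(abs(r - s) for r, s in zip(real, synth))


def one_minus_mae(real: Sequence[float], synth: Sequence[float]) -> float:
    """Assumed definition: 1 - mean(|real_i - synth_i|) over answer options."""
    _check(real, synth)
    return 1.0 - sum(abs(r - s) for r, s in zip(real, synth)) / len(real)


# Toy four-option question (made-up shares, not benchmark data).
real = [0.40, 0.30, 0.20, 0.10]
synth = [0.30, 0.35, 0.20, 0.15]

print(round(one_minus_tv(real, synth), 4))   # 0.9
print(round(one_minus_mae(real, synth), 4))  # 0.95
```

Under these assumed definitions, 1-MAE equals 1 - (2/k)·TV for a question with k options, so the two scores coincide on two-option questions and 1-MAE pulls ahead as options are added.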
Why this is a harder number than it looks
- We test the divisive questions, not the easy ones. Consumer and brand questions are mostly high-consensus, so almost anything looks accurate. We built this from the hard topics instead (politics, values, religion, social fairness, and institutional trust) and score every population on its full pool, never a hand-picked easy subset.
- We report the honest, no-peeking number. Every question is answered blind, with no access to that population's own real answer. Let the model reference a recent real survey on the topic, the setting many vendors quietly report on, and the same engine reads 90%+. What you see here is the floor.
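One way to picture the blind setting is as a filter over the real answers the model is allowed to reference. The sketch below is an assumption about how such a rule could be enforced, not iMario's pipeline: the `RealAnswer` record and `blind_reference` function are hypothetical, and it follows the rule stated in the disclosures (no own-population answer to the question, and no same-year answer to it from any population). Excluding the population's own answer from every year is a further assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RealAnswer:
    """Hypothetical record of one real survey result."""
    population: str
    question_id: str
    year: int
    distribution: Tuple[float, ...]


def blind_reference(pool: List[RealAnswer], population: str,
                    question_id: str, year: int) -> List[RealAnswer]:
    """Return the real answers that stay visible when scoring one question."""
    visible = []
    for item in pool:
        same_question = item.question_id == question_id
        own_answer = same_question and item.population == population
        same_year = same_question and item.year == year
        if own_answer or same_year:
            continue  # hidden: it would leak the answer being scored
        visible.append(item)
    return visible


# Toy pool (made-up values, not benchmark data).
a = RealAnswer("US", "q1", 2024, (0.5, 0.5))     # the answer being scored
b = RealAnswer("China", "q1", 2024, (0.6, 0.4))  # same question, same year
c = RealAnswer("China", "q1", 2020, (0.7, 0.3))  # same question, older, other population
d = RealAnswer("US", "q2", 2024, (0.2, 0.8))     # different question
e = RealAnswer("US", "q1", 2020, (0.4, 0.6))     # own population, older wave

visible = blind_reference([a, b, c, d, e], "US", "q1", 2024)
print([(v.population, v.question_id, v.year) for v in visible])
```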
The honest baseline
Real research isn't a perfect gold standard either
Research based on real people is not the truth either. Ask the same person again next week and the answer moves; panels are gamed by professional respondents and bots; people shade sensitive answers; and most studies reach only a few people who rarely match the population.
Real fieldwork is also noisy, biased, and limited.
So the honest bar is not 100%. Rerun the same survey on a fresh real sample and the results agree only about 93% of the time. iMario reaches 89% on the strict scale, about 95% of that human ceiling, landing inside the range that two real samples would fall in.
Ask one person the same question twice and they agree only ~81% of the time. Individuals wobble far more than the population, where those shifts cancel out to the ~93% above.
Online panels lean on professional survey-takers, with bots, duplicates, and straightliners farming the incentive. Rarely the population they claim to be.
Social desirability, acquiescence, and satisficing pull stated answers away from real behavior, hardest on the sensitive, divisive questions this benchmark is built from.
Who we tested
Real populations, real composition
Results
How close to the real answers
Quantitative · All populations
Each bar is the synthetic-vs-real answer-distribution match across that population's full question set.
Qualitative · United States
1,000 synthetic US audiences answer these questions in their own words. We compare the predicted mix of topics against the real answers for the 86.2% match above. As a stricter check, an independent coder then reads every answer, real and synthetic, sorts each into one of 16 topics, and re-scores from scratch: 81.8%.
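The topic-mix comparison described above can be sketched as follows. This is illustrative only: the page does not state which metric produces the qualitative match, so the sketch assumes 1-TV over topic shares, and it uses three toy topics instead of the 16 in the real coding scheme. The function names and labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def topic_mix(labels: Sequence[str], topics: Sequence[str]) -> List[float]:
    """Share of coded answers falling into each topic, in a fixed topic order."""
    if not labels:
        raise ValueError("need at least one coded answer")
    counts = Counter(labels)
    return [counts[t] / len(labels) for t in topics]


def mix_match(real_labels: Sequence[str], synth_labels: Sequence[str],
              topics: Sequence[str]) -> float:
    """Assumed score: 1 - total variation distance between the two topic mixes."""
    real = topic_mix(real_labels, topics)
    synth = topic_mix(synth_labels, topics)
    return 1.0 - 0.5 * sum(abs(r - s) for r, s in zip(real, synth))


# Toy coded answers (made-up labels, not benchmark data).
topics = ["economy", "health", "immigration"]
real_labels = ["economy"] * 5 + ["health"] * 3 + ["immigration"] * 2
synth_labels = ["economy"] * 4 + ["health"] * 4 + ["immigration"] * 2

print(round(mix_match(real_labels, synth_labels, topics), 4))  # 0.9
```

Re-scoring with an independent coder amounts to running the same comparison on a second set of labels for the same answers.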
Per-question explorer
Every question, real vs synthetic
Accuracy across all 922 questions
Each dot = one question (raw 1-TV) · Box = middle 50% · Line = median · ◆ = mean
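The chart's summary marks can be recomputed from a list of per-question scores. A minimal sketch, assuming the box uses inclusive quartiles (the page does not say which quartile convention the chart follows); the `summarize` helper and the toy scores are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Quartiles (box), median (line), and mean (diamond) of per-question scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.fmean(scores),
    }


# Toy per-question raw 1-TV scores (made-up values, not benchmark data).
scores = [0.80, 0.85, 0.90, 0.95, 1.00]
summary = summarize(scores)
print({name: round(value, 4) for name, value in summary.items()})
```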
In practice
What this accuracy means for your decisions
When the synthetic split says 62/38, the real split lands within a few points. You act on the read, not a guess.
Pick the winning concept, price, or feature with confidence that the order would hold up with real people.
Niche professionals, hard-to-reach segments, and whole countries, modeled and ready in hours instead of weeks.
Run the study, get a population-accurate read, and move, without waiting on fieldwork.
FAQ
Frequently asked questions
What is a synthetic individual?
The synthetic individual is the atom iMario is built from: a persistent AI persona modeled on real demographic and behavioral data, so it thinks, decides, and answers like a specific real person rather than a generic chatbot. Everything else is a composition of them — many grouped into a synthetic audience for research, one standing in as a synthetic user for product and UX validation, or one grounded in a real person as a digital twin of a customer. You build the individual once and reuse it across audiences, users, and customers.
What is a synthetic audience?
A synthetic audience is a group of synthetic individuals: AI personas modeled on real-world data so they think and respond like real people. iMario uses them to test ideas, reach any audience, and de-risk decisions in hours.
How accurate are synthetic audiences?
On named public surveys, iMario's synthetic audiences match the real answer distribution to about 89% (raw 1-TV) on the hardest leave-one-out questions, rising to 90%+ when anchored to a recent survey. That is close to the ceiling even a fresh re-run of a real survey reaches.
What's the difference between a synthetic audience, synthetic users, and a digital twin?
They are the same synthetic individuals in different roles: grouped into a synthetic audience for market research, standing in as synthetic users for product and UX validation, or acting as a digital twin of a specific person or customer.
How do you measure accuracy?
For every question we compare the synthetic answer distribution to the real survey's, option by option. We report two standard metrics: raw 1-TV (the field's strict standard, the same number Artificial Societies reports) and 1-MAE (the metric Electric Twin reports).
What data is the benchmark run on?
Named public national surveys anyone can download: Pew Global Attitudes and ANES (United States), CGSS (China), and the Stack Overflow Developer Survey (global developers). We do not test on private data.
Can I reproduce these numbers?
Yes. The personas, every answer, the scoring code, and the source data are published on GitHub. Download them, re-run the test, and check our work. Found a flaw? Tell us and we will fix it in the open.
How does iMario compare to other synthetic audience tools (Artificial Societies, Electric Twin, GWI)?
On the same 1-TV distribution-accuracy yardstick, iMario reaches about 89% on named public data anyone can download and re-run, versus 86% published by Artificial Societies. Unlike most tools, every persona, answer, and line of scoring code is open for inspection.
Is ~89% accuracy actually good?
Very. Even real surveys are not a perfect gold standard: rerun the same survey on a fresh sample and the distributions agree only about 93% of the time, so ~89% lands right at the noise floor of real fieldwork. At the individual level, people agree with themselves only about 81% of the time when re-asked two weeks later, so even a real person is not 100% consistent.
What can you use a synthetic audience for?
Message testing, concept and pricing tests, survey pre-testing, sizing hard-to-reach and international audiences, and product or UX validation, in hours instead of weeks.
Can synthetic audiences replace real surveys?
They let you test and decide in hours instead of weeks, across audiences you often cannot recruit. For the highest-stakes calls you can still validate against a small real sample whenever you want.
Test any idea on an audience you can trust.
The fine print
Disclosures & limitations
- Every question is scored blind: the model never sees that population's own real answer, or a same-year answer to the same question.
- The ~93% human-reproduction ceiling is estimated from Pew 2024→2025 same-question overlap and applied pool-wide.
- Vendor figures (Artificial Societies, Electric Twin) are each company's own published number, on their own non-public data.
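A ceiling of the kind described in the second disclosure could be estimated roughly as below. This is a sketch under stated assumptions, not the published method: it assumes the ceiling is the mean raw 1-TV between two survey waves over the questions they share, and the wave dictionaries, question ids, and function names are hypothetical toy values.

```python
from typing import Dict, Sequence


def one_minus_tv(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed definition: 1 - 0.5 * sum(|p_i - q_i|)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of options")
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def reproduction_ceiling(wave_a: Dict[str, Sequence[float]],
                         wave_b: Dict[str, Sequence[float]]) -> float:
    """Mean 1-TV over questions asked in both waves (assumed aggregation)."""
    shared = sorted(wave_a.keys() & wave_b.keys())
    if not shared:
        raise ValueError("the two waves share no questions")
    scores = [one_minus_tv(wave_a[q], wave_b[q]) for q in shared]
    return sum(scores) / len(scores)


def share_of_ceiling(score: float, ceiling: float) -> float:
    """How much of the human ceiling a synthetic score reaches."""
    return score / ceiling


# Toy waves (made-up shares, not Pew data); only q1 and q2 overlap.
wave_2024 = {"q1": [0.50, 0.50], "q2": [0.20, 0.30, 0.50], "q3": [0.70, 0.30]}
wave_2025 = {"q1": [0.45, 0.55], "q2": [0.25, 0.30, 0.45], "q4": [0.10, 0.90]}

ceiling = reproduction_ceiling(wave_2024, wave_2025)
print(round(ceiling, 4))                        # 0.95
print(round(share_of_ceiling(0.89, 0.93), 3))   # 0.957
```

The last line uses the rounded headline figures from this page (0.89 and 0.93); the unrounded inputs are not given here, so the ratio is only indicative.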