iMario Accuracy Benchmark · v2

iMario's Synthetic Audiences: Proven as Accurate as Real Research.

Built on a decade of customer research and human insights, iMario pairs a real grasp of how people actually think and behave with careful engineering, anchored throughout in real demographic and sociological data, to simulate persistent synthetic individuals.

For anyone using them as a synthetic audience for market research, as synthetic users for consumer insight, or as a digital twin of themselves, one question matters: do they act and answer like real people?

That accuracy is what our customers care about most, and where iMario puts the most effort.

Instead of easy consumer-type questions, we validate on the hardest ones: politics, values, and social attitudes. That means 900+ questions with real answers from ~47,000 people across 11 populations. iMario reaches 89% consistency with the real distributions, and 90%+ when it can reference the most trusted public surveys.

Proven in the open

Named public surveys: Pew Global, ANES (US), CGSS (China), Stack Overflow (Global Developers).
Tested blind: when scoring a question, the model never sees the real population’s answer to it.
Both industry-standard metrics, 1-TV and 1-MAE, reported side by side so the numbers compare directly across the field.

This benchmark is open so you can reproduce the results, and we would love your help making it better. Bring new datasets, populations, or critiques, and help build a broader, shared standard for synthetic-audience accuracy. Issues and pull requests are welcome on GitHub: iMario-benchmark.

This benchmark at a glance

88.9%

accuracy vs real surveys

vs ~93% on a real human rerun

0.0%

response consistency

One persona, 12 rephrasings

individual responses

every person × every question

10 + 1

populations benchmarked

10 countries + global developers

About iMario

What iMario is

Every important decision a company makes should start with one question: what do real people actually think?

For a century that answer has been slow, expensive, and out of reach for all, including the largest brands. iMario is trying to change that. It simulates synthetic audiences calibrated against real-world human data, so any team can run research, test concepts, and more with millions of people, and trust what comes back in hours. Instant. Scalable. Verifiable.

Under the hood, iMario turns real-world people into interactive, memorable, orchestrable, and reusable Synthetic Individuals. The same individuals flex across the work: organized into synthetic audiences for customer research, standing in for synthetic users in product validation, playing the synthetic customer in sales rehearsal, becoming a LinkedIn or personality digital twin, or serving as the personality and human layer inside an AI agent. It is the simulation and on-demand recall of a human point of view.

Where iMario stands

Leading, and fully open.

Strict scale (1 − TV)

How far the synthetic distribution sits from the real one

System	1 − TV
iMario	88.9%
Artificial Societies	86.0%
Claude Opus 4.8	64.0%
GPT-5.4	62.0%
Gemini 3.1 Pro	61.0%

Lenient scale (1 − MAE)

The same gap averaged over options, so it always reads higher

System	1 − MAE
iMario	94.1%
Electric Twin	95.5%
Claude Opus 4.8	82.2%
GPT-5.4	81.2%
Gemini 3.1 Pro	80.7%

Note: iMario is measured on the named public data, and each vendor figure is that company's own published number. The naive LLM-persona bars are indicative references run on the same questions as iMario.

Vendor	Headline metric	Named data?	Both metrics?	Reproducible?
iMario	1-TV & 1-MAE	●	●	●
Artificial Societies	1-TV	○	○	○
Electric Twin	1-MAE	○	○	○
SyntheticUsers	subjective rubric (n≈8)	○	○	○
Aaru	no public benchmark	○	○	○

Methodology

How we score accuracy

raw 1-TV · Strict standard

1 − 1/2 Σ_i |p_i − q_i|

How far the synthetic answer split sits from the real one, where p_i is the real share of answer option i and q_i is the synthetic share. A perfect match is 100%, and nothing can pad the score. This is the strict, honest read, and the one we lead with.

1-MAE · Lenient standard

1 − 1/n Σ_i |p_i − q_i|

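The same gap, but divided by the number of answer options n instead of by 2, so the score drifts upward on its own as a question adds more choices. It never reads lower than the strict number, and it reads higher whenever a question has more than two options and the match is imperfect.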

Two metrics, side by side.

Both compare the synthetic answer split to the real one. The difference is how easy each is to flatter.

1-TV measures the true gap, and no amount of framing can make it look better than it is. The strict standard.
1-MAE spreads that same gap across every answer option, so the score quietly climbs on its own and reads higher for the same data whenever a question has more than two options. The flattering one.

When a vendor only ever headlines a 1-MAE and never shows the strict 1-TV, the flattering math is doing the talking. We lead with 1-TV, publish 1-MAE beside it for honest comparison, and never quote one against the other.
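The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's published scoring code: the function names and the example splits are invented for the sketch, and each split is assumed to be a list of shares that sums to 1.

```python
from typing import Sequence


def one_minus_tv(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Strict score: 1 - (1/2) * sum of absolute gaps between the two splits."""
    if len(real) != len(synthetic):
        raise ValueError("both splits must cover the same answer options")
    return 1.0 - 0.5 * sum(abs(p - q) for p, q in zip(real, synthetic))


def one_minus_mae(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Lenient score: 1 - (1/n) * sum of absolute gaps, n = number of options."""
    if len(real) != len(synthetic):
        raise ValueError("both splits must cover the same answer options")
    return 1.0 - sum(abs(p - q) for p, q in zip(real, synthetic)) / len(real)


# Hypothetical splits, invented for illustration (not benchmark data).
two_real, two_syn = [0.60, 0.40], [0.50, 0.50]
four_real, four_syn = [0.40, 0.30, 0.20, 0.10], [0.30, 0.30, 0.20, 0.20]

print(f"{one_minus_tv(two_real, two_syn):.3f}")    # 0.900
print(f"{one_minus_mae(two_real, two_syn):.3f}")   # 0.900
print(f"{one_minus_tv(four_real, four_syn):.3f}")  # 0.900
print(f"{one_minus_mae(four_real, four_syn):.3f}") # 0.950
```

Both examples carry the same total gap of 0.20. With two options the two scores coincide, and with four options the lenient score rises to 95% while the strict score stays at 90%.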

Why this is a harder number than it looks

We test the divisive questions, not the easy ones. Consumer and brand questions are mostly high-consensus, so almost anything looks accurate. We built this from the hard topics instead: politics, values, religion, social fairness, and institutional trust. Every population is scored on its full pool, never a hand-picked easy subset.
We report the honest, no-peeking number. Every question is answered blind, with no access to that population's own real answer. Let the model reference a recent real survey on the topic, the setting many vendors quietly report on, and the same engine reads 90%+. What you see here is the floor.
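One way to picture the blind setting is as a filter over what the model may see while a question is being scored. The sketch below is an assumption about how such a filter could look, not iMario's implementation: `SurveyRecord`, `blind_context`, the field names, and the sample records are all hypothetical. It drops the target population's own answer to the question and any same-year answer to the same question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyRecord:
    """One real answer split. All field names here are hypothetical."""
    question_id: str
    population: str
    year: int
    shares: tuple[float, ...]


def blind_context(target: SurveyRecord, records: list[SurveyRecord]) -> list[SurveyRecord]:
    """Return only the records that do not leak the answer being scored."""
    allowed = []
    for rec in records:
        same_question = rec.question_id == target.question_id
        own_answer = same_question and rec.population == target.population
        same_year_answer = same_question and rec.year == target.year
        if own_answer or same_year_answer:
            continue  # seeing this record would break the blind test
        allowed.append(rec)
    return allowed


# Invented records for illustration only.
records = [
    SurveyRecord("q1", "US", 2024, (0.26, 0.74)),  # the target itself
    SurveyRecord("q1", "UK", 2024, (0.30, 0.70)),  # same question, same year
    SurveyRecord("q1", "UK", 2019, (0.35, 0.65)),  # same question, older, other population
    SurveyRecord("q2", "US", 2024, (0.55, 0.45)),  # different question
]
visible = blind_context(records[0], records)
print([(r.question_id, r.population, r.year) for r in visible])
# [('q1', 'UK', 2019), ('q2', 'US', 2024)]
```

Whether an older answer from another population should stay visible is a design choice this sketch makes for illustration; a stricter filter would drop it too.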

The honest baseline

Real research isn't a perfect gold standard either

Research based on real people is not the truth either. Ask the same person again next week and the answer moves; panels are gamed by professional respondents and bots; people shade sensitive answers; and most studies reach only a few people who rarely match the population.

Real fieldwork is also noisy, biased, and limited.

So the honest bar is not 100%. Rerun the same survey on a fresh real sample and the two sets of results match only about 93%. iMario reaches 89% on the strict scale, about 96% of that human ceiling (89 / 93), landing in the range where two real samples would fall.

Answers drift over time

Ask one person the same question twice and they agree only ~81% of the time. Individuals wobble far more than the population, where most of those shifts cancel out, leaving the ~93% above.
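A toy simulation shows why individual wobble and population stability can coexist. It assumes a simple model that is not taken from the benchmark: every person holds a stable yes/no attitude and, in each wave, reports the opposite answer with a fixed probability. The function name, the 10% noise rate, and the 60% yes share are all invented for the sketch, and it is not meant to reproduce the ~93% figure.

```python
import random


def simulate_two_waves(n_people: int, yes_share: float, noise: float, seed: int):
    """Return (individual agreement rate, population-level 1 - TV) for two waves."""
    rng = random.Random(seed)
    attitudes = [rng.random() < yes_share for _ in range(n_people)]

    def report(attitude: bool) -> bool:
        # With probability `noise` the person gives the opposite answer.
        return (not attitude) if rng.random() < noise else attitude

    wave1 = [report(a) for a in attitudes]
    wave2 = [report(a) for a in attitudes]

    agreement = sum(a == b for a, b in zip(wave1, wave2)) / n_people
    share1 = sum(wave1) / n_people
    share2 = sum(wave2) / n_people
    # For a two-option question, 1 - TV reduces to 1 - |share1 - share2|.
    population_match = 1.0 - abs(share1 - share2)
    return agreement, population_match


agreement, population_match = simulate_two_waves(20_000, 0.60, 0.10, seed=7)
print(f"individual agreement: {agreement:.3f}")
print(f"population match:     {population_match:.3f}")
```

Under this model the expected individual agreement is 0.9² + 0.1² = 0.82, while the population split barely moves, because flips in one direction are offset by flips in the other.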

Panels carry bias and fraud

Online panels lean on professional survey-takers, with bots, duplicates, and straightliners farming the incentive. The result is rarely the population the panel claims to represent.

People edit themselves

Social desirability, acquiescence, and satisficing pull stated answers away from real behavior, most of all on the sensitive, divisive questions this benchmark is built from.

Who we tested

Real populations, real composition

United States


Pew ATP 2024 · n≈3,515 respondents · 117 questions

Theme coverage: economy · geopolitics · foreign leaders · democracy · AI

Age: 18-29 19.6% · 30-49 33.5% · 50-64 24.6% · 65+ 22.3%

Gender: Male 48.4% · Female 50.8% · Other 0.9%

Education: H.S. graduate or less 35.9% · Some college 29.7% · College graduate+ 34.4%

Region: South 38.2% · West 23.8% · Midwest 20.5% · Northeast 17.6%

Urbanicity: Metropolitan 86.5% · Non-metropolitan 13.5%

Income tier: Lower income 32.2% · Middle income 50.8% · Upper income 17%

Religion: Christian 64.1% · No religion 29.4% · Other religion 2.2% · Jewish 1.3% · Buddhist 1.1% · Muslim

Each synthetic cohort is stratified to match the survey's real composition. The same cohort answers both the quantitative and the qualitative questions below.
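As a rough illustration of matching a cohort to a survey's composition, the sketch below turns published shares into whole-person quotas with a largest-remainder allocation. The technique and the function name `allocate_quota` are assumptions chosen for the example, not a description of iMario's pipeline, and the sketch handles one dimension at a time, whereas a real cohort would have to match several dimensions jointly.

```python
import math


def allocate_quota(shares: dict[str, float], cohort_size: int) -> dict[str, int]:
    """Largest-remainder allocation of `cohort_size` people across strata."""
    total = sum(shares.values())
    exact = {k: cohort_size * v / total for k, v in shares.items()}
    quota = {k: math.floor(x) for k, x in exact.items()}
    leftover = cohort_size - sum(quota.values())
    # Hand the remaining places to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1
    return quota


# Age shares from the United States panel above.
age_shares = {"18-29": 19.6, "30-49": 33.5, "50-64": 24.6, "65+": 22.3}
print(allocate_quota(age_shares, 1000))
# {'18-29': 196, '30-49': 335, '50-64': 246, '65+': 223}
```

For a small cohort the rounding matters more: with 10 people the same shares give 2, 3, 3, and 2, which still sums to the cohort size.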

Results

How close to the real answers

Quantitative · All populations

Population	Match
United States	91.3%
United Kingdom	90.2%
France	89.8%
Australia	89.8%
Germany	89.3%
India	89.2%
China	88.8%
South Korea	88.5%
Brazil	87.3%
Developers	87.3%
Japan	86.6%

Each bar is the synthetic-vs-real answer-distribution match across that population's full question set.
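The page does not say how the per-population figures roll up into a single headline. As a quick sanity check, and purely as an assumption, the sketch below takes an unweighted mean of the eleven values listed above. A question-weighted or respondent-weighted average could differ.

```python
from statistics import mean

# Per-population match figures as listed above, in percent.
population_scores = {
    "United States": 91.3,
    "United Kingdom": 90.2,
    "France": 89.8,
    "Australia": 89.8,
    "Germany": 89.3,
    "India": 89.2,
    "China": 88.8,
    "South Korea": 88.5,
    "Brazil": 87.3,
    "Developers": 87.3,
    "Japan": 86.6,
}

unweighted = mean(population_scores.values())
spread = max(population_scores.values()) - min(population_scores.values())

print(f"unweighted mean: {unweighted:.1f}%")  # unweighted mean: 88.9%
print(f"best-to-worst spread: {spread:.1f} points")  # best-to-worst spread: 4.7 points
```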

Qualitative · United States

0.0%

raw 1-TV

0.0%

1-MAE

1,000 synthetic US individuals answer these questions in their own words. Comparing the predicted mix of topics against the mix in the real answers gives the 86.2% match above. As a stricter check, an independent coder then reads every answer, real and synthetic, sorts each into one of 16 topics, and re-scores from scratch: 81.8%.
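Once every open-ended answer carries a topic label, the comparison reduces to the same distribution match used for the quantitative questions. The sketch below assumes the coding step has already happened and scores the two topic mixes with 1 − TV. The function names and topic labels are invented for illustration, and the real benchmark uses 16 topics rather than the four here.

```python
from collections import Counter


def topic_shares(labels: list[str], topics: list[str]) -> list[float]:
    """Share of answers assigned to each topic, in the order given by `topics`."""
    counts = Counter(labels)
    return [counts[t] / len(labels) for t in topics]


def topic_mix_match(real_labels: list[str], synthetic_labels: list[str]) -> float:
    """1 - TV between the real and synthetic topic mixes."""
    if not real_labels or not synthetic_labels:
        raise ValueError("both label lists must be non-empty")
    topics = sorted(set(real_labels) | set(synthetic_labels))
    real = topic_shares(real_labels, topics)
    synthetic = topic_shares(synthetic_labels, topics)
    return 1.0 - 0.5 * sum(abs(p - q) for p, q in zip(real, synthetic))


# Hypothetical coded answers, ten per side.
real_labels = ["economy"] * 5 + ["immigration"] * 3 + ["health"] * 2
synthetic_labels = ["economy"] * 6 + ["immigration"] * 3 + ["crime"] * 1

print(f"{topic_mix_match(real_labels, synthetic_labels):.2f}")  # 0.80
```

Topics that appear on only one side still count: "health" and "crime" each contribute their full share to the gap.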

Per-question explorer

Every question, real vs synthetic

Accuracy across all 922 questions

Each dot = one question (raw 1-TV) · Box = middle 50% · Line = median · ◆ = mean
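The box, line, and diamond in the chart are standard summary statistics over the per-question scores. A minimal sketch of how they could be computed from the downloadable results follows. The helper name and the five sample scores are made up, and the inclusive quartile method is an assumption, since the page does not say which method the chart uses.

```python
from statistics import mean, median, quantiles


def box_summary(scores: list[float]) -> dict[str, float]:
    """Quartiles, median, and mean of per-question raw 1-TV scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "q1": q1,                   # lower edge of the box
        "median": median(scores),   # line inside the box
        "q3": q3,                   # upper edge of the box
        "mean": mean(scores),       # diamond marker
    }


# Hypothetical per-question scores, for illustration only.
sample_scores = [0.70, 0.80, 0.85, 0.90, 0.95]
summary = box_summary(sample_scores)
print({name: round(value, 3) for name, value in summary.items()})
# {'q1': 0.8, 'median': 0.85, 'q3': 0.9, 'mean': 0.84}
```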

Drill into individual questions, by population · Top 20 by consistency


Q: When children today in the U.S. grow up, do you think they will be better off financially than their parents, or worse off financially than their parents?

Real survey · iMario synthetic

Σ|pᵢ − qᵢ| computation:

|0.26 − 0.26| = 0.00
|0.74 − 0.74| = 0.00

Σ = 0.001 (the shares shown are rounded to two decimals) → 1 − 1/2 · 0.001 = 0.9995, which rounds to 100.0%

In practice

What this accuracy means for your decisions

Read a message test you can trust

When the synthetic split says 62/38, the real split lands within a few points. You act on the read, not a guess.

Rank concepts in the right order

Pick the winning concept, price, or feature with confidence that the order would hold up with real people.

Reach audiences you cannot recruit

Niche professionals, hard-to-reach segments, and whole countries, modeled and ready in hours instead of weeks.

Decide in hours, not weeks

Run the study, get a population-accurate read, and move, without waiting on fieldwork.

FAQ

Frequently asked questions

What is a synthetic individual?

The synthetic individual is the atom iMario is built from: a persistent AI persona modeled on real demographic and behavioral data, so it thinks, decides, and answers like a specific real person rather than a generic chatbot. Everything else is a composition of them — many grouped into a synthetic audience for research, one standing in as a synthetic user for product and UX validation, or one grounded in a real person as a digital twin of a customer. You build the individual once and reuse it across audiences, users, and customers.

What is a synthetic audience?

A synthetic audience is a group of synthetic individuals: AI personas modeled on real-world data so they think and respond like real people. iMario uses them to test ideas, reach any audience, and de-risk decisions in hours.

How accurate are synthetic audiences?

On named public surveys, iMario's synthetic audiences match the real answer distribution to about 89% (raw 1-TV) on the hardest leave-one-out questions, rising to 90%+ when anchored to a recent survey. That is close to the ceiling even a fresh re-run of a real survey reaches.

What's the difference between a synthetic audience, synthetic users, and a digital twin?

They are the same synthetic individuals in different roles: grouped into a synthetic audience for market research, standing in as synthetic users for product and UX validation, or acting as a digital twin of a specific person or customer.

How do you measure accuracy?

For every question we compare the synthetic answer distribution to the real survey's, option by option. We report two standard metrics: raw 1-TV (the field's strict standard, the same number Artificial Societies reports) and 1-MAE (the metric Electric Twin reports).

What data is the benchmark run on?

Named public national surveys anyone can download: Pew Global Attitudes and ANES (United States), CGSS (China), and the Stack Overflow Developer Survey (global developers). We do not test on private data.

Can I reproduce these numbers?

Yes. The personas, every answer, the scoring code, and the source data are published on GitHub. Download them, re-run the test, and check our work. Found a flaw? Tell us and we will fix it in the open.

How does iMario compare to other synthetic audience tools (Artificial Societies, Electric Twin, GWI)?

On the same 1-TV distribution-accuracy yardstick, iMario reaches about 89% on named public data anyone can download and re-run, versus 86% published by Artificial Societies. Unlike most tools, every persona, answer, and line of scoring code is open for inspection.

Is ~89% accuracy actually good?

Very. Even real surveys are not a perfect gold standard: rerun the same survey on a fresh sample and the two distributions match only about 93%, so ~89% lands right at the noise floor of real fieldwork. At the individual level, people agree with themselves only about 81% of the time when re-asked two weeks later, so even a real person is not 100% consistent.

What can you use a synthetic audience for?

Message testing, concept and pricing tests, survey pre-testing, sizing hard-to-reach and international audiences, and product or UX validation, in hours instead of weeks.

Can synthetic audiences replace real surveys?

They let you test and decide in hours instead of weeks, across audiences you often cannot recruit. For the highest-stakes calls you can still validate against a small real sample whenever you want.

Test any idea on an audience you can trust.


The fine print

Disclosures & limitations

· Every question is scored blind: the model never sees that population's own real answer, or a same-year answer to the same question.
· The ~93% human-reproduction ceiling is estimated from Pew 2024→2025 same-question overlap and applied pool-wide.
· Vendor figures (Artificial Societies, Electric Twin) are each company's own published number, on their own non-public data.