The Five Layers Behind an iMario Research Report

There are two main ways to write a research report.
| Framework | Starting point | Evidence chain | Output |
|---|---|---|---|
| Academic (Braun & Clarke, grounded theory) | Bottom-up from raw transcripts | Every claim traces to coded data | Themes plus descriptive narrative |
| Consulting (Minto's Pyramid Principle) | Top-down from a hypothesis | Selective examples support the claim | Governing thought plus 3 to 5 actionable findings |
The old trade-off between the two was time. Academic rigor took weeks because a human had to code every transcript line by line and induce themes by hand. With LLMs handling the coding and synthesis in minutes, that bottleneck is gone. What remains is a choice of shape: faithful description, or actionable claim.
iMario picks both. Academic methodology at the bottom of the pipeline so every claim is grounded in coded data. Consulting structure at the top so the report is something you can act on the same afternoon you read it. We build our report engine around a five-layer pipeline with a reference graph running through it. Codes belong to categories. Categories roll up into themes. Themes back specific findings. Pull on any thread in the final report, and you land on a verbatim respondent quote.
Here is what each layer does.
Layer 0: Atomic codes
When an interview finishes, every response is segmented into atomic codes. A code carries a short label, the reasoning behind the tag, a sentiment value, the exact quote span it came from, the respondent ID, and the question index. A 30-respondent study with 10 qualitative questions typically produces 2,000 to 4,000 codes. Each one is a self-contained unit of meaning, attached to a real moment in the transcript.
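To make that concrete, here is a minimal sketch of what a single atomic code might look like as a Python dataclass. The field names are illustrative assumptions, not iMario's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicCode:
    # Illustrative fields only; the production schema may differ.
    code_id: str          # stable ID that categories and themes reference later
    label: str            # short tag, e.g. "price-value gap"
    reasoning: str        # why this span earned this tag
    sentiment: str        # "positive" | "negative" | "neutral" | "mixed"
    quote_span: str       # the exact respondent words the code came from
    respondent_id: str    # who said it
    question_index: int   # which of the study's questions it answers
```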
Layer 1: Per-question categorization
For each question, we group its codes into 5 to 10 categories. Each category has a short definition that distinguishes it from its neighbors, an attitude cross-checked against the sentiments of the supporting codes, and an explicit list of code_ids drawn from the question's actual code pool. If any ID does not exist in that pool, we reject the output and retry. Categories also expose 3 to 5 representative codes, picked for clarity, not for being first in the array.
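The reject-and-retry check is simple to express. Here is a rough sketch, assuming the categorizer returns plain dicts; the function names and the retry loop are hypothetical, not the production code.

```python
def hallucinated_code_ids(categories: list[dict], code_pool: set[str]) -> list[str]:
    """Return every cited code_id that does not exist in the question's code pool."""
    return [cid for cat in categories for cid in cat["code_ids"] if cid not in code_pool]

def categorize_with_retry(question, codes, call_categorizer_llm, max_retries: int = 3):
    """Reject any output that cites codes outside the pool, then retry."""
    pool = {c.code_id for c in codes}
    for _ in range(max_retries):
        categories = call_categorizer_llm(question, codes)   # hypothetical LLM call
        if not hallucinated_code_ids(categories, pool):
            return categories
    raise ValueError("categorizer kept citing code_ids outside the question's pool")
```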
The same call also writes the per-question narrative as three structured fields: majority positions, minority positions, and outliers. Each outlier carries its own quote span and respondent ID. A study that buries its dissenters is a study that surprises you in production, so outliers get their own slot in the schema rather than living inside a freeform paragraph the LLM can compress.
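As a sketch of why the structured slot matters, the narrative block could look like this; the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Outlier:
    summary: str        # what makes this position unusual
    quote_span: str     # verbatim evidence, never a paraphrase
    respondent_id: str

@dataclass
class QuestionNarrative:
    majority_positions: list[str]
    minority_positions: list[str]
    outliers: list[Outlier]   # a dedicated slot, so dissenters cannot be compressed away
```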
Layer 2: Cross-question themes
Themes are patterns that span at least two questions. Instead of feeding the LLM 4,000 raw codes (the old pipeline tried this and choked past the 500K token mark), we feed it roughly 60 categories. The prompt stays under 50K tokens. The LLM picks category_ids. The graph then derives code_ids, prevalence, sentiment distribution, and cohort breakdowns through deterministic post-processing. The LLM does not write those numbers. It cannot inflate them.
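A sketch of that deterministic post-processing, assuming simple dict lookups for the graph; the data shapes are illustrative.

```python
from collections import Counter

def derive_theme_stats(category_ids, categories_by_id, codes_by_id, total_respondents):
    """Expand LLM-picked category_ids into theme statistics via the reference graph.
    The LLM never writes these numbers; they are computed here."""
    code_ids = sorted({cid for cat_id in category_ids
                       for cid in categories_by_id[cat_id]["code_ids"]})
    respondents = {codes_by_id[cid]["respondent_id"] for cid in code_ids}
    sentiments = Counter(codes_by_id[cid]["sentiment"] for cid in code_ids)
    return {
        "code_ids": code_ids,
        "prevalence": len(respondents),          # distinct respondents, not raw code count
        "prevalence_share": len(respondents) / total_respondents,
        "sentiment_distribution": dict(sentiments),
    }
```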
Layer 3: Reflexive review
Before any narrative gets written, we compute seven metrics on the candidate themes: code coverage, respondent coverage, the highest pairwise Jaccard overlap between themes, the Gini coefficient of theme prevalence, the share of themes that span more than one question, the count of single-question themes that should be demoted, and the count of orphan codes that fell outside every theme. If any threshold trips (for example, Jaccard above 0.5, code coverage below 0.85, or Gini above 0.65), we send the themes back through a critique LLM call with the specific failure as context. Merges, splits, and demotions get logged as a change list. Orphan codes are not dropped. They surface in the report as edge signals worth watching, because a finding only earns its weight when the cases that did not fit are visible. This is the engineering equivalent of Braun & Clarke's Phase 4. Themes have to survive scrutiny before they reach the report.
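Two of those metrics, the pairwise Jaccard overlap and the Gini coefficient, are easy to show in miniature. A sketch, assuming each theme is represented by its set of supporting code_ids:

```python
def max_pairwise_jaccard(theme_code_sets: list[set[str]]) -> float:
    """Highest Jaccard overlap between any two themes' supporting codes."""
    worst = 0.0
    for i in range(len(theme_code_sets)):
        for j in range(i + 1, len(theme_code_sets)):
            union = theme_code_sets[i] | theme_code_sets[j]
            if union:
                overlap = len(theme_code_sets[i] & theme_code_sets[j]) / len(union)
                worst = max(worst, overlap)
    return worst

def gini(prevalences: list[int]) -> float:
    """Gini coefficient of theme prevalence; values near 1 mean one theme dominates."""
    xs = sorted(prevalences)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```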
Layer 3.5: Findings and governing thought
This is where the methodology shifts from academic to consulting. A single LLM call produces two outputs at once: a 2 to 4 sentence governing thought that captures the overarching narrative, and 3 to 7 structured findings. Each finding includes the statement, the implication (so what), the recommendation (do what), a priority rank, and at least one theme_id from the layer above. The LLM is instructed not to fabricate findings to cover every theme. Coverage is a property we measure, not a constraint we force on the model.
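A sketch of the finding schema, with assumed field names:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    statement: str        # the claim itself
    implication: str      # so what
    recommendation: str   # do what
    priority: int         # 1 = act on this first
    theme_ids: list[str]  # at least one Layer 2 theme must back the finding

@dataclass
class ReportHead:
    governing_thought: str      # 2 to 4 sentences
    findings: list[Finding]     # 3 to 7, ranked by priority
```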
Example: Walking through a pricing study
A founder runs a 30-person study before launching a $49/month tier. Ten questions. Roughly 3,000 atomic codes come out the other end.
On "What is your initial reaction to the $49 price?", one respondent says: "$49 feels steep for what I'm seeing on the landing page. If I could see how it saves me 5 hours a week, maybe." Three codes get extracted: a price-value gap (negative), an openness if ROI is clear (mixed), and a need for concrete time savings (neutral).
The roughly 80 codes from this question cluster into 6 categories. "Value justification gap" holds 24 codes from 18 respondents; "Open if ROI is provable" holds 11 codes from 9 respondents.
Layer 2 looks at the ~60 categories across all 10 questions. A theme surfaces: "Value justification as a purchase barrier," spanning the pricing question, the landing page question, and the competitor comparison. Reflexive review catches a 0.62 Jaccard overlap with a separate "ROI clarity" candidate, fires a critique call, merges the two. Final prevalence: 22 of 30 respondents.
The governing thought reads: "The pricing message has a credibility problem before it has a price problem." The top finding records that 22 of 30 respondents would accept $49 if the ROI math sat above the fold, while 8 bounced on price alone. Implication: a copywriting problem, not a pricing one. Recommendation: A/B test a "saves you 5 hours a week" headline against the current $49 anchor before discounting. Priority 1.
That last paragraph is what the founder reads first. Everything below it is what makes it trustworthy.
The reference graph, end to end
Every finding cites themes. Every theme cites categories. Every category cites codes. Every code cites a quote span and a respondent. Nothing in the final report exists without a chain back to the data. When the pricing report above says "22 of 30 would accept $49 with clearer ROI framing," that number is not a paraphrase. It came from a graph traversal over the supporting codes, deduplicated by respondent ID.
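Getting a number like that is a traversal, not a prompt. A rough sketch, assuming dict-shaped graph nodes:

```python
def respondents_behind(finding, themes_by_id, categories_by_id, codes_by_id) -> set[str]:
    """Walk finding -> themes -> categories -> codes and collect distinct respondents.
    Headline counts like '22 of 30' come from a traversal like this, not from prose."""
    respondents = set()
    for theme_id in finding["theme_ids"]:
        for cat_id in themes_by_id[theme_id]["category_ids"]:
            for code_id in categories_by_id[cat_id]["code_ids"]:
                respondents.add(codes_by_id[code_id]["respondent_id"])
    return respondents
```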
A side benefit we did not plan for
Because every output is structured and every reference is a real ID, the entire report is auditable by software, not just by human readers. A QA script can verify that every code_id resolves to a real code, that every theme's prevalence matches the count of distinct respondents in its supporting codes, and that every finding cites a theme that exists. None of those checks would hold up if a single LLM wrote the whole report in one shot.
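A minimal version of such a QA pass might look like this; the report shape is assumed, not iMario's actual format.

```python
def audit_report(findings, themes_by_id, codes_by_id) -> list[str]:
    """Run machine checks over the reference graph and return human-readable failures."""
    errors = []
    for theme_id, theme in themes_by_id.items():
        unknown = [cid for cid in theme["code_ids"] if cid not in codes_by_id]
        if unknown:
            errors.append(f"{theme_id} cites unknown code_ids: {unknown}")
        distinct = {codes_by_id[cid]["respondent_id"]
                    for cid in theme["code_ids"] if cid in codes_by_id}
        if theme["prevalence"] != len(distinct):
            errors.append(f"{theme_id}: prevalence {theme['prevalence']} != {len(distinct)} respondents")
    for finding in findings:
        missing = [tid for tid in finding["theme_ids"] if tid not in themes_by_id]
        if missing:
            errors.append(f"finding cites missing themes: {missing}")
    return errors
```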
A research report is only as trustworthy as the path from raw transcript to final claim. We made that path five layers deep, traceable, and machine-checkable. The methodology dates back to 2006. The plumbing did not exist before now.