The Five Layers Behind an iMario Research Report

There are two main ways to write a research report.
| Framework | Starting point | Evidence chain | Output |
|---|---|---|---|
| Academic (Braun & Clarke, grounded theory) | Bottom-up from raw transcripts | Every claim traces to coded data | Themes plus descriptive narrative |
| Consulting (Minto's Pyramid Principle) | Top-down from a hypothesis | Selective examples support the claim | Governing thought plus 3 to 5 actionable findings |
The old trade-off between the two was time. Academic rigor took weeks because a human had to code every transcript line by line and induce themes by hand. With LLMs handling the coding and synthesis in minutes, that bottleneck is gone. What remains is a choice of shape: faithful description, or actionable claim.
iMario picks both. Academic methodology at the bottom of the pipeline so every claim is grounded in coded data. Consulting structure at the top so the report is something you can act on the same afternoon you read it. We build our report engine around a five-layer pipeline with a reference graph running through it. Codes belong to categories. Categories roll up into themes. Themes back specific findings. Pull on any thread in the final report, and you land on a verbatim respondent quote.
Here is what each layer does.
Layer 0: Atomic codes
When an interview finishes, every response is segmented into atomic codes. A code carries a short label, the reasoning behind the tag, a sentiment value, the exact quote span it came from, the respondent ID, and the question index. A 30-respondent study with 10 qualitative questions typically produces 2,000 to 4,000 codes. Each one is a self-contained unit of meaning, attached to a real moment in the transcript.
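To make that concrete, here is a minimal sketch of what a single atomic code might look like as a Python dataclass. The field names are illustrative assumptions, not iMario's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicCode:
    # Illustrative fields only; the production schema may differ.
    code_id: str          # stable ID that categories and themes reference later
    label: str            # short tag, e.g. "price-value gap"
    reasoning: str        # why this span earned this tag
    sentiment: str        # "positive" | "negative" | "neutral" | "mixed"
    quote_span: str       # the exact respondent words the code came from
    respondent_id: str    # who said it
    question_index: int   # which of the study's questions it answers
```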
Layer 1: Per-question categorization
For each question, we group its codes into 5 to 10 categories. Each category has a short definition that distinguishes it from its neighbors, an attitude cross-checked against the sentiments of the supporting codes, and an explicit list of code_ids drawn from the question's actual code pool. If any ID does not exist in that pool, we reject the output and retry. Categories also expose 3 to 5 representative codes, picked for clarity, not for being first in the array.
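The reject-and-retry check is simple to express. Here is a rough sketch, assuming the categorizer returns plain dicts; the function names and the retry loop are hypothetical, not the production code.

```python
def hallucinated_code_ids(categories: list[dict], code_pool: set[str]) -> list[str]:
    """Return every cited code_id that does not exist in the question's code pool."""
    return [cid for cat in categories for cid in cat["code_ids"] if cid not in code_pool]

def categorize_with_retry(question, codes, call_categorizer_llm, max_retries: int = 3):
    """Reject any output that cites codes outside the pool, then retry."""
    pool = {c.code_id for c in codes}
    for _ in range(max_retries):
        categories = call_categorizer_llm(question, codes)   # hypothetical LLM call
        if not hallucinated_code_ids(categories, pool):
            return categories
    raise ValueError("categorizer kept citing code_ids outside the question's pool")
```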
The same call also writes the per-question narrative as three structured fields: majority positions, minority positions, and outliers. Each outlier carries its own quote span and respondent ID. A study that buries its dissenters is a study that surprises you in production, so outliers get their own slot in the schema rather than living inside a freeform paragraph the LLM can compress.
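As a sketch of why the structured slot matters, the narrative block could look like this; the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Outlier:
    summary: str        # what makes this position unusual
    quote_span: str     # verbatim evidence, never a paraphrase
    respondent_id: str

@dataclass
class QuestionNarrative:
    majority_positions: list[str]
    minority_positions: list[str]
    outliers: list[Outlier]   # a dedicated slot, so dissenters cannot be compressed away
```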
Layer 2: Cross-question themes
Themes are patterns that span at least two questions. Instead of feeding the LLM 4,000 raw codes (the old pipeline tried this and choked past the 500K token mark), we feed it roughly 60 categories. The prompt stays under 50K tokens. The LLM picks category_ids. The graph then derives code_ids, prevalence, sentiment distribution, and cohort breakdowns through deterministic post-processing. The LLM does not write those numbers. It cannot inflate them.
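A sketch of that deterministic post-processing, assuming simple dict lookups for the graph; the data shapes are illustrative.

```python
from collections import Counter

def derive_theme_stats(category_ids, categories_by_id, codes_by_id, total_respondents):
    """Expand LLM-picked category_ids into theme statistics via the reference graph.
    The LLM never writes these numbers; they are computed here."""
    code_ids = sorted({cid for cat_id in category_ids
                       for cid in categories_by_id[cat_id]["code_ids"]})
    respondents = {codes_by_id[cid]["respondent_id"] for cid in code_ids}
    sentiments = Counter(codes_by_id[cid]["sentiment"] for cid in code_ids)
    return {
        "code_ids": code_ids,
        "prevalence": len(respondents),          # distinct respondents, not raw code count
        "prevalence_share": len(respondents) / total_respondents,
        "sentiment_distribution": dict(sentiments),
    }
```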
Layer 3: Reflexive review
Before any narrative gets written, we compute seven metrics on the candidate themes: code coverage, respondent coverage, the highest pairwise Jaccard overlap between themes, the Gini coefficient of theme prevalence, the share of themes that span more than one question, the count of single-question themes that should be demoted, and the count of orphan codes that fell outside every theme. If any threshold trips (for example, Jaccard above 0.5, code coverage below 0.85, or Gini above 0.65), we send the themes back through a critique LLM call with the specific failure as context. Merges, splits, and demotions get logged as a change list. Orphan codes are not dropped. They surface in the report as edge signals worth watching, because a finding only earns its weight when the cases that did not fit are visible. This is the engineering equivalent of Braun & Clarke's Phase 4. Themes have to survive scrutiny before they reach the report.
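Two of those metrics, the pairwise Jaccard overlap and the Gini coefficient, are easy to show in miniature. A sketch, assuming each theme is represented by its set of supporting code_ids:

```python
def max_pairwise_jaccard(theme_code_sets: list[set[str]]) -> float:
    """Highest Jaccard overlap between any two themes' supporting codes."""
    worst = 0.0
    for i in range(len(theme_code_sets)):
        for j in range(i + 1, len(theme_code_sets)):
            union = theme_code_sets[i] | theme_code_sets[j]
            if union:
                overlap = len(theme_code_sets[i] & theme_code_sets[j]) / len(union)
                worst = max(worst, overlap)
    return worst

def gini(prevalences: list[int]) -> float:
    """Gini coefficient of theme prevalence; values near 1 mean one theme dominates."""
    xs = sorted(prevalences)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```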
Layer 3.5: Findings and governing thought
This is where the methodology shifts from academic to consulting. A single LLM call produces two outputs at once: a 2 to 4 sentence governing thought that captures the overarching narrative, and 3 to 7 structured findings. Each finding includes the statement, the implication (so what), the recommendation (do what), a priority rank, and at least one theme_id from the layer above. The LLM is instructed not to fabricate findings to cover every theme. Coverage is a property we measure, not a constraint we force on the model.
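A sketch of the finding schema, with assumed field names:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    statement: str        # the claim itself
    implication: str      # so what
    recommendation: str   # do what
    priority: int         # 1 = act on this first
    theme_ids: list[str]  # at least one Layer 2 theme must back the finding

@dataclass
class ReportHead:
    governing_thought: str      # 2 to 4 sentences
    findings: list[Finding]     # 3 to 7, ranked by priority
```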
Example: Walking through a pricing study
A founder runs a 30-person study before launching a $49/month tier. Ten questions. Roughly 3,000 atomic codes come out the other end.
On "What is your initial reaction to the $49 price?", one respondent says: "$49 feels steep for what I'm seeing on the landing page. If I could see how it saves me 5 hours a week, maybe." Three codes get extracted: a price-value gap (negative), an openness if ROI is clear (mixed), and a need for concrete time savings (neutral).
The roughly 80 codes from this question cluster into 6 categories. "Value justification gap" holds 24 codes from 18 respondents; "Open if ROI is provable" holds 11 codes from 9 respondents.
Layer 2 looks at the ~60 categories across all 10 questions. A theme surfaces: "Value justification as a purchase barrier," spanning the pricing question, the landing page question, and the competitor comparison. Reflexive review catches a 0.62 Jaccard overlap with a separate "ROI clarity" candidate, fires a critique call, merges the two. Final prevalence: 22 of 30 respondents.
The governing thought reads: "The pricing message has a credibility problem before it has a price problem." The top finding records that 22 of 30 respondents would accept $49 if the ROI math sat above the fold, while 8 bounced on price alone. Implication: a copywriting problem, not a pricing one. Recommendation: A/B test a "saves you 5 hours a week" headline against the current $49 anchor before discounting. Priority 1.
That last paragraph is what the founder reads first. Everything below it is what makes it trustworthy.
The reference graph, end to end
Every finding cites themes. Every theme cites categories. Every category cites codes. Every code cites a quote span and a respondent. Nothing in the final report exists without a chain back to the data. When the pricing report above says "22 of 30 would accept $49 with clearer ROI framing," that number is not a paraphrase. It came from a graph traversal over the supporting codes, deduplicated by respondent ID.
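Getting a number like that is a traversal, not a prompt. A rough sketch, assuming dict-shaped graph nodes:

```python
def respondents_behind(finding, themes_by_id, categories_by_id, codes_by_id) -> set[str]:
    """Walk finding -> themes -> categories -> codes and collect distinct respondents.
    Headline counts like '22 of 30' come from a traversal like this, not from prose."""
    respondents = set()
    for theme_id in finding["theme_ids"]:
        for cat_id in themes_by_id[theme_id]["category_ids"]:
            for code_id in categories_by_id[cat_id]["code_ids"]:
                respondents.add(codes_by_id[code_id]["respondent_id"])
    return respondents
```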
A side benefit we did not plan for
Because every output is structured and every reference is a real ID, the entire report is auditable by software, not just by human readers. A QA script can verify that every code_id resolves to a real code, that every theme's prevalence matches the count of distinct respondents in its supporting codes, and that every finding cites a theme that exists. None of those checks would hold up if a single LLM wrote the whole report in one shot.
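A minimal version of such a QA pass might look like this; the report shape is assumed, not iMario's actual format.

```python
def audit_report(findings, themes_by_id, codes_by_id) -> list[str]:
    """Run machine checks over the reference graph and return human-readable failures."""
    errors = []
    for theme_id, theme in themes_by_id.items():
        unknown = [cid for cid in theme["code_ids"] if cid not in codes_by_id]
        if unknown:
            errors.append(f"{theme_id} cites unknown code_ids: {unknown}")
        distinct = {codes_by_id[cid]["respondent_id"]
                    for cid in theme["code_ids"] if cid in codes_by_id}
        if theme["prevalence"] != len(distinct):
            errors.append(f"{theme_id}: prevalence {theme['prevalence']} != {len(distinct)} respondents")
    for finding in findings:
        missing = [tid for tid in finding["theme_ids"] if tid not in themes_by_id]
        if missing:
            errors.append(f"finding cites missing themes: {missing}")
    return errors
```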
A research report is only as trustworthy as the path from raw transcript to final claim. We made that path five layers deep, traceable, and machine-checkable. The methodology dates back to 2006. The plumbing did not exist before now.