Methodology

How Gather Synthetic actually works.

Synthetic research is only useful if you can trust it. Here's exactly what grounds our AI personas, how we score confidence, and where the limits are.

  • Reddit grounding: 77,243 real comments embedded
  • Subreddits: 69 curated communities
  • Public research: 63 verified findings (ACSI, Pew, Edelman, BLS)
  • Refreshed: June 14, 2026 (weekly auto-refresh)

The problem with most synthetic research

Most synthetic research is an LLM doing character work. A model is told to act like a CMO, given a prompt about the topic, and asked for opinions. The output sounds plausible because LLMs are trained on the internet, which includes plenty of marketing content. But it's not grounded. It's the model's prior, dressed up in a costume.

This works fine for brainstorming. It fails for decisions. Buyers can usually tell. Synthetic respondents agree too readily, lack specific frustrations, never reference real vendors by name, and produce safe generic insights. You learn nothing new.

How we ground personas

Layer 1

Real practitioner voice

We index high-signal Reddit comments from 69 curated subreddits where real CMOs, CTOs, CISOs, CFOs, consumers, and category buyers actually talk. Comments are quality-filtered by upvote score and length, and known bot accounts are removed. PII is stripped before storage.

Layer 2

Verified public research

63 citable data points from ACSI (200K customer surveys), Pew (5K+ adult studies), Edelman Trust Barometer (34K respondents), Bureau of Labor Statistics, and other public sources. Every finding traceable to a source.

Layer 3

Semantic retrieval

At interview time, the study topic and persona role are embedded into a query vector. We pull the 15 most semantically similar real chunks for each persona, filtered by audience type. Each persona reads grounding specific to them.
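The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the `Chunk` type, the `audience` labels, and the two-dimensional toy vectors are all assumptions made for the example, and a real system would use an embedding model and a vector index rather than a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    audience: str        # hypothetical label, e.g. "cmo" or "ciso"
    vector: list[float]  # embedding computed elsewhere (toy values here)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], corpus: list[Chunk], audience: str, k: int = 15):
    """Return up to k (similarity, chunk) pairs for one audience, best first."""
    scored = [(cosine(query, c.vector), c) for c in corpus if c.audience == audience]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy corpus: placeholder text and vectors, not real data.
corpus = [
    Chunk("pricing complaint", "cmo", [1.0, 0.0]),
    Chunk("security audit fatigue", "ciso", [1.0, 0.0]),
    Chunk("attribution is broken", "cmo", [0.6, 0.8]),
    Chunk("unrelated hobby talk", "cmo", [0.0, 1.0]),
]
top = retrieve([1.0, 0.0], corpus, audience="cmo", k=2)
```

Filtering by audience before ranking means a CISO comment never reaches a CMO persona, even when it is the closest match to the query.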

What the persona actually sees
Before answering any question, every synthetic persona receives a prompt that includes (1) their character profile, (2) the relevant public research findings for the study topic, and (3) 15 specific Reddit comments from communities where this kind of person actually posts. The instruction is explicit: embody the voice, do not quote verbatim. The result is synthetic answers that reference real vendors, mirror real frustrations, and surface real disagreements, because the source material does.
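
As a rough sketch of how those three parts might be assembled into one prompt: the function name, section labels, and instruction wording below are illustrative assumptions, not the actual prompt template.

```python
def build_persona_prompt(profile: str, findings: list[str], comments: list[str]) -> str:
    """Assemble profile, research findings, and community comments into one prompt."""
    lines = ["CHARACTER PROFILE", profile, "", "PUBLIC RESEARCH FINDINGS"]
    lines += [f"- {finding}" for finding in findings]
    lines += ["", "COMMUNITY VOICE (for tone only)"]
    lines += [f"[{i}] {comment}" for i, comment in enumerate(comments, start=1)]
    # Hypothetical wording of the "embody, do not quote" instruction.
    lines += ["", "Embody the voice of these comments in character. Do not quote them verbatim."]
    return "\n".join(lines)


# Placeholder inputs; a real call would pass the retrieved chunks and findings.
prompt = build_persona_prompt(
    profile="<persona profile>",
    findings=["<finding 1>", "<finding 2>"],
    comments=[f"<comment {n}>" for n in range(1, 16)],
)
```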

How confidence is scored

Every report shows a confidence percentage. It's not marketing math. It reflects three real signals:

Sample size

Bigger panels carry more weight. A 4-persona pilot caps lower than a 20-persona study. We're honest about this; you'll never see a 95% confidence number from synthetic research because that would be false precision.

Grounding coverage

Did each persona actually get 15 relevant Reddit chunks? If your study topic has thin community discussion, coverage drops and confidence drops with it.

Semantic relevance

The average cosine similarity between the study query and the chunks retrieved for it. High similarity means the corpus has direct signal on your question. Low similarity means we're reaching, and the confidence number reflects that.

The ceiling is 78%, not 95%

Without strong grounding, studies cap around 55%. With strong grounding, they reach into the 70s. Anything claiming higher than that from synthetic data is either lying or counting margin of error as confidence. We don't do that.
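
One way the three signals could combine into a capped score is sketched below. Only the 55 and 78 figures and the 15-chunk target come from this page. The formula, the 20-persona "full panel" reference, and the 0.5 weighting are assumptions chosen for illustration and do not describe the actual scoring function.

```python
from statistics import mean

CEILING = 78.0   # stated ceiling for strongly grounded studies
WEAK_CAP = 55.0  # stated cap for studies without strong grounding


def confidence(n_personas: int, chunks_per_persona: list[int], similarities: list[float],
               target_chunks: int = 15, full_panel: int = 20) -> float:
    """Illustrative confidence percentage built from the three signals."""
    if n_personas <= 0 or not chunks_per_persona or not similarities:
        return 0.0
    # Sample size: assumed to saturate at a 20-persona panel.
    sample = min(n_personas / full_panel, 1.0)
    # Grounding coverage: share of the 15-chunk target each persona received.
    coverage = mean([min(c / target_chunks, 1.0) for c in chunks_per_persona])
    # Semantic relevance: mean cosine similarity, clipped to [0, 1].
    relevance = min(max(mean(similarities), 0.0), 1.0)
    # The cap rises from 55 toward 78 as grounding improves (assumed linear).
    cap = WEAK_CAP + (CEILING - WEAK_CAP) * coverage * relevance
    # Smaller panels earn a smaller share of the cap (assumed weighting).
    return round(cap * (0.5 + 0.5 * sample), 1)


strong = confidence(20, [15] * 20, [0.8] * 15)
pilot = confidence(4, [15] * 4, [0.8] * 15)
```

Under these assumptions a 4-persona pilot scores below a 20-persona study with identical grounding, and no input can push the result past 78.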

Source list

Everything we use is public, verifiable, and citable.

01 American Customer Satisfaction Index. 200,000 customer surveys, 47 industries.
02 Pew Research Center. 5,000+ adult surveys on AI attitudes, technology adoption, consumer behavior.
03 Edelman Trust Barometer. 34,000 respondents across 28 countries on institutional trust, business ethics, and consumer brand perception.
04 Bureau of Labor Statistics. Consumer Expenditure Survey. Real household spending data by demographic.
05 Reddit Public Archives. Real practitioner and consumer voice from 69 curated subreddits via Arctic Shift, the open Pushshift successor.

Privacy and anonymization

Reddit data is public, but we treat it with care. Before any comment enters our corpus we:

  • SHA-256 hash all usernames before storage (a one-way hash, so raw usernames are never stored)
  • Strip URLs, email addresses, phone numbers, and direct user mentions
  • Filter known bot accounts and removed content
  • Require minimum upvote score and length to filter noise

Output never contains verbatim Reddit quotes. The persona prompt explicitly forbids copying source material; it instructs the model to embody the voice in character, not quote it.
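
The anonymization steps listed above might look roughly like this. It is a simplified sketch: the regular expressions are deliberately crude, the `min_score` and `min_length` thresholds are hypothetical, and the comment dictionary shape (`author`, `body`, `score`) is an assumption for the example.

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")       # crude: any long digit run
MENTION_RE = re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+")  # u/name or /u/name


def hash_username(username: str) -> str:
    """Replace a username with its SHA-256 hex digest."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


def scrub(text: str) -> str:
    """Strip URLs, emails, phone numbers, and user mentions, then tidy spaces."""
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE, MENTION_RE):
        text = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


def keep(comment: dict, known_bots: set[str],
         min_score: int = 5, min_length: int = 80) -> bool:
    """Drop bot accounts, removed content, and low-score or short comments."""
    if comment["author"] in known_bots:
        return False
    if comment["body"] in ("[removed]", "[deleted]"):
        return False
    return comment["score"] >= min_score and len(comment["body"]) >= min_length


cleaned = scrub("ping u/alice at a@b.com or https://x.io/y now")
```

A plain SHA-256 digest cannot be inverted directly, but because Reddit usernames are public, someone could hash candidate names and compare. A keyed hash such as HMAC with a secret key would close that gap; whether the production pipeline does this is not stated here.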

What we don't claim

  • We don't claim to predict behavior. Synthetic research surfaces hypotheses and signal. Real-world behavior is measured by real customers, sales data, and conversion tests.
  • We don't use neural data, eye tracking, or EEG. Other vendors claim to. We don't.
  • We don't inflate confidence with margin-of-error math. 95% confidence intervals on synthetic data are statistical theater.
  • We don't replace human research. Synthetic is a force multiplier for speed and hypothesis generation. Real Gather interviews on gatherhq.com are how you validate the signal.