How Gather Synthetic actually works.
Synthetic research is only useful if you can trust it. Here's exactly what grounds our AI personas, how we score confidence, and where the limits are.
The problem with most synthetic research
Most synthetic research is an LLM doing character work. A model is told to act like a CMO, given a prompt about the topic, and asked for opinions. The output sounds plausible because LLMs are trained on the internet, which includes plenty of marketing content. But it's not grounded. It's the model's prior, dressed up in a costume.
This works fine for brainstorming. It fails for decisions. Buyers can usually tell. Synthetic respondents agree too readily, lack specific frustrations, never reference real vendors by name, and produce safe generic insights. You learn nothing new.
How we ground personas
Real practitioner voice
We index high-signal Reddit comments from 69 curated subreddits where real CMOs, CTOs, CISOs, CFOs, consumers, and category buyers actually talk. Comments are filtered by upvote score and length, screened for bots, and stripped of PII before storage.
Verified public research
63 citable data points from ACSI (200K customer surveys), Pew (5K+ adult studies), Edelman Trust Barometer (34K respondents), Bureau of Labor Statistics, and other public sources. Every finding traceable to a source.
Semantic retrieval
At interview time, the study topic and persona role are embedded into a query vector. We pull the 15 most semantically similar real chunks for each persona, filtered by audience type. Each persona reads grounding specific to them.
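A minimal sketch of this retrieval step, assuming the embeddings are produced elsewhere and held in memory. The `Chunk` type, the audience labels, and the `retrieve` function are hypothetical names for illustration, not our production code; a real deployment would more likely query a vector index than sort a list.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    audience: str        # hypothetical audience label, e.g. "cmo" or "ciso"
    vector: list[float]  # embedding computed elsewhere (assumed)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, chunks, audience, k=15):
    """Return up to k (similarity, chunk) pairs for one persona, best match first."""
    # Filter by audience type first, then rank what remains by similarity.
    candidates = [c for c in chunks if c.audience == audience]
    scored = [(cosine(query_vector, c.vector), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

If fewer than `k` chunks match the audience filter, the function simply returns fewer, which is the thin-coverage case.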
How confidence is scored
Every report shows a confidence percentage. It's not marketing math. It reflects three real signals:
Panel size
Bigger panels carry more weight. A 4-persona pilot caps lower than a 20-persona study. We're honest about this; you'll never see a 95% confidence number from synthetic research because that would be false precision.
Grounding coverage
Did each persona actually get 15 relevant Reddit chunks? If your study topic has thin community discussion, coverage drops and confidence drops with it.
Retrieval similarity
We average the cosine similarity of the retrieved chunks. High similarity means the corpus has direct signal on your question. Low similarity means we're reaching, and the confidence number reflects that.
Without strong grounding, studies cap around 55%. With strong grounding, they reach into the 70s. Anyone claiming higher than that from synthetic data is either lying or passing off margin of error as confidence. We don't do that.
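To make the shape of the score concrete, here is an illustrative sketch of how the three signals (panel size, grounding coverage, and average retrieval similarity) could combine. The formula and every constant in it are assumptions chosen only to reproduce the behavior described above: a ceiling near 55% with no grounding, a ceiling in the 70s with strong grounding, and a lower cap for small panels. It is not the actual scoring function.

```python
def confidence(panel_size, coverage, avg_similarity, full_panel=20):
    """Illustrative confidence score in [0, 0.75]. All weights are assumed.

    coverage:       fraction of the requested chunks actually retrieved (0..1)
    avg_similarity: mean cosine similarity of the retrieved chunks (0..1)
    """
    coverage = min(max(coverage, 0.0), 1.0)
    avg_similarity = min(max(avg_similarity, 0.0), 1.0)

    # Grounding is strong only when coverage AND similarity are both high.
    grounding = coverage * avg_similarity

    # Ceiling runs from 0.55 (no grounding) to 0.75 (perfect grounding).
    ceiling = 0.55 + 0.20 * grounding

    # Small panels are discounted; panels at or above full_panel get full weight.
    panel_weight = 0.6 + 0.4 * min(panel_size / full_panel, 1.0)

    return ceiling * panel_weight
```

Under these assumed weights, a 20-persona study with no grounding scores 0.55, and a 4-persona pilot scores lower than a 20-persona study given the same grounding.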
Source list
Everything we use is public, verifiable, and citable.
Privacy and anonymization
Reddit data is public, but we treat it with care. Before any comment enters our corpus we:
- SHA-256 hash all usernames before storage (irreversible)
- Strip URLs, email addresses, phone numbers, and direct user mentions
- Filter known bot accounts and removed content
- Require minimum upvote score and length to filter noise
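The steps above can be sketched as a single pre-storage function. This is a minimal sketch under stated assumptions: the thresholds, the bot list, the regular expressions, and the comment field names (`author`, `score`, `body`) are all hypothetical, and the phone pattern is deliberately aggressive, so it may also strip other long digit runs.

```python
import hashlib
import re

# Hypothetical patterns and thresholds, for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
MENTION_RE = re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+")

MIN_SCORE = 5                      # assumed minimum upvote score
MIN_LENGTH = 200                   # assumed minimum comment length in characters
KNOWN_BOTS = {"AutoModerator"}     # assumed bot list
REMOVED_MARKERS = {"[removed]", "[deleted]"}


def hash_username(username):
    """One-way SHA-256 digest of the username, as a hex string."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


def scrub(text):
    """Strip URLs, email addresses, phone numbers, and user mentions."""
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE, MENTION_RE):
        text = pattern.sub("", text)
    # Collapse the whitespace left behind by removals.
    return re.sub(r"\s{2,}", " ", text).strip()


def prepare(comment):
    """Return a storable record, or None if the comment is filtered out."""
    body = comment["body"].strip()
    if comment["author"] in KNOWN_BOTS or body in REMOVED_MARKERS:
        return None
    if comment["score"] < MIN_SCORE or len(body) < MIN_LENGTH:
        return None
    return {
        "author_hash": hash_username(comment["author"]),
        "text": scrub(body),
        "score": comment["score"],
    }
```

One design note: a plain digest cannot be inverted, but anyone who already knows a username can recompute its digest. If that matters, a keyed hash such as `hmac.new(secret, name, hashlib.sha256)` from the standard library is one option.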
Output never contains verbatim Reddit quotes. The persona prompt explicitly forbids copying source material; it instructs the model to embody the voice in character, not quote it.
What we don't claim
- We don't claim to predict behavior. Synthetic research surfaces hypotheses and signal. Real-world behavior is measured by real customers, sales data, and conversion tests.
- We don't use neural data, eye tracking, or EEG. Other vendors claim to. We don't.
- We don't inflate confidence with margin-of-error math. 95% confidence intervals on synthetic data are statistical theater.
- We don't replace human research. Synthetic is a force multiplier for speed and hypothesis generation. Real Gather interviews on gatherhq.com are how you validate the signal.