Google I/O Hackathon · May 2026

Delphi

Synthetic populations as a computational substrate. You ask any question (will the Fed cut rates?, pretest this tagline, stress-test this decision) and dozens of Gemini 3.5 Flash sub-agents reason about it in parallel. Each sub-agent role-plays a different American persona generated from US Census demographics and grounded in live web data. In about 60 seconds you get a probabilistic forecast plus a Wall Street Journal-style synthesis.

Initial: ~15%
After CPI shock: ~80%
Headline shift: +65 pp
Time to swing: ~30 s

What it does

Three modes on one architecture. Forecast mode produces a probability with confidence interval. Pretest mode returns positive / neutral / negative sentiment for a tagline, message, or product concept. Stress-test mode surfaces the strongest objections to a decision before you ship it.

Every run returns: a headline number, ±1σ band, the strongest reasons for and against drawn from the agents' own reasoning, where demographics diverged most, and one striking outlier quote — all synthesized into a publishable paragraph by a final Gemini call.
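
The numeric part of that output (headline number, ±1σ band, largest demographic gap) can be sketched as a small aggregation step. This is a minimal sketch under stated assumptions, not the project's aggregator: the `AgentAnswer` type, the `aggregate` function, and the choice to take σ as the population standard deviation of the per-agent probabilities are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AgentAnswer:
    group: str          # hypothetical demographic label, e.g. an age band
    probability: float  # this agent's probability for the event, in [0, 1]


def aggregate(answers: list[AgentAnswer]) -> dict:
    """Reduce per-agent probabilities to a headline, a band, and a group gap."""
    if not answers:
        raise ValueError("need at least one agent answer")

    probs = [a.probability for a in answers]
    headline = mean(probs)
    sigma = pstdev(probs)  # assumption: sigma is the spread across agents

    # Collect probabilities per demographic group, then average each group.
    by_group: dict[str, list[float]] = {}
    for a in answers:
        by_group.setdefault(a.group, []).append(a.probability)
    group_means = {g: mean(ps) for g, ps in by_group.items()}

    # The two groups whose mean forecasts sit furthest apart.
    low = min(group_means, key=group_means.get)
    high = max(group_means, key=group_means.get)

    return {
        "headline": headline,
        # Clamp the band so it stays a valid probability range.
        "band": (max(0.0, headline - sigma), min(1.0, headline + sigma)),
        "widest_gap": (low, high, group_means[high] - group_means[low]),
    }
```

The reasons for and against, the outlier quote, and the synthesized paragraph come from model calls and are left out of this sketch.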

Forecast · Pretest · Stress-test · News-shock re-run · Persona drill-down

Architecture
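
The fan-out described at the top of this page (personas drawn from demographic weights, one sub-agent per persona, all reasoning in parallel) might look roughly like the sketch below. Everything named here is an assumption for illustration: the `Persona` fields, the weight tables (which are made-up numbers, not Census figures), and `ask_agent`, a stub standing in for the real model call. Failed agents are dropped rather than retried, which is one way a per-agent success rate below 100% could arise.

```python
import asyncio
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    age_band: str
    region: str


# Illustrative weights only; NOT real Census figures.
AGE_BANDS = {"18-34": 0.3, "35-64": 0.5, "65+": 0.2}
REGIONS = {"Northeast": 0.17, "Midwest": 0.21, "South": 0.38, "West": 0.24}


def sample_personas(n: int, seed: int = 0) -> list[Persona]:
    """Draw n personas, each attribute sampled by its weight."""
    rng = random.Random(seed)
    ages = rng.choices(list(AGE_BANDS), weights=list(AGE_BANDS.values()), k=n)
    regions = rng.choices(list(REGIONS), weights=list(REGIONS.values()), k=n)
    return [Persona(a, r) for a, r in zip(ages, regions)]


async def ask_agent(persona: Persona, question: str) -> float:
    """Hypothetical stub for the model call; returns a fixed probability."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return 0.5


async def run_swarm(question, n=24, max_concurrency=8, agent=ask_agent):
    """Run one agent per persona in parallel; return answers and success rate."""
    sem = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def guarded(persona: Persona) -> float:
        async with sem:
            return await agent(persona, question)

    personas = sample_personas(n)
    # return_exceptions=True keeps one failed agent from cancelling the rest.
    results = await asyncio.gather(
        *(guarded(p) for p in personas), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    return ok, len(ok) / n
```

Calling `asyncio.run(run_swarm("Will the Fed cut rates?"))` returns the list of per-agent probabilities and the fraction of agents that succeeded.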

Validation

Architecture · automated test suite (aggregator math, schema parsing, HTTP & WS lifecycle): 27 / 27
Persona stability · adversarial LLM-as-judge (in-character score across 24 charged-prompt trials): 4.92 / 5
Model dependence · Flash 3.5 vs Flash 2.5 (per-agent success rate): +87.5 pp
Scale · end-to-end at N = 200 (wall-clock 207 s, 89% per-agent success): N = 200

Not yet calibrated against real polling baselines. Synthetic forecasts have not been benchmarked against Polymarket, Pew, Gallup, or the Good Judgment Project (GJP). The next research piece is a backtest harness that runs on historical events and scores the forecasts by Brier loss against the GJP median.
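
For reference, the Brier loss mentioned here is the mean squared difference between forecast probabilities and binary outcomes; lower is better, and a constant 50% forecast scores 0.25. The sketch below shows the scoring step only. The function name and the sample numbers are illustrative, not results from this project.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Made-up example: three resolved events (1 = happened, 0 = did not).
outcomes = [1, 0, 1]
swarm = [0.8, 0.3, 0.6]     # hypothetical synthetic-population forecasts
baseline = [0.5, 0.5, 0.5]  # uninformative reference forecaster

swarm_loss = brier_score(swarm, outcomes)
baseline_loss = brier_score(baseline, outcomes)
```

In a backtest, the baseline list would be replaced by the reference forecaster's probabilities for the same events, and the two losses compared.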

The wedge

Forecasting is one use case. The same primitive pre-tests marketing campaigns at population scale, war-games policy decisions before they ship, runs synthetic juries for trial-message testing, and models behaviour for public-health interventions. Anywhere you need to ask "how will the population react?", this is the substrate.

Until Flash 3.5's cost × speed × intelligence frontier landed, running hundreds of grounded reasoning agents in parallel at conversational latency was not economically feasible. The cross-model comparison under Validation is the evidence: on the same prompts, Flash 3.5's per-agent success rate is 87.5 pp higher than Flash 2.5's, and the previous generation collapses.

Decision-support tool. Not a calibrated forecasting product. Synthetic forecasts have not been validated against real human polling baselines. Use the output as directional evidence — where the swarm converges and where demographic groups diverge — not as ground truth. Calibration against historical events scored by Brier loss is the next research piece.