Google I/O Hackathon · May 2026

Delphi

Synthetic populations as a computational substrate. Ask it anything ("will the Fed cut rates?", "pretest this tagline", "stress-test this decision") and dozens of Gemini 3.5 Flash sub-agents reason in parallel, each role-playing a different American persona generated from US Census demographics and grounded in live web search. In about 60 seconds you get a probabilistic forecast plus a Wall Street Journal-style synthesis.


Initial ~15% · After CPI shock ~80% · Headline shift +65 pp · Time to swing ~30 s


What it does

Three modes on one architecture. Forecast mode produces a probability with a confidence interval. Pretest mode returns positive / neutral / negative sentiment for a tagline, message, or product concept. Stress-test mode surfaces the strongest objections to a decision before you ship it.

Every run returns a headline number, a ±1σ band, the strongest reasons for and against (drawn from the agents' own reasoning), the demographic axes where opinion diverged most, and one striking outlier quote. A final Gemini call synthesizes all of it into a publishable paragraph.

Forecast · Pretest · Stress-test · News-shock re-run · Persona drill-down

Architecture

N parallel Gemini 3.5 Flash sub-agents, each with a persona-specific system prompt sampled from US Census–aligned demographic axes (age, region across 9 Census divisions, education, income, occupation, belief axis).
Native Google Search grounding on every agent: they read live news as they reason rather than querying a stale embedding index.
Structured JSON output via a Pydantic schema: the model picks a position, a confidence, and its reasoning, and cannot drift off format.
Aggregator computes mean, ±1σ, distribution buckets, and a per-demographic-axis breakdown — deterministic Python, no LLM in the critical path.
Final synthesis call — one Gemini call after aggregation produces a WSJ-style narrative, reasons for / against, a demographic split sentence, and a curated outlier quote.
News-shock re-run — same personas, augmented context, re-reason in ~30s. The synthesis paragraph rewrites itself to cite the new information.
Production cap of N = 20 for sub-90-second runs at conversational latency. Stress-tested end-to-end up to N = 200.
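
The aggregation step described above can be sketched in plain Python. This is a minimal illustration and not the project's code: `AgentVote`, `aggregate`, the field names, and the five equal-width buckets are all assumptions, and the choice of population standard deviation for the ±1σ band is a guess at one reasonable convention.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AgentVote:
    # Hypothetical record: one agent's parsed answer plus its persona axes.
    probability: float        # the agent's stated probability, 0.0 to 1.0
    persona: dict[str, str]   # e.g. {"region": "Pacific", "age": "18-29"}


def aggregate(votes: list[AgentVote], n_buckets: int = 5) -> dict:
    """Deterministic summary: mean, +/-1 sigma band, histogram, per-axis means."""
    if not votes:
        raise ValueError("need at least one successful agent")

    probs = [v.probability for v in votes]
    mu = mean(probs)
    sigma = pstdev(probs)  # assumption: population std dev, not sample std dev

    # Equal-width histogram over [0, 1]; p == 1.0 is clamped into the last bucket.
    buckets = [0] * n_buckets
    for p in probs:
        buckets[min(int(p * n_buckets), n_buckets - 1)] += 1

    # Mean probability for each value of each demographic axis.
    grouped = defaultdict(lambda: defaultdict(list))
    for v in votes:
        for axis, value in v.persona.items():
            grouped[axis][value].append(v.probability)
    by_axis = {
        axis: {value: mean(ps) for value, ps in values.items()}
        for axis, values in grouped.items()
    }

    return {
        "mean": mu,
        "band": (max(0.0, mu - sigma), min(1.0, mu + sigma)),  # clipped to [0, 1]
        "buckets": buckets,
        "by_axis": by_axis,
    }
```

Because nothing here calls a model, the same set of votes always produces the same headline number, which is the property the "no LLM in the critical path" design is after.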

Validation

Architecture · automated test suite · aggregator math, schema parsing, HTTP & WS lifecycle · 27 / 27

Persona stability · adversarial LLM-as-judge · in-character score across 24 charged-prompt trials · 4.92 / 5

Model dependence · Flash 3.5 vs Flash 2.5 · per-agent success rate · +87.5 pp

Scale · end-to-end at N = 200 · wall-clock 207 s, 89% per-agent success · N = 200
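
A per-agent success rate below 100% implies a fan-out in which slow or failed agents are dropped instead of blocking the whole run. The sketch below shows one way to get that behaviour with asyncio. The names `fake_agent` and `convene` and the timeout parameter are hypothetical, and the stub sleeps where a real agent would make a grounded Gemini call.

```python
import asyncio
import random


async def fake_agent(persona_id: int, question: str) -> dict:
    """Stand-in for one grounded model call; NOT the real Gemini client."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"persona_id": persona_id, "probability": random.random()}


async def convene(question: str, n: int, timeout_s: float, agent=fake_agent):
    """Run n agents concurrently; drop any that time out or raise.

    Returns (successful_results, per_agent_success_rate).
    """
    async def one(persona_id: int):
        try:
            return await asyncio.wait_for(
                agent(persona_id, question), timeout=timeout_s
            )
        except Exception:
            # Timeouts and agent errors both count as a failed agent.
            return None

    results = await asyncio.gather(*(one(i) for i in range(n)))
    ok = [r for r in results if r is not None]
    return ok, (len(ok) / n if n else 0.0)


if __name__ == "__main__":
    votes, rate = asyncio.run(convene("Will the Fed cut rates?", n=20, timeout_s=1.0))
    print(f"{len(votes)} agents answered, success rate {rate:.0%}")
```

Wrapping each call in its own `asyncio.wait_for` means one stalled agent costs at most `timeout_s` of wall-clock time, and the aggregate is computed over whichever agents returned.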

Not yet calibrated against real polling baselines. Synthetic forecasts have not been benchmarked against Polymarket, Pew, Gallup, or the Good Judgment Project (GJP). The next research piece is a backtest harness over historical events, scored by Brier loss and compared against the GJP median.
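
For reference, the Brier loss for binary events is the mean squared difference between the forecast probability and the 0/1 outcome: 0 is perfect, and a constant 50% forecast scores 0.25. A minimal sketch of the scoring function such a harness would need follows; the function name and signature are illustrative, since no backtest code exists in the project yet.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better. Hypothetical helper for a future backtest harness.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
```

Scoring the swarm's forecasts and a reference forecaster's forecasts on the same resolved events with this function gives two directly comparable numbers.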

Try it

Live app

Type a question, pick a mode (forecast / pretest / stress), set N, and hit "Convene swarm". Inject a news shock to see the headline swing. The app runs on free-tier Gemini quota, so expect a few agents to time out.

Source on GitHub

FastAPI + asyncio + google-genai + Tavily fallback grounding · Next 15 + React 19 + Three.js (R3F) frontend · 27 / 27 tests · MIT.

Demo video

Three-minute walkthrough — convene a swarm on the Fed-cut question, drill into a persona, inject a CPI shock, watch the +65 pp swing.

Slides + evaluation deck

20-slide deck covering the pitch, three live showcase modes, adversarial stability eval, cross-model comparison, limitations, and roadmap.

PRD

Product requirements doc — vision, problem, target users, goals, functional & non-functional requirements, architecture, risks, and 12-month roadmap.

Live API docs

FastAPI / Swagger UI for the deployed backend on Railway. All endpoints documented with request and response schemas.

The wedge

Forecasting is one use case. The same primitive pre-tests marketing campaigns at population scale, war-games policy decisions before they ship, runs synthetic juries for trial-message testing, and models behaviour for public-health interventions. Anywhere you need to ask "how will the population react?", this is the substrate.

Until Flash 3.5's cost × speed × intelligence frontier landed, running hundreds of grounded reasoning agents in parallel at conversational latency was economically impossible. The cross-model comparison above is the evidence: the previous generation collapses on the same prompts.

Decision-support tool. Not a calibrated forecasting product. Synthetic forecasts have not been validated against real human polling baselines. Use the output as directional evidence — where the swarm converges and where demographic groups diverge — not as ground truth. Calibration against historical events scored by Brier loss is the next research piece.