Google I/O Hackathon · May 2026
Delphi
Synthetic populations as a computational substrate. Ask any question
("Will the Fed cut rates?", "Pretest this tagline", "Stress-test this
decision") and dozens of Gemini 3.5 Flash sub-agents reason in parallel,
each role-playing a different American persona generated from US Census
demographics and grounded in live web search. In about 60 seconds you get
a probabilistic forecast plus a Wall Street Journal-style synthesis.
Initial: ~15%
After CPI shock: ~80%
Headline shift: +65 pp
Time to swing: ~30 s
What it does
Three modes on one architecture. Forecast mode produces a probability
with a ±1σ band. Pretest mode returns positive / neutral / negative
sentiment for a tagline, message, or product concept. Stress-test mode
surfaces the strongest objections to a decision before you ship it.
Every run returns: a headline number, ±1σ band, the strongest reasons
for and against drawn from the agents' own reasoning, where
demographics diverged most, and one striking outlier quote — all
synthesized into a publishable paragraph by a final Gemini call.
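As a rough illustration of that return shape, the fields could be modelled as below. This is a sketch, not the project's schema: the class name `RunResult`, its field names, and the `band()` and `headline()` helpers are all hypothetical, a plain dataclass stands in for whatever model the backend actually uses, and the sigma value is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical container for one swarm run; all names are assumptions."""
    mean: float                 # headline number as a probability in [0, 1]
    sigma: float                # one standard deviation across agents
    reasons_for: list[str] = field(default_factory=list)
    reasons_against: list[str] = field(default_factory=list)
    widest_split: str = ""      # demographic axis where groups diverged most
    outlier_quote: str = ""
    synthesis: str = ""         # paragraph written by the final synthesis call

    def band(self) -> tuple[float, float]:
        """Return the mean ± 1 sigma band, clamped to [0, 1]."""
        low = max(0.0, self.mean - self.sigma)
        high = min(1.0, self.mean + self.sigma)
        return low, high

    def headline(self) -> str:
        """Format the headline number together with its band."""
        low, high = self.band()
        return f"{self.mean:.0%} (±1σ: {low:.0%} to {high:.0%})"


# Illustrative values only: 0.80 echoes the post-shock figure, 0.12 is invented.
result = RunResult(mean=0.80, sigma=0.12, widest_split="income")
print(result.headline())
```

Clamping the band keeps a wide spread near 0% or 100% from producing an impossible probability in the headline.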
Forecast · Pretest · Stress-test · News-shock re-run · Persona drill-down
Architecture
- N parallel Gemini 3.5 Flash sub-agents, each with a
  persona-specific system prompt sampled from US Census-aligned
  demographic axes (age, region across the 9 Census divisions,
  education, income, occupation, belief axis).
- Native Google Search grounding on every agent: they read live news
  as they reason instead of relying on a stale embedding index.
- Structured JSON output via a Pydantic schema: the model picks
  position, confidence, and reasoning, and cannot drift off format.
- Aggregator computes mean, ±1σ, distribution buckets, and a
  per-demographic-axis breakdown in deterministic Python, with no LLM
  in the critical path.
- Final synthesis call: one Gemini call after aggregation produces a
  WSJ-style narrative, reasons for and against, a demographic-split
  sentence, and a curated outlier quote.
- News-shock re-run: the same personas re-reason with augmented
  context in ~30 s. The synthesis paragraph rewrites itself to cite
  the new information.
- Production cap of N = 20 for sub-90-second runs at conversational
  latency. Stress-tested end-to-end up to N = 200.
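The fan-out and aggregation steps above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the project's code: `ask_agent` is a stub that returns a seeded random probability in place of a grounded model call, and the names `Persona`, `AgentAnswer`, `convene`, and `aggregate`, the age bands, the five-bucket histogram, and the use of population standard deviation are all illustrative choices.

```python
import asyncio
import random
import statistics
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    age_band: str
    region: str


@dataclass(frozen=True)
class AgentAnswer:
    persona: Persona
    probability: float   # the agent's position as a probability in [0, 1]
    reasoning: str


async def ask_agent(persona: Persona, question: str, seed: int) -> AgentAnswer:
    """Stub standing in for one grounded model call (hypothetical)."""
    rng = random.Random(seed)
    await asyncio.sleep(rng.uniform(0.0, 0.02))   # simulated network latency
    return AgentAnswer(persona, rng.random(), f"stub reasoning for: {question}")


async def convene(personas, question, timeout_s=1.0):
    """Run all agents concurrently; drop any that fail or time out."""
    tasks = [
        asyncio.wait_for(ask_agent(p, question, seed=i), timeout_s)
        for i, p in enumerate(personas)
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if isinstance(r, AgentAnswer)]


def aggregate(answers, n_requested):
    """Deterministic summary: mean, 1 sigma, buckets, per-region means."""
    if not answers:
        raise ValueError("no agent returned an answer")
    probs = [a.probability for a in answers]
    buckets = [0] * 5                     # five equal-width bins over [0, 1]
    by_region = defaultdict(list)
    for a in answers:
        buckets[min(int(a.probability * 5), 4)] += 1
        by_region[a.persona.region].append(a.probability)
    return {
        "mean": statistics.fmean(probs),
        "sigma": statistics.pstdev(probs),
        "buckets": buckets,
        "by_region": {k: statistics.fmean(v) for k, v in by_region.items()},
        "success_rate": len(answers) / n_requested,
    }


# 4 age bands x 5 Census divisions = 20 personas (the stated production cap).
personas = [
    Persona(age, region)
    for age in ("18-29", "30-49", "50-64", "65+")
    for region in ("New England", "Pacific", "Mountain",
                   "South Atlantic", "West North Central")
]
answers = asyncio.run(convene(personas, "Will the Fed cut rates?"))
summary = aggregate(answers, n_requested=len(personas))
print(summary["mean"], summary["sigma"], summary["buckets"])
```

Passing `return_exceptions=True` to `asyncio.gather` means one timed-out agent does not cancel the rest, which matches the idea of reporting a per-agent success rate rather than failing the whole run.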
Validation
Architecture · automated test suite: 27 / 27
Aggregator math, schema parsing, HTTP & WS lifecycle.
Persona stability · adversarial: 4.92 / 5
LLM-as-judge in-character score across 24 charged-prompt trials.
Model dependence: +87.5 pp
Gemini 3.5 Flash vs Gemini 2.5 Flash, per-agent success rate.
Scale: N = 200
End-to-end at N = 200, wall-clock 207 s, 89% per-agent success.
Not yet calibrated against real polling baselines.
Synthetic forecasts have not been benchmarked against Polymarket, Pew,
Gallup, or the Good Judgment Project (GJP). That is the next research
piece: a backtest harness over historical events, scored by Brier loss
against the GJP median.
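For readers unfamiliar with the metric, Brier loss is the mean squared difference between forecast probabilities and 0/1 outcomes, so lower is better. The sketch below shows how such a comparison might be scored; the function name `brier_score` and every number in it are illustrative assumptions, not results from the project or from any reference forecaster.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty sequences")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Illustrative numbers only; these are not measured results.
outcomes = [1, 0, 1, 1]                 # what actually happened
swarm = [0.80, 0.30, 0.60, 0.90]        # hypothetical swarm forecasts
baseline = [0.70, 0.20, 0.75, 0.85]     # hypothetical reference medians

swarm_loss = brier_score(swarm, outcomes)
baseline_loss = brier_score(baseline, outcomes)
print(f"swarm {swarm_loss:.4f} vs baseline {baseline_loss:.4f}")
```

A constant 50% forecast scores 0.25 on any set of binary events, which gives a useful floor for judging whether either forecaster carries information.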
Try it
Live app
Type a question, pick a mode (forecast / pretest / stress), set N,
hit Convene swarm. Inject a news shock to see the headline
swing. Free-tier Gemini quota; expect a few timed-out agents.
Source on GitHub
FastAPI + asyncio + google-genai + Tavily fallback grounding · Next.js
15 + React 19 + Three.js (R3F) frontend · 27 / 27 tests · MIT license.
Demo video
Three-minute walkthrough — convene a swarm on the Fed-cut
question, drill into a persona, inject a CPI shock, watch the
+65 pp swing.
Slides + evaluation deck
20-slide deck covering the pitch, three live showcase modes,
adversarial stability eval, cross-model comparison, limitations,
and roadmap.
PRD
Product requirements doc — vision, problem, target users, goals,
functional & non-functional requirements, architecture, risks,
and 12-month roadmap.
Live API docs
FastAPI / Swagger UI for the deployed backend on Railway. All
endpoints documented with request and response schemas.
The wedge
Forecasting is one use case. The same primitive pre-tests marketing
campaigns at population scale, war-games policy decisions before
they ship, runs synthetic juries for trial-message testing, and
models behaviour for public-health interventions. Anywhere you need
to ask "how will the population react?", this is the substrate.
Until Gemini 3.5 Flash's cost × speed × intelligence frontier landed,
running hundreds of grounded reasoning agents in parallel at
conversational latency was economically impossible. The cross-model
comparison above is the evidence: the previous generation collapses on
the same prompts.
Decision-support tool. Not a calibrated forecasting product.
Synthetic forecasts have not been validated against real human polling
baselines. Use the output as directional evidence — where the
swarm converges and where demographic groups diverge — not as ground
truth. Calibration against historical events scored by Brier loss is the
next research piece.