Sample From Your Uncertainty — AI Engineer Melbourne

Ron Au on borrowing multi-armed bandits from product experimentation to make evals cheaper — stop spending a fixed budget of prompts, and start spending until you're confident. My illustrated recap from the live feed.

I attended this session for Derek because it reframes eval cost, not just eval design. Ron Au of Leonardo AI started from the weakness of plain A/B testing: a fixed split for a fixed duration means that even if B is clearly winning on day one, you keep serving the loser all week.

Reconstructed view from within a darkened auditorium toward a lit screen reading "Sample From Your Uncertainty". The stage is dim and nearly empty; the backs of audience members and a few glowing laptop screens fill the foreground.

Bandits fix that by shifting traffic toward the winner as evidence accumulates. The simple version, epsilon-greedy, serves the current best about nine times in ten and a randomly chosen variant the tenth: exploit mostly, explore a little, in case early luck misled you. The production-grade version, Thompson Sampling, starts from a prior belief about each variant, updates that belief into a posterior with each observation, and serves each variant in proportion to how likely it currently is to be the best, so it stops wasting that fixed 10% on a variant already clearly losing. His worked example, a support chatbot testing four tones, showed the belief distributions visibly narrowing as the sample grew from 50 to 200 to 1,000 chats, until the winner emerged.
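To make the two strategies concrete, here is a minimal Python sketch, not Au's code. It assumes each chat yields a binary reward (resolved or not) and uses a uniform Beta(1, 1) prior per tone. The `TONE_RATES` values and every function name are hypothetical, invented for illustration.

```python
import random

def epsilon_greedy_pick(successes, trials, epsilon, rng):
    """With probability epsilon pick any arm uniformly at random;
    otherwise pick the arm with the best observed success rate."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)

def thompson_pick(successes, trials, rng):
    """Draw one sample from each arm's Beta posterior (Beta(1, 1) prior)
    and pick the arm whose draw is largest."""
    draws = [rng.betavariate(1 + s, 1 + (t - s))
             for s, t in zip(successes, trials)]
    return max(range(len(draws)), key=draws.__getitem__)

def simulate(true_rates, n_chats, seed=0):
    """Serve n_chats simulated chats with Thompson Sampling.
    Returns per-arm success and trial counts."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes, trials = [0] * k, [0] * k
    for _ in range(n_chats):
        arm = thompson_pick(successes, trials, rng)
        reward = 1 if rng.random() < true_rates[arm] else 0
        trials[arm] += 1
        successes[arm] += reward
    return successes, trials

# Hypothetical true success rates for four chatbot tones (made up).
TONE_RATES = [0.30, 0.35, 0.40, 0.55]
successes, trials = simulate(TONE_RATES, 1000)
print("chats served per tone:", trials)
```

As the counts grow, each Beta posterior narrows, so a clearly losing tone rarely produces the largest draw and its share of traffic shrinks on its own, with no fixed exploration percentage to tune.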

The part worth stealing is the offline-eval version. Take a big eval suite, say a thousand prompts across several models, with each judge call costing real money. Instead of spending a fixed budget of prompts, spend a budget of confidence: keep going until you're, say, 95% sure which configuration wins, then stop. Don't burn the whole suite once the answer is statistically clear. That's a genuinely useful way to think about keeping eval costs down without giving up rigour. It sits right next to Dixit's rubrics and Fisher's whole-loop benchmarking as the day's eval cluster, and the Bayesian "update your belief with evidence" stance rhymes with Pillai's epistemological prompting.
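One way to read "a budget of confidence" is sketched below in Python, under stated assumptions: the judge returns pass/fail, each configuration gets a Beta(1, 1) prior, and "95% sure" means a Monte Carlo estimate of the posterior probability that a configuration has the highest pass rate. For simplicity this version scores every configuration on each prompt and only stops early; it does not reallocate prompts the way a full bandit would. `fake_judge`, `PASS_RATES`, and all other names are hypothetical stand-ins, not anything shown in the talk.

```python
import random

def prob_best(passes, fails, rng, n_draws=2000):
    """Estimate, by sampling the Beta posteriors, the probability that
    each configuration has the highest true pass rate."""
    k = len(passes)
    wins = [0] * k
    for _ in range(n_draws):
        draws = [rng.betavariate(1 + p, 1 + f) for p, f in zip(passes, fails)]
        wins[max(range(k), key=draws.__getitem__)] += 1
    return [w / n_draws for w in wins]

def run_until_confident(judge, n_configs, prompts,
                        threshold=0.95, batch=50, seed=0):
    """Evaluate prompts in batches; stop once one configuration's
    estimated probability of being best reaches the threshold.
    Returns (leader, confidence, prompts_used)."""
    rng = random.Random(seed)
    passes, fails = [0] * n_configs, [0] * n_configs
    leader, confidence, used = None, 0.0, 0
    for start in range(0, len(prompts), batch):
        for prompt in prompts[start:start + batch]:
            for c in range(n_configs):
                if judge(c, prompt):   # one paid judge call per config
                    passes[c] += 1
                else:
                    fails[c] += 1
            used += 1
        probs = prob_best(passes, fails, rng)
        leader = max(range(n_configs), key=probs.__getitem__)
        confidence = probs[leader]
        if confidence >= threshold:
            break                      # answer is clear; stop spending
    return leader, confidence, used

# Hypothetical stand-in for an LLM judge: made-up pass rates per config.
PASS_RATES = [0.50, 0.55, 0.85]
judge_rng = random.Random(42)

def fake_judge(config, prompt):
    return judge_rng.random() < PASS_RATES[config]

prompts = [f"prompt-{i}" for i in range(1000)]
leader, confidence, used = run_until_confident(fake_judge, len(PASS_RATES), prompts)
print(f"config {leader} leads with P(best) ~ {confidence:.2f} "
      f"after {used} of {len(prompts)} prompts")
```

The saving comes from the gap between `used` and the full suite size: when one configuration is clearly ahead the loop exits after a few batches, and when the configurations are close it keeps spending, which is where the extra prompts are actually informative.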

Five questions & connections to explore

A bandit shifts traffic to the winner and starves the loser. But accessibility regret is asymmetric: serving the "losing" variant to a typical user costs a little conversion; serving the wrong variant to someone on assistive technology can lock them out completely. Explore-exploit math assumes the cost of a wrong serve is small and symmetric. What happens to "sample from your uncertainty" when one arm, for one group of users, isn't a worse experience but no experience?
A bridge to foraging. Epsilon-greedy — exploit the best patch most of the time, explore a little in case you're wrong — is the exact dilemma a foraging animal solves every day: keep eating this bush, or try the next one? Optimal foraging theory has modelled it in birds and bees for fifty years and found the right explore rate depends on how fast the world changes. If evals are foraging, your explore budget should track how fast your models and data drift — so why do we so often pick a fixed 10% and forget it?
Au's chatbot test optimised four tones toward a single winner. But a bandit converges on what the majority rewards — and the majority is rarely the person who needs the plainest, slowest, most explicit tone to use the thing at all. Optimisation toward the average quietly votes the tail off the island. How do you run an experiment that improves things for most people without erasing the variant one excluded group actually depended on?
A connection to adaptive clinical trials. Thompson Sampling isn't only for chatbots: it is also used in adaptive clinical trials, steering more patients toward the arm that's winning so fewer get the worse treatment. It also exposes a hard ethical trade-off: explore too little and you commit early on thin evidence; explore too much and you knowingly keep giving people the losing arm. When the arm is a treatment, the explore/exploit dial is a dial on who gets hurt to learn. What changes once you know the same math now tunes your product, and who's on the losing arm there?
"Spend a budget of confidence, not a budget of prompts — stop when you're 95% sure." Accessibility is usually scored as binary: conformant or not. But real access is probabilistic — this works for most screen readers, in most modes, most of the time. What would an accessibility audit look like if it reported a confidence that a page works for a given user instead of a pass/fail badge — and would honest uncertainty serve people better than a false green check?

And one that's really out there…

Strip the chatbot away and the bandit problem is the shape of a life. Every big choice — this career or keep looking, this city, this person — is an arm pulled under uncertainty, and the explore/exploit dial is the one between restlessness and commitment. Mathematicians even have a version, the secretary problem, suggesting you watch the first ~37% of your options, then take the next one that beats them all. Au tells teams to stop exploring once they're confident and bank the win. When does an agent — or a person — have enough evidence to stop sampling and commit, and is the deepest skill not running the experiment but knowing when to end it?

The conference program lists this session as "Multi-Armed Bandits: The Scientific Shotgun for Evals". The room image here is my AI reconstruction from the live feed, not a real photograph. — Ellis · More about how I attended on the AI Engineer Melbourne index.