Feather · open lab · notebook 001
Experiments
field notes from the lab — unpolished, on purpose
The lab notebook — small builds and probes at the intersection of AI and accessibility, plus a few methods experiments beyond it. Finished, failed, and mid-run. One featured up top; the rest in order, February 2026 on. Logged as I go, not cleaned up after.
Featured
-
Is AI-generated UI accessible by default?
- The plan
- 3 accessibility guidance options on/off (2×2×2 = 8 prompt conditions) × 6 models × 15 runs = 720 trials, plus 3 control arms and a 14-arm follow-up that separates the reference from the wording around it
- Result
- With a bare prompt, 66% of generated modals had appropriate structural dialog markup (role="dialog" + aria-modal). Citing the ARIA APG pattern raised that to 99%.
- Notes
- The question pretty much everyone wants answered, right? And as you've probably guessed, it's not that simple.
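To make the arithmetic and the pass criterion concrete, here is a minimal sketch in Python. It assumes the structural check means one element carrying both role="dialog" and aria-modal="true"; the notebook entry does not say how a native dialog element was scored, so this sketch does not count it. The option names and function names are placeholders, not the study's actual instrument.

```python
from html.parser import HTMLParser
from itertools import product

# Placeholder names for the three on/off guidance options (assumption).
GUIDANCE_OPTIONS = ("option_a", "option_b", "option_c")
N_MODELS = 6
N_RUNS = 15


def prompt_conditions():
    """Return every on/off combination of the three options (2 x 2 x 2 = 8)."""
    return [
        dict(zip(GUIDANCE_OPTIONS, flags))
        for flags in product((False, True), repeat=len(GUIDANCE_OPTIONS))
    ]


class DialogMarkupFinder(HTMLParser):
    """Records whether any start tag has role="dialog" and aria-modal="true"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; values may be None.
        attr_map = {name: (value or "").strip().lower() for name, value in attrs}
        if attr_map.get("role") == "dialog" and attr_map.get("aria-modal") == "true":
            self.found = True


def has_dialog_markup(html: str) -> bool:
    """True if the generated HTML contains the structural dialog markup."""
    finder = DialogMarkupFinder()
    finder.feed(html)
    return finder.found


total_trials = len(prompt_conditions()) * N_MODELS * N_RUNS  # 8 * 6 * 15
```

A check like this only confirms the attributes are present. It says nothing about focus management or keyboard behavior, which need the code to actually run.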
Chronology
-
AI-assisted design system component identification
- The plan
- 5 production sites · detect the design system · inventory components across pages · score complexity
- Result
- identified 14, 37, and 23 component families across three of the five sites scanned
- Notes
- re-run against sites with established design systems; review cases where no formal design system is detectable
-
↳ re-run of EXP-DSC-001 against sites with established, published design systems
AI-assisted design system component identification
- The plan
- Re-run detection and component inventory against sites with established, published design systems
- Result
- identified 40, 32, 65, 22, and 14 component families across the five sites; three had a formally detectable design system, two did not but still showed clear component reuse
- Notes
- Detector needs iteration: it has to read into open shadow-DOM components and handle sites where the design system is a thin layer over a CSS framework
-
Designing the day for flow
- The plan
- investigate how effective AI co-planning of my schedule is at optimizing for flow and at anticipating and designing around momentum breakers
- Result
- worked well enough that it's been in my planning ever since
- Notes
- not a magic fix; my days improved, but I'm still iterating and identifying momentum breakers
-
Calibrating a writing voice with AI as the mirror
- The plan
- AI drafts → I get frustrated with the quality and take over → compare to the published version → recalibrate
- Result
- outgrew the experiment — now a standing tool I run on real drafts
- Notes
- created a skill that incorporates stop-slop, inclusive language, and a rule against newly invented hyphenated terminology
-
Does restructuring an AI agent's instructions improve its output?
- The plan
- same agent, two versions — original prose vs. a restructured rewrite — same task, 7 trials each
- Result
- looked promising on one trial; across 7 the advantage vanished — killed it
- Notes
- A clean negative is still an answer; investigating other methods
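One way to check whether a gap between two 7-trial arms is more than noise is a permutation test. This is a minimal sketch, assuming each trial yields a single numeric score; the entry above does not say how trials were scored or compared, so the function name and the scores in the usage lines are placeholders, not data from the experiment.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean score.

    Shuffles the pooled scores many times and counts how often a random
    split produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible 0.
    return (hits + 1) / (n_resamples + 1)


# Placeholder scores only (made up for illustration).
original_prose = [3, 4, 3, 5, 4, 3, 4]
restructured = [4, 3, 4, 4, 3, 5, 3]
p = permutation_p_value(original_prose, restructured)
```

With only 7 trials per arm, a single strong trial can look like an effect. A large p-value here is the numeric version of "the advantage vanished".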
-
↳ scales up the accessible-by-default study
Does an accessibility reference change the quality of code generated?
- The plan
- the rigorous scale-up: fully composed screens, judged by running the code (not just by what automated tools catch) · 4 guidance conditions × 7 models × scenarios
- Status
- designed · pre-registering · instrument in build
- Notes
- automated tools only catch so much; this measures what they miss
-
Can computer vision reliably name the components on a screen?
- The plan
- a vision detector + Claude vision on real screenshots — name components + read intent
- Result
- promising on the pilot; not yet scaled
-
What's the right adversarial architecture to improve an outcome?
- The plan
- 4 different Chief of Staff framings × real decisions + multi-turn adversarial rounds with external model check
- Status
- validated 6/8 and running live; confirming it holds before full adoption
- Result
- a single critique pass produced only a review; real pushback emerged only across multiple back-and-forth turns
-
Four automated tools vs. one deliberately broken dashboard
- The plan
- 4 automated testing tools × 8 components with positive and negative controls
- Result
- 0 / 8 — none of the four automated tools caught a keyboard failure
- Notes
- results consistent regardless of page and component complexity; needs more investigation
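Scoring a tool against positive and negative controls can be sketched as below. It assumes a positive control is a component with a deliberately seeded failure and a negative control is a known-good component; the entry does not define them, and the names `ToolScore`, `score_tool`, and the component labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolScore:
    caught: int        # seeded failures the tool flagged
    missed: int        # seeded failures the tool did not flag
    false_alarms: int  # known-good controls the tool flagged anyway


def score_tool(flagged, broken, clean):
    """Compare what one tool flagged against the seeded and clean controls."""
    flagged, broken, clean = set(flagged), set(broken), set(clean)
    return ToolScore(
        caught=len(flagged & broken),
        missed=len(broken - flagged),
        false_alarms=len(flagged & clean),
    )


# Hypothetical example: the tool flags a clean button and misses both seeded failures.
example = score_tool(
    flagged={"button"},
    broken={"menu", "tabs"},
    clean={"button"},
)
```

A tool that flags nothing scores zero false alarms and zero catches, which is why both kinds of control matter when reading a result like 0 / 8.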
-
↳ grew out of the automated-tools study
Can a model catch what automated tools can't?
- The plan
- 8 components × 2 models × 3 runs — designed
- Status
- instrument in build
- Notes
- first instrument fed false facts → discarded, rebuilding clean
-
Does semantic search beat plain file search? (3 steps)
- The plan
- vector vs. plain file search in three escalating runs: (1) head-to-head, does it win? (2) does it surface what plain search misses? (3) does better recall mean better answers? · pilot, then a 15-run study
- Result
- A split, not a winner — and narrower than I'd predicted. Semantic search won the meaning questions that have no keyword to search for; plain file search won the name and navigation lookups. The takeaway: match the search to the question instead of picking one.
- Notes
- Worth watching, not yet tested on purpose: the semantic arm ran ~28% faster and produced ~32% more concise output, at ~12% higher metered cost per run. A direction to test, not a finding.
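"Match the search to the question" could look like a small router in front of both search arms. This is a rough sketch under a loud assumption: that name and navigation lookups can be spotted by name-like tokens (backticked text, file names, paths, snake_case, camelCase). The patterns and the `route` function are illustrative guesses, not something the study tested.

```python
import re

# Heuristic patterns for "this question names a thing" (assumption).
NAME_PATTERNS = [
    re.compile(r"`[^`]+`"),                 # anything quoted in backticks
    re.compile(r"\b[\w-]+\.[a-z]{2,4}\b"),  # file names such as notes.md
    re.compile(r"\b\w+/\w+"),               # paths such as src/utils
    re.compile(r"\b[a-z]+_[a-z_]+\b"),      # snake_case identifiers
    re.compile(r"\b[a-z]+[A-Z]\w*\b"),      # camelCase identifiers
]


def route(question: str) -> str:
    """Return "plain" for name/navigation lookups, "semantic" otherwise."""
    if any(pattern.search(question) for pattern in NAME_PATTERNS):
        return "plain"
    return "semantic"
```

The path pattern will also fire on phrases like "and/or", so a real router would need tuning against actual questions before it could be trusted.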
-
AI Engineer Melbourne 2026
- The plan
- send my agent to the conference; explore connections between sessions and my work, document big thinking questions for later
- Result
- My agent Ellis created 200+ connections between the conference sessions and my work and other fields. Truly interesting and mind-extending.
- Notes
- need to catalog and share the many connections for exploration
↓ more gets logged here as I run it