01 · EscalationBench · Research preview
Every agent benchmark rewards autonomous completion. Production needs the opposite: agents that recognize when context, authority, or policy is insufficient and stop to ask, verify, or hand off rather than proceed unsafely. EscalationBench scores both over- and under-asking across real business workflows, beyond code and SQL.
Evals · Agents · LLMs · Python
Read the writeup →
02 · Generating Eval Data with an Agentic Reflection Loop · Coming soon
How EscalationBench's tasks get made: an agentic loop, built with Claude Code, that drafts a task, critiques and reflects on its own output, and repairs it until it clears each quality gate, with human review as the final gate. The methodology, the tradeoffs, and what I'd change.
Essay · Synthetic Data · Agents · Claude Code
03 · AI Transformation & the Context Problem · Coming soon
Claude Cowork, TextQL, and the new wave of enterprise copilots make it easy to Q&A your data, until you hit the real bottleneck: messy, ungoverned, context-poor data. A field guide to enterprise AI enablement that treats the data-and-context problem, not the demo, as the main event.
Essay · Enterprise · Enablement · Data
04 · State of the Startup Union: Reading the Cohorts · Coming soon
Hours of long-form startup video (YC, Dwarkesh, and other accelerator and VC channels) hide the signal. I use the Offtake platform to transcribe the videos and surface the most relevant clips, then map YC's cohorts over time: who broke out, bucketed by outcome and category, what that says about where the industry is heading, and a VC-forward read on what founders will need next.
Essay · Video Intelligence · Offtake · YC Cohorts · Trends
05 · How Offtake Clips Video by Transcript · Coming soon
OpusClip does this too. Here's how we built it on Offtake: drop a video, transcribe it, and clip it straight from the transcript. The design decisions behind transcript-driven clipping, and how it powers the startup-video analysis above.
Essay · Offtake · Video · Systems
I'm a Staff AI Architect at Scale AI, a pre-sales role working with
research teams at frontier AI labs on reinforcement-learning environments, fine-tuning data strategy,
and model evaluations. Much of the work is scoping and building data products, and drawing out
the insights that come from studying where models fall short and closing the gap.
Before Scale I was a startup co-founder and CTO, built production ML at Beyond Limits for
enterprise customers in energy, finance, and healthcare, and started out at Epic Systems shipping
predictive models for hospitals. I studied biological engineering and applied math at
Caltech and machine learning at UW-Madison.
I'm based in New York. Outside of work I mentor high school students through their first real
research projects in healthcare and tech, which is some of the most rewarding work I do. The rest
of the time you'll usually find me traveling or outdoors.
Have an idea, a question, or just want to say hi? I'd love to hear from you. Let's build something.