Three years ago Jonathan Frankle came on Invisible Machines as MosaicML’s chief scientist and walked us through data mixology—the Betty Crocker problem of training language models, the mini-cupcake ladder before you bake the cake, and why academic benchmarks like HellaSwag tell you almost nothing about what ChatGPT users actually want.

He returns now as Chief AI Scientist at Databricks with the same scientific temperament and a sharper diagnosis: we built nuclear fusion. We forgot the power lines.

Fusion Without Infrastructure

Frankle’s central metaphor in this episode is electrical, not theatrical. Large language models are an extraordinary source of intelligence—imperfect, expensive, still improving—but real. What we lack is the grid: specification, testing discipline, composable tools, and applications that connect that raw capability to ordinary intent.

“We’re in the Fortran days,” he says. “We really need to invent Java.” Or Pascal. Or C. Anything that gives practitioners a human-comprehensible way to describe what they want, edit it predictably, and verify that the system did it.

That is a design problem as much as a research problem. Benchmarks are a cop-out. Building an eval set is not the same as knowing your requirements. Frankle has tried automated specification tools himself—and finds most of them wanting. The hard work is translating intention into something testable before you argue about models at all.

The Same Answer on Context—Because the Science Hasn’t Changed

Robb opens by noting how advanced their 2023 conversation still feels: context windows, document dumps, the fantasy that you never have to train again. Frankle’s answer rhymes with what he said then, only louder.

Test it. Climb the ladder. Some people will never fine-tune again; many enterprises still must. Sometimes stuffing documents into a million-token window works beautifully. Sometimes performance gets worse as you retrieve more—more distractors, imperfect relevance judgments, a model that is still not omniscient.
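One way to make "test it" concrete is to ask the same question at several retrieval depths and see where the answer breaks. The sketch below is illustrative only: `sweep_retrieval_depth`, `toy_answer_fn`, and `DOCS` are hypothetical names, and the toy function stands in for a real model call. It is not anything Frankle or Databricks describes in the episode.

```python
from typing import Callable, Dict, Sequence

# Assumed interface: a function that takes a question plus retrieved documents
# and returns an answer string. A real system would call a model here.
AnswerFn = Callable[[str, Sequence[str]], str]


def sweep_retrieval_depth(
    answer_fn: AnswerFn,
    ranked_docs: Sequence[str],
    question: str,
    expected: str,
    depths: Sequence[int],
) -> Dict[int, bool]:
    """Ask the same question with the top-k documents for each k; record correctness."""
    results: Dict[int, bool] = {}
    for k in depths:
        answer = answer_fn(question, ranked_docs[:k])
        # Crude correctness check: does the expected string appear in the answer?
        results[k] = expected.lower() in answer.lower()
    return results


def toy_answer_fn(question: str, docs: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: echoes the LAST document mentioning 'founded'.

    This deliberately mimics a system that gets distracted by later passages.
    """
    hits = [d for d in docs if "founded" in d]
    return hits[-1] if hits else ""


DOCS = [
    "Acme was founded in 1999.",       # the relevant passage, ranked first
    "Acme sells anvils.",              # harmless filler
    "Bolt Corp was founded in 2005.",  # distractor that arrives at depth 3
]

if __name__ == "__main__":
    print(sweep_retrieval_depth(toy_answer_fn, DOCS, "When was Acme founded?", "1999", [1, 2, 3]))
```

With this toy setup the answer is correct at depths 1 and 2 and wrong at depth 3, once the distractor enters the context. A real sweep would use many questions and a better correctness check, but the shape of the experiment is the same.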

The multimodal case is part of why everyone is racing toward long context: an image is a lot of tokens; video is a hell of a lot of tokens. But the garbage-in-garbage-out lesson from mixology still holds whether the garbage arrives through pre-training, RAG, or prompt stuffing.

Prompts Are Parameters

One of the episode’s most useful reframes: if you are working only through Claude or GPT-4, you are still training; you are just optimizing natural-language parameters instead of weights.

Josh describes building skills in natural language on OneReach.ai’s platform while still wanting a visual back-end to verify what landed. Frankle sympathizes and admits the agent-native generation may leave him behind. Robb describes a fourth graphical UI whose entire purpose is constructing a written prompt: old media wrapped inside new media, Marshall McLuhan’s loop in product form.

The deeper point is continuity with computing history. Programs need specification, editable representation, and verification. Whether the “source code” is English, emoji, or algebra-as-prompt is secondary. The disciplines of software engineering—unit tests, integration tests, regression—do not exist yet for AI in any mature form. That is the blender sitting unplugged in the kitchen.
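As a rough illustration of what specification plus regression testing could look like when the "source code" is a prompt, here is a minimal sketch. Everything in it is an assumption made for the example: `Requirement`, `failing_requirements`, `SPEC`, and the two fake `system_v1` / `system_v2` functions are hypothetical stand-ins for a prompt-driven system before and after an edit, not an existing framework.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Requirement:
    """One testable statement of intent: an input and a check on the output."""
    name: str
    user_input: str
    check: Callable[[str], bool]


def failing_requirements(
    generate: Callable[[str], str],
    requirements: Iterable[Requirement],
) -> List[str]:
    """Run every requirement against the system and return the names that fail."""
    return [r.name for r in requirements if not r.check(generate(r.user_input))]


# Hypothetical spec for a support assistant, written before arguing about models.
SPEC = [
    Requirement("states_refund_window", "Can I return this?", lambda out: "30 days" in out),
    Requirement("stays_short", "Can I return this?", lambda out: len(out.split()) <= 20),
]


def system_v1(user_input: str) -> str:
    """Stand-in for the system under the original prompt."""
    return "Yes. Returns are accepted within 30 days of delivery."


def system_v2(user_input: str) -> str:
    """Stand-in for the system after a prompt edit that silently dropped a fact."""
    return "Yes, you can return most items."


if __name__ == "__main__":
    print("v1 failures:", failing_requirements(system_v1, SPEC))
    print("v2 failures:", failing_requirements(system_v2, SPEC))
```

Here the first version passes every requirement and the second fails `states_refund_window`, which is the kind of regression a prompt edit can introduce without anyone noticing. String matching is a deliberately simple check; real requirements would likely need richer judgments.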

Published for Machines, Not Just Humans

The conversation closes on a thread that will matter to every brand and every enterprise content team: text published on the internet is no longer written only for human readers and search crawlers. It is training signal—directly or indirectly—for the next generation of models and agents.

Frankle expects an LLM optimization cottage industry, analogous to SEO. Static FAQ pages, curated canonical facts, deliberate separation of stable truth from tentative draft content—all of it becomes strategic. Unlock the PDFs, yes, but curate before you ingest. The mistake Robb names is familiar: dump decades of internal documents, including wrong answers buried in unread files, and suddenly your agent manufactures falsehood at scale.

Ideally we would separate knowledge from reasoning and plug fresh brand guidelines into a faithful reasoner. Frankle thinks we are far from that clean split—and until we get there, organizations live inside trillion-parameter uncertainty.

Back to the Mixologist

Read alongside our 2023 summary of Frankle’s first visit, this episode is the same scientist at a new scale. Mosaic’s mix-and-bake playbook now informs Databricks’ customer conversations with twelve thousand enterprises. The metaphor has been upgraded from cupcakes to fusion, but the method is unchanged: measure what success looks like, start small, do not trust leaderboard scores, and remember that impact beats citation counts.

Frankle still measures his life partly by a pre-PhD report on police use of facial recognition that changed law—not by the lottery ticket hypothesis paper that made grad students excited. That value system shows up in how he talks to practitioners: less hype, more experiment design; less “AGI tomorrow,” more “what blender are we building next?”

Open the Ideation hub, or read the full episode transcript.