In 2023 Jonathan Frankle, then co-founder and chief scientist of MosaicML, joined Invisible Machines as a guest, months before Databricks acquired MosaicML for $1.3 billion. The conversation became one of the show’s most practical episodes on what it actually takes to train and deploy language models inside enterprises.
This article summarizes that visit. It is not a transcript. For the full S7E11 return conversation with Frankle as Chief AI Scientist at Databricks, see Nuclear Fusion, No Power Lines.
Everyone Is a Scientist Now
Frankle’s through-line then and now: the basic laws of machine learning still apply, and a model reflects the data it was trained on. Train on Reddit and you get Reddit. Train on scientific papers and you get Galactica: brilliant inside its domain, brittle outside it. Pre-training, instruction tuning, and RLHF each bend the model toward human preferences, sometimes at the cost of calibration.
His prescription was never “trust the vendor.” It was “climb the ladder.” If prompting works, stop. If a vector database truly solves retrieval, great. If not, fine-tune a little, then a lot, then pre-train if you must. Some cases, such as multilingual models, often require training from scratch. The ladder is the framework.
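The ladder can be read as a cheapest-first search: try each rung in order and stop at the first one that clears your quality bar. The sketch below is only an illustration of that reading, not Frankle's or Mosaic's tooling. The rung names paraphrase the article, while `climb_ladder`, the `evaluate` callback, the target threshold, and every score are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Rungs ordered from cheapest to most expensive, paraphrasing the article.
LADDER: Tuple[str, ...] = (
    "prompting",
    "retrieval (vector database)",
    "light fine-tuning",
    "heavy fine-tuning",
    "pre-training from scratch",
)


def climb_ladder(
    evaluate: Callable[[str], float],
    target: float,
    rungs: Sequence[str] = LADDER,
) -> Optional[Tuple[str, float]]:
    """Return the first (cheapest) rung whose score meets the target.

    `evaluate` is a hypothetical callback that runs your own task metric
    for a given approach and returns a score where higher is better.
    Returns None if no rung reaches the target.
    """
    for rung in rungs:
        score = evaluate(rung)
        if score >= target:
            return rung, score  # stop climbing as soon as something works
    return None


# Made-up scores, for illustration only.
example_scores = {
    "prompting": 0.62,
    "retrieval (vector database)": 0.74,
    "light fine-tuning": 0.83,
    "heavy fine-tuning": 0.88,
    "pre-training from scratch": 0.90,
}

choice = climb_ladder(lambda rung: example_scores[rung], target=0.80)
print(choice)
```

With these invented numbers the search stops at light fine-tuning, even though the higher rungs would score better. Once the bar is met, the extra cost of climbing further is not justified.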
The Mixologist Exercise
The episode’s memorable set piece is a live quiz. Frankle offers four datasets (scientific papers, GitHub code, web scrape, Wikipedia) and asks Robb and Josh for mixing proportions. Equal splits? Risky. Wikipedia is tiny relative to the web; repeat it too often and the model memorizes it. Too much code can dominate a general-purpose model that does not need it.
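The repetition risk is easy to see with back-of-the-envelope arithmetic: a source's share of the token budget, divided by its size, gives the number of passes the model makes over it. The following is a minimal sketch of that calculation. The dataset sizes and token budget are invented round numbers, not figures from the episode, and `epochs_per_source` is a hypothetical helper.

```python
from typing import Dict


def epochs_per_source(
    sizes_tokens: Dict[str, float],
    weights: Dict[str, float],
    budget_tokens: float,
) -> Dict[str, float]:
    """Passes over each source implied by a sampling mix.

    epochs = budget * (weight / sum of weights) / source size
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        name: budget_tokens * (weights[name] / total_weight) / size
        for name, size in sizes_tokens.items()
    }


# Hypothetical sizes in billions of tokens (illustrative only).
sizes = {"papers": 50.0, "code": 200.0, "web": 1000.0, "wikipedia": 5.0}
budget = 400.0  # hypothetical training budget, billions of tokens

# An equal split gives every source a quarter of the budget.
equal_mix = {name: 1.0 for name in sizes}
equal_epochs = epochs_per_source(sizes, equal_mix, budget)

for name, epochs in equal_epochs.items():
    print(f"{name}: {epochs:.1f} passes")
# papers: 2.0 passes
# code: 0.5 passes
# web: 0.1 passes
# wikipedia: 20.0 passes
```

Under these assumed sizes, an equal split shows the model the small Wikipedia corpus twenty times while it sees only a tenth of the web scrape. That is the memorization risk the quiz is meant to expose.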
Frankle played “data mix master” for Mosaic’s own releases, deliberately taking accountability so his CEO would yell at him, not his team, if the blend failed. His honest conclusion: without metrics that reflect real use, you de-risk by copying sensible peers, praying, and baking mini-cupcakes (small trial runs) before the full cake.
Benchmarks vs. Reality
He screen-shared HellaSwag—a legitimate academic benchmark—and read a multiple-choice commonsense question about a woman, a bucket, and a dog avoiding a bath. Useful for research. Not how people use ChatGPT.
The chip-fabrication analogy landed hard: training a large model is so expensive that you simulate, stage, and verify before tape-out—then say a prayer. GPT-4-scale training with human evaluators saying “this sucks” afterward would be catastrophic capital destruction. Measurement discipline is not optional; it is the business.
Betty Crocker for LLMs
Mosaic’s product metaphor was the easy-bake mix. Enterprises bring the egg and water (their proprietary data and domain choices) while Mosaic supplies a base that will not poison the batter. Start with mini-cupcakes, then cupcakes, then the wedding cake. Replit training a state-of-the-art 3B code model in three days on Mosaic tooling was the proof point Frankle was proudest to cite.
Talent, Lock-In, and Balance
On talent: hyperscaler pedigree is BS for most roles; brilliance is distributed; the rare people who repeatedly ship in AI are scarce but not confined to DeepMind alumni. On lock-in: GPT-4 gives you no weights, no portability, and synthetic-data terms of service that require a lawyer. Open weights and owned models let you distill, migrate, and experiment.
On philosophy: Frankle is an incrementalist and a balance-seeker—AGI will matter, it will not kill us tomorrow; use GPT-4 and custom models; websites will not vanish but agents will matter; extremes sell funding decks but mislead buyers.
On impact: he still ranks a facial-recognition policy report that changed law above his lottery-ticket dissertation, however many citations the latter has drawn. Science should leave the world different, not just excite other scientists.
Why It Still Matters
Every theme in that 2023 hour—mixology, evaluation, ladder-climbing, crown-jewels data, curator-not-dumper—shows up again in Frankle’s 2025 return as infrastructure metaphors. The vocabulary changed from cupcakes to fusion; the obligation to do rigorous, humble, small-bet science did not.