LLMs: good at words, bad at math

Joe DosSantos on why agentic AI needs a governed source of truth first, how LLMs handle language versus ledger facts, and why the semantic layer is where enterprises reconcile B2C-style inference with B2B non-negotiables.

If your organization wants agentic workflows that do not dissolve into arguments in Slack, the boring prerequisite is not a flashier model. It is canonical knowledge—a governed layer where facts, definitions, and ownership are explicit enough that software (and people) can agree on what “true” means for the business. In this Invisible Machines conversation, Workday’s Joe DosSantos meets Josh Tyson and Robb Wilson where enterprises actually stumble: they can spin up an LLM overnight, but they cannot reliably answer “what was revenue last quarter by product and region?” with one number everyone accepts.

DosSantos frames the tension as a collision between two decades of progress. Modern language models are extraordinary at siphoning signal from public, messy text; they are trained on a world where “truth on the internet” is contested by design. Enterprise operations run on a different species of truth—calculated balances, effective dating, policy exceptions, and audit trails. As he puts it in the episode, LLMs are good at words and bad at deterministic math when you treat them as oracles instead of interfaces. The governance muscle you built for warehouses and ledgers did not disappear; it was simply bypassed while teams chased demos.

First align on facts, then let inference decorate the top.

That is why the Morgan Stanley story Josh Tyson raises is more than an anecdote about early OpenAI collaboration. Building an advisor-grade source of truth took labor: time-to-live on documents, stewardship by role, reconciliation when sources disagree. None of that work goes away just because a model can summarize paragraphs. If anything, summarization without canonical rails multiplies the meeting tax: ten people arrive with ten spreadsheets that all feel authoritative.

DosSantos returns repeatedly to implicit versus explicit workloads. Generative models shine when many valid interpretations exist—the B2C pattern where taste and context rule. B2B decisions often have a right answer in the ledger: revenue, churn, headcount, territory boundaries. Robb Wilson’s engine-room metaphor lands here: an LLM in an agentic stack is a component that converts between implicit and explicit representations. The design job is to route questions to the right substrate—language for exploration, systems of record for facts—instead of asking a probability engine to vote on arithmetic.

The semantic layer is where that routing becomes operable. In the discussion, YAML-flavored configuration and human-readable contracts surface as practical glue: enough structure for machines to call the right tool, enough narrative for humans to negotiate meaning. DosSantos ties this back to classic governance—not nostalgia for 1990s committees, but the recognition that standards, ownership, and change control are how enterprises scale trust. Protocols like MCP help systems talk; canonical models explain what they are allowed to say about money, people, and customers.
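One way to picture that routing is a small registry of metric contracts. The sketch below is illustrative only: the contract fields (`definition`, `owner`, `tool`), the metric and tool names, and the `route` function are hypothetical and are not drawn from the episode or from any Workday product. It assumes the contracts would normally live in a YAML file and shows them here as plain Python data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContract:
    name: str
    definition: str  # human-readable meaning that people can negotiate
    owner: str       # steward accountable for changes to the definition
    tool: str        # system-of-record tool an agent should call


# Hypothetical contracts; in practice these might be loaded from YAML.
CONTRACTS = {
    contract.name: contract
    for contract in [
        MetricContract(
            name="revenue",
            definition="Recognized revenue by product and region, per fiscal quarter",
            owner="finance",
            tool="ledger.revenue_by_quarter",
        ),
        MetricContract(
            name="headcount",
            definition="Active employees as of a given date",
            owner="people-analytics",
            tool="hr.headcount_as_of",
        ),
    ]
}


def route(question: str) -> str:
    """Return the governed tool for a known metric, else fall back to the model."""
    words = {word.strip("?.,!").lower() for word in question.split()}
    for name, contract in CONTRACTS.items():
        if name in words:
            return contract.tool
    # No governed metric mentioned: treat it as an exploratory language task.
    return "llm.explore"
```

Keyword matching is a deliberately crude stand-in for real intent detection. The point of the sketch is the shape of the contract: a question that names a governed metric goes to a system of record, and everything else goes to the language model.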

Workday’s hypothesis in the episode is deliberately mundane and therefore useful: before you chase clever agents, confirm the problem statement against shared facts. Once attrition, pipeline risk, or margin pressure is anchored in one version of history, interpretation becomes leverage instead of theater. Agents that draft briefings, compare scenarios, or narrate deltas sit on top of that bedrock rather than inventing parallel realities.

Robb Wilson’s line that “facts deserve code” is a practitioner shibboleth: when a figure has to survive audit, the path of record should be re-executable—notebooks, transformations, and signed pipelines—not a persuasive paragraph pasted from a chat window. DosSantos walks through sales territories as a grounded example: when you split Germany between financial services and insurance accounts, the map of ownership moves underneath the metrics, so canonical truth is not only the current territory polygon but the versioned history that lets you compare performance without lying about the past.
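The territory example can be sketched as effective-dated records, where every assignment carries the period in which it was true. This is a minimal illustration, not DosSantos's or Workday's implementation: the account names, dates, and the `territory_as_of` helper are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TerritoryAssignment:
    account: str
    territory: str
    valid_from: date                 # inclusive start of the assignment
    valid_to: Optional[date] = None  # exclusive end; None means still current


# Hypothetical history: one Germany territory is split on 2024-01-01.
HISTORY = [
    TerritoryAssignment("Acme Insurance", "Germany", date(2023, 1, 1), date(2024, 1, 1)),
    TerritoryAssignment("Acme Insurance", "Germany Insurance", date(2024, 1, 1)),
    TerritoryAssignment("Nord Bank", "Germany", date(2023, 1, 1), date(2024, 1, 1)),
    TerritoryAssignment("Nord Bank", "Germany Financial Services", date(2024, 1, 1)),
]


def territory_as_of(account: str, as_of: date) -> Optional[str]:
    """Return the territory that owned the account on the given date."""
    for row in HISTORY:
        if row.account != account:
            continue
        if row.valid_from <= as_of and (row.valid_to is None or as_of < row.valid_to):
            return row.territory
    return None  # no assignment recorded for that date
```

Asking `territory_as_of("Acme Insurance", date(2023, 6, 30))` returns the pre-split territory, while the same call with a 2024 date returns the insurance territory. Keeping both rows is what lets a year-over-year comparison state which map it is using.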

That is where ROI debates stop being hand-waving. If you cannot separate implicit workloads (language, ideation, what-if narrative) from explicit workloads (balances, headcount, inventory positions), you will fund the wrong automation: glossy assistants that argue while the spreadsheet war continues. Wire the model to tools and contracts that return governed answers, log access, and enforce policy; let language models handle the surrounding sensemaking once the ledger agrees on the nouns.
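As a rough sketch of what "governed answers, log access, and enforce policy" could look like in code, the function below wraps a lookup with a role check and an audit entry. Everything here is assumed for illustration: the `LEDGER` figure is a placeholder rather than real data, and the role names, log format, and `governed_metric` function are hypothetical.

```python
from datetime import datetime, timezone

# Placeholder figure; a real tool would query the system of record.
LEDGER = {("revenue", "2024-Q1"): 1_250_000}

# Hypothetical policy: which roles may read which metric.
ALLOWED_ROLES = {"revenue": {"finance", "executive"}}

# In-memory audit trail; a real system would write to durable storage.
ACCESS_LOG = []


def governed_metric(metric: str, period: str, user: str, role: str) -> int:
    """Return a governed figure, recording every attempt and enforcing policy."""
    allowed = role in ALLOWED_ROLES.get(metric, set())
    ACCESS_LOG.append(
        {
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "metric": metric,
            "period": period,
            "allowed": allowed,
        }
    )
    if not allowed:
        raise PermissionError(f"role {role!r} may not read {metric!r}")
    if (metric, period) not in LEDGER:
        raise LookupError(f"no governed value for {metric!r} in {period!r}")
    return LEDGER[(metric, period)]
```

An agent that calls a tool like this gets either the one governed number or an explicit refusal, and both outcomes leave a log entry. The language model never computes the figure itself; it only narrates what the tool returned.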

The back half of the conversation widens to augmentation, customer experience, and Kate Darling–style framings of non-human partners. That is useful cultural context, but the through-line for operators remains the same. If models compress the cost of language, the scarce asset becomes agreement: on metrics, on scope, on when a human must remain in the loop. DosSantos closes the loop that many AI programs leave open: technology that improves connection still needs institutions that can point to the same numbers while they debate what those numbers mean.

Listen to the full episode on the podcast hub, or watch it on YouTube.