Model on the Metal: Harness as the New Enterprise Paradigm 

  • Blog
  • 12 minute read
  • April 07, 2026

2022 was the year large language models entered mainstream consciousness with a bang. It is 2026, and a realisation is settling in – unevenly, reluctantly, but unmistakably: for a rapidly growing number of tasks, you do not need software in the traditional sense.

In this article, we argue that the dominant enterprise AI architecture is converging on a minimal pattern – a frontier model running in a loop with direct access to tools and file-based memory – and that this “harness” paradigm supersedes the pipeline and agent-graph approaches that preceded it. We trace the technical and organisational trajectory from 2022 to 2026, examine how governance is not weakened but strengthened – with the standard operating procedure itself becoming the machine-executable instruction – and outline how this reshapes the enterprise. At PwC, we are already building on this foundation – including multi-model collaboration through our Enterprise Instinct platform in Canada. 

The quiet revolution 

Software has become malleable – written on demand, on the fly, by the model itself.  No graphical user interface, no dashboard, no workflow builder, no app. The model is the interface. Frontier models have achieved a scale at which, as long as you keep them running and give them access to the right systems, they can solve a task of arbitrary complexity – iterating, recovering from failed attempts, and course-correcting on their own. 

This has profound implications for how enterprises build, buy, and think about technology. The paradigm is shifting – as significantly as when the spreadsheet transformed accounting, or the database transformed record-keeping. But the shift is not just technical. It demands a different way of thinking about what to automate, how to govern it, and how to reorganise around it. To understand why, we need to walk through how we got here – because the path was anything but straight. 

The timeline everyone lived through

In 2022, ChatGPT became the fastest-adopted technology in history. The shock of having a computer talk back at us intelligently was profound. At first, enterprises were cautious – ‘can we rely on these alien, external systems?’ – but soon every boardroom wanted an AI strategy, and every consultancy scrambled to define one. It became immediately apparent that feeding a model additional text to synthesise – retrieval-augmented generation, or RAG – would be immensely practical in a wide variety of settings. The models were good, but not good enough. Experimentation had begun. 

By 2023 and into 2024, excitement grew around “integrations”. OpenAI introduced function calling – the ability for a model to not just generate text, but to invoke structured operations: query a database, call an API, trigger a workflow. A moment when the industry shifted from fascination to engineering. Frameworks like LangChain emerged, promising to wire language models into existing systems through chains of pre-defined steps. The air was thick with proofs of concept. Great blog posts. Enterprises ran pilots. The architecture of choice was the pipeline: prompt in, structured steps, result out. 

In parallel, a more ambitious vision took hold. OpenAI released its Assistants API. Agent swarms held attention for a while – divide and conquer any task by letting many agents hand off sub-tasks to one another. LangGraph appeared, offering developers the ability to build complex multi-agent systems with branching logic and shared state. The prevailing wisdom was clear: to achieve complex outcomes, you needed complex orchestration. You needed an agent graph – nodes and edges, each one a model doing a sub-task, passing results forward. The industry built elaborate scaffolding around what were, at their core, text-prediction engines. 

Then 2025 arrived. And something broke – in the best possible way. 

The breaking point 

The frontier models of 2025 crossed a threshold. They became capable enough, and reliable enough, that the scaffolding started to feel like handcuffs. Developers noticed something counterintuitive: the simpler the wrapper around the model, the better it performed. Strip away the orchestration graph. Remove the rigid step-by-step pipelines. Just give the model a task, a set of tools, and let it run. It would figure out the rest. 

This was not a theoretical insight. It was an empirical one, discovered by the engineers themselves building real systems. The architecture that emerged has come to be known as the harness – and it is, we believe, the winning pattern for the next era of enterprise technology. 

The framework tax

To appreciate why the harness has emerged as the winning architecture, it helps to understand what came before. 

In what we might call Phase I of enterprise AI – roughly 2023 to early 2025 – the dominant approach was to build abstractions on top of abstractions. The model was treated as an unreliable component that needed to be contained, channelled, and supervised by external logic. We built orchestrator engines. We built agent graphs with nodes and edges. We defined rigid flows – predefined pipelines, what engineers call DAGs, directed acyclic graphs – where data flows in one direction through a fixed sequence of steps: if the model says X, go to step Y; if confidence is below threshold Z, escalate to a human. Every edge case meant rewiring the graph. Every new requirement meant a new node. 

The reasoning was sound at the time. The models of 2023 were unreliable. They hallucinated. They lost track of instructions. They needed rails. And so the industry built them – intricate multi-agent architectures with dozens of nodes, carefully defined handoff logic, and deterministic routing. The result was a new class of application where the orchestration layer dwarfed the model itself. Change one business rule and you were rewiring the graph. Add a new data source and you were re-architecting the pipeline. The framework became the product, and the model became a replaceable component within it. 

But as the frontier models scaled up through 2024 and 2025, an uncomfortable truth emerged: the framework had become a tax on the model’s abilities. The models were now capable of handling the complexity we had been offloading to the orchestrator. Worse, the rigid structure of the graph was preventing the model from exercising its full capability. We had built a cage for an animal that had outgrown it. 

Model on the metal

The harness is the opposite philosophy. It is the thinnest possible layer between a frontier model and a real operating environment – file system, terminal, APIs, databases. No orchestration framework. No agent graph. No directed acyclic anything. Given an instruction and a task, the model designs its own workflow. It is called in a loop, equipped with memory and whatever system access it needs, and left to work – iterating, recovering from missteps, and course-correcting until it reaches the goal. 

That is the entire architecture. 

Consider what this looks like in practice. A harness implementation is, at its core, a remarkably small piece of software. A system prompt – written in markdown. A set of tool definitions: the ability to read and write files, execute shell commands, search the web, query a database. And a loop that calls the model, executes whatever tools it requests, feeds the results back, and repeats. The entire execution engine can fit in a few hundred lines of code. There is no graph to maintain, no workflow to rewire, no orchestration layer to debug. 
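That execution engine can be sketched in a few dozen lines. In the sketch below the model is mocked with scripted turns so the loop mechanics are visible end to end – `call_model`, the message format, and the tool names are all illustrative assumptions, not any particular vendor’s API:

```python
import subprocess
import tempfile
from pathlib import Path

# Minimal harness sketch. `call_model` stands in for any frontier-model
# API; here it returns scripted turns so the loop runs self-contained.
WORKDIR = Path(tempfile.mkdtemp())

def run_shell(cmd: str) -> str:
    """Tool: execute a shell command in the workspace, return its output."""
    done = subprocess.run(cmd, shell=True, cwd=WORKDIR,
                          capture_output=True, text=True)
    return done.stdout + done.stderr

def read_file(path: str) -> str:
    """Tool: read a file from the workspace."""
    return (WORKDIR / path).read_text()

TOOLS = {"run_shell": run_shell, "read_file": read_file}

# Scripted model turns: request two tools, then declare the task done.
_turns = iter([
    {"tool": "run_shell", "args": {"cmd": "echo 42 > answer.txt"}},
    {"tool": "read_file", "args": {"path": "answer.txt"}},
    {"done": True, "result": "The answer is 42."},
])

def call_model(messages):
    return next(_turns)  # a real harness would call the model API here

def harness(task: str) -> str:
    """Call the model in a loop, run requested tools, feed results back."""
    messages = [{"role": "user", "content": task}]
    while True:
        reply = call_model(messages)
        if reply.get("done"):
            return reply["result"]
        observation = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": observation})

result = harness("Work out the answer and write it to a file.")
```

Everything else – which tools exist, what the model may do with them – lives in the prompt and the tool definitions, not in orchestration code.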

The genius of the harness is in its primitives. Markdown. YAML. Plain text files arranged in a structured hierarchy – instructions, configuration, memory – that the model can traverse, update, and iterate on just as a developer navigates a codebase. A file system exposed to the model is a superior replacement for a database or a vector store: the native medium of a system that was trained on text. Give a model a directory of well-organised files and it will orient itself the way you orient yourself in a familiar folder structure – instantly, intuitively, and with full context. 
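A minimal workspace along these lines might look like the following – the layout and file names are purely illustrative, not a standard:

```
workspace/
├── INSTRUCTIONS.md        # system prompt: who the model is, what it may do
├── policies/
│   └── invoice-review.md  # the SOP, readable by humans and model alike
├── config/
│   └── tools.yaml         # tool definitions and access scopes
└── memory/
    ├── state.yaml         # durable working state between loop iterations
    └── 2026-04-07.log.md  # reasoning trail, one file per run
```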

This is to be expected. A recent paper explores how language models perceive text spatially. Investigating how Claude predicts line breaks in fixed-width text, the researchers found that the model builds, from nothing but next-token prediction, an internal geometric map of where it is on a line – tracking character position as a helical manifold in high-dimensional space. The researchers draw an explicit parallel to biology: bats develop echolocation, migratory birds sense magnetic fields, Arctic reindeer shift their ultraviolet vision seasonally.

Each species evolves the perceptual apparatus its environment demands. Language models, whose environment is text, have evolved an innate sense of spatial structure – line position, indentation, the rhythm of headers and bullets, the geometry of a markdown document. This is why the interface between human intent and machine execution does not need to be code, or a complex schema, or a visual workflow builder. It can be a pure text file referencing other text files in a logical structure. And that is exactly what the harness exploits. 

This has a second-order consequence that is easy to miss. If one model can navigate a file system with native fluency, so can two. Or ten. The same text-based structure that gives a single harness its memory – markdown instructions, YAML state files, structured logs – becomes a shared memory when multiple models need to collaborate. One model writes its findings to a file. Another reads it, picks up where the first left off, and continues. No handover protocol. No serialised tool-call state passed between agents. No confused context windows trying to reconstruct what a prior agent did from a summary. Just files. Breadcrumbs left in plain text that any model can read as naturally as the first one wrote them. This is why agent teams built on a file-based memory layer work – and why the earlier generation of multi-agent swarms, which relied on rigid handoff logic and tool-call chains, so often did not. The coordination medium is the same medium the models already think in. 
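The handoff-by-file pattern is almost embarrassingly simple to demonstrate. In this toy sketch both “agents” are mocked as plain functions – in a real system each would be its own harness loop around a model – and the only shared state is the directory itself:

```python
import tempfile
from pathlib import Path

# Toy illustration of file-based coordination between two agents.
# Both agents are mocked; the coordination medium is just files.
shared = Path(tempfile.mkdtemp())

def research_agent():
    """First agent: analyse and leave findings as plain markdown."""
    (shared / "findings.md").write_text(
        "# Findings\n- Supplier X invoices drift 3.1% above contracted rates\n"
    )

def drafting_agent() -> str:
    """Second agent: read the breadcrumbs and continue the work."""
    notes = (shared / "findings.md").read_text()
    report = "# Draft report\n\nBased on prior analysis:\n\n" + notes
    (shared / "report.md").write_text(report)
    return report

research_agent()
report = drafting_agent()
```

No handoff schema, no serialised state – the second agent reads the first one’s output the same way it reads any other file.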

The developer community recognised this almost immediately. Throughout 2025, the tools that gained the most traction were not the complex multi-agent frameworks. They were the harnesses: minimal shells around frontier models that trusted the model to drive the process. And the applications that emerged were telling. Freelancers who had been stitching together invoicing with macros and manual data pulls – logging into AWS, downloading cost reports, cross-referencing hours – now maintained an instructions sheet, gave the model API access, and asked. The way you would ask a secretary. On Twitter, solo operators shared examples where a model connected to their CRM, pulled leads, drafted outreach sequences, and logged responses. No GUI. No drag-and-drop workflow builder. Not even a no-code tool. Just a person who knew what they wanted, knew which services had APIs, and described the workflow in plain language. The model handled the rest – connecting to the APIs, executing the steps, recovering when something broke. What used to require a pipeline of integrations now required a paragraph of instructions. 

What this means for the enterprise 

The implications for enterprise and knowledge work are significant and, we believe, positive. 

Consider a concrete example. A business needs to review incoming invoices against supplier contracts – verifying line items, checking pricing against agreed rates, flagging discrepancies, and routing exceptions for approval. This is bread-and-butter enterprise work. 

Under the old paradigm, you would build a DAG. Step one: extract data from the invoice PDF. Step two: query the contract database. Step three: compare line items against contractual terms. Step four: apply a set of business rules – if the variance exceeds 5%, flag for review; if the supplier is on a preferred list, apply a different threshold. Step five: route the result to the appropriate approver. Every step hard-coded. Every rule crystallised in application logic. When management decides that the 5% threshold should be 3%, or that a new category of spend requires a different approval chain, a developer rewrites the pipeline. Slow and costly. 
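The pipeline above can be caricatured in a few lines – the suppliers, thresholds, and stubbed steps are invented for illustration, but the point is where the business logic lives:

```python
# Phase I sketch: every rule crystallised in application code.
# Extraction and database steps are stubs; the routing logic is the point.
PREFERRED_SUPPLIERS = {"ACME GmbH"}
DEFAULT_THRESHOLD = 0.05      # management wants 3% instead?
PREFERRED_THRESHOLD = 0.08    # -> a developer edits and redeploys this file

def extract_invoice(pdf_bytes):            # step 1: stubbed PDF extraction
    return {"supplier": "ACME GmbH", "lines": [("widgets", 102.0)]}

def fetch_contract(supplier):              # step 2: stubbed database query
    return {"widgets": 100.0}              # contracted unit rates

def review(pdf_bytes):
    invoice = extract_invoice(pdf_bytes)
    contract = fetch_contract(invoice["supplier"])
    threshold = (PREFERRED_THRESHOLD
                 if invoice["supplier"] in PREFERRED_SUPPLIERS
                 else DEFAULT_THRESHOLD)
    flags = []
    for item, price in invoice["lines"]:              # step 3: compare
        variance = (price - contract[item]) / contract[item]
        if variance > threshold:                      # step 4: business rule
            flags.append((item, variance))
    return "escalate" if flags else "auto-approve"    # step 5: routing

outcome = review(b"%PDF...")
```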

Under the harness paradigm, the architecture looks radically different. The model receives a set of instructions – in plain text, written by the business – that describe the review policy. It has access to tools: a PDF reader, a contract database query, an approval workflow API. When an invoice arrives, the model reads it, pulls the relevant contract, compares the terms, and applies the policy as described in its instructions. If something is ambiguous, it flags it. If a line item does not match any contractual category, it says so and explains why. 

When management changes the policy, someone edits a text file. No developer required. No deployment pipeline. No regression testing of a rebuilt application. The instructions change, and the model’s behaviour changes with them – immediately. 
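Such an instruction file might look like the following – a hypothetical policy, with thresholds and file references invented for illustration:

```markdown
# Invoice review policy (illustrative)

1. Read the incoming invoice PDF and extract all line items.
2. Pull the matching supplier contract from the contracts database.
3. Compare each line item's unit price against the contracted rate.
4. If the variance on any line exceeds 3%, flag the invoice for review.
5. Preferred suppliers (see suppliers/preferred.yaml) use a 5% threshold.
6. Route flagged invoices to the category approver. Never approve directly;
   always propose and wait for human sign-off.
7. If a line item matches no contractual category, flag it and explain why.
```

Changing the threshold from 5% to 3% is a one-character edit to line 4.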

But what about governance?

Let us address this directly. Enterprises have spent billions over the past fifteen years building compliance and audit frameworks around linear processes – every step documented in a flowchart, every decision node traceable, every handoff recorded. Then they spent billions more translating those standard operating procedures into software – hardcoding the SOP into application logic so it could be enforced and audited digitally. The while loop feels like it breaks that contract. If the model is choosing its own path, how do you audit it? 

How do you govern what you cannot predict? 

But here is what most people miss: the SOP is the harness instruction. The same document that a compliance team writes to describe a process – the policy, the rules, the exception criteria – is, almost verbatim, the instruction file the model follows. There is no translation step. No developer reinterpreting business intent into code, no semantic friction. The process description and the process execution are the same artefact. This is not a minor convenience. It collapses the most expensive, error-prone gap in enterprise IT: the distance between what the business means and what the software does. 

Governance is not bolted on after the fact. It is engineered into the tool layer itself. Three mechanisms make this concrete. 

  • First: tool-tier gates. Not all tools are equal. A read operation – querying a database, pulling a document – executes freely and is logged. A low-stakes write – drafting an email, updating a field – executes and is sampled for review. But a high-stakes action – approving an invoice, modifying a contract, transferring funds – does not execute at all. The tool blocks. The model proposes the action, explains its reasoning, and waits. A human reviews, approves or rejects, and only then does the loop continue. The model literally cannot approve its own invoice. The gate is in the tool definition, not in the model’s behaviour. 
  • Second: headless execution with proposal queues. The harness does not need a human watching in real time. It can run in batch mode – overnight, on a schedule – processing hundreds of items and producing a structured proposal document: here is what I analysed, here is what I recommend, here is why. No action taken. The queue sits there until a human opens it the next morning, reviews the recommendations, and approves what should proceed. This maps directly onto existing enterprise approval workflows. The model does the cognitive work. The human retains the authority. 
  • Third: a richer audit trail. This is the counterintuitive part. The while loop is more auditable than the old DAG, not less. In a traditional pipeline, your audit trail says: step 3 passed, step 4 flagged. You know what happened. You rarely know why. The logic is buried in application code that auditors cannot read. In a harness, every iteration of the loop captures the model’s reasoning in natural language: “I compared invoice line 4 against contract clause 7.2 and found a 3.2% variance, which is below the 5% threshold defined in policy document AP-2024-03, so I am marking this as compliant.” The audit trail reads like a memo, not a log file. Every decision is explained, referenced, and traceable. And because the full memory trail is preserved, a human can interrogate the model after the fact – ask it why it made a specific decision, walk it through its own reasoning, challenge its assumptions. It is like being able to look inside the head of the person who did the work. No traditional system has ever offered that. 
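The first mechanism can be sketched directly in code. The tiers, tool names, and proposal queue below are illustrative assumptions, not a standard – the key point is that the gate lives in the tool layer, outside the model’s control:

```python
from enum import Enum, auto

# Sketch of tier-gated tool dispatch. A GATED tool never executes:
# the dispatcher records a proposal and returns control to the loop.
class Tier(Enum):
    READ = auto()       # executes freely, always logged
    LOW_WRITE = auto()  # executes, sampled for human review
    GATED = auto()      # never auto-executes: proposal only

audit_log: list = []
proposal_queue: list = []

TOOLS = {
    # name: (tier, implementation)
    "query_contract":  (Tier.READ, lambda supplier: {"rate": 100.0}),
    "draft_email":     (Tier.LOW_WRITE, lambda to, body: "drafted"),
    "approve_invoice": (Tier.GATED, None),  # deliberately has no body
}

def dispatch(name: str, args: dict, reasoning: str):
    """Execute a tool request, or block it and queue a human proposal."""
    tier, impl = TOOLS[name]
    audit_log.append({"tool": name, "args": args, "why": reasoning})
    if tier is Tier.GATED:
        proposal_queue.append({"tool": name, "args": args, "why": reasoning})
        return "BLOCKED: proposed for human approval"
    return impl(**args)

rate = dispatch("query_contract", {"supplier": "ACME"},
                "need the contracted rate for line comparison")
verdict = dispatch("approve_invoice", {"invoice_id": "INV-7"},
                   "variance 3.2% is below the 5% policy threshold")
```

The model can request `approve_invoice` as often as it likes; the dispatcher only ever files a proposal.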

This is not speculative. It is being built today. 

Knowledge work is a while loop 

Zoom out from invoices and the pattern becomes clear. The harness architecture permits the automation of workflows that were previously too complex, too variable, or too expensive to codify in traditional software. 

Think about what a knowledge worker actually does. Read an email, decide, write a response. Read a spreadsheet, think, update the spreadsheet. Pull data from system A, transform it, put it in system B. Read requirements, plan, execute steps.

Every task, stripped to its essentials, is the same three-beat cycle: read from one data layer, apply cognitive processing, write to another.

And that cycle does not run once – it repeats, open-endedly, until the goal is met. Not for each step in the plan, but while the problem is unsolved. And in almost every case, that cycle is mediated by a graphical user interface that a human clicks through manually. 

The cognitive effort – the tax – sits in the middle step. The reading and writing are mechanical. The thinking is where the value is, and also where the bottleneck is. A person can only hold so much context. They get tired. They miss things. They are slow. 

A frontier model inside a harness collapses this cycle. It reads from system A through an API. It synthesises, interprets, and applies judgment – drawing on a context window that can hold entire contracts, entire codebases, entire policy documents at once. It writes to system B through another API. It loops – not through a predetermined sequence of steps, but until the task is done. The cycle that took a knowledge worker an hour takes the model seconds. And it does not get tired. 

But this does not mean the human goes away. This is a critical point, and one we want to be precise about. 

For routine, well-defined workflows – invoice review, regulatory filing checks, data reconciliation – the model will increasingly operate autonomously, with human oversight reduced to a control function: a person reviews a sample, audits the logs, and confirms the model is performing within policy. Enterprises already govern conventional automation this way; the harness simply extends the principle to cognitive work. 

But for novel, ambiguous, or strategically significant challenges – the ones that require creativity, critical judgment, or the kind of thinking that only sharpens through friction – the human does not step back. The human steps up, becoming a manager of multiple harnesses. One model is researching a market entry question. Another is drafting a regulatory response. A third is analysing a dataset for anomalies. The human sets direction, reviews output, redirects when the model drifts, and synthesises across workstreams. The role shifts from doing the cognitive work to directing it. 

Left alone, models spiral or drift. What a human provides is not just input but asymmetry – the ability to make some outputs matter more than others, to hold a stake in the outcome. A model, regardless of its scale, is a forward pass. A human is the one who decides what the forward pass is for. 

The spreadsheet moment 

We have seen a transition like this before. 

Before VisiCalc launched in 1979, the word “spreadsheet” referred to a physical object – a large sheet of columnar paper on which accountants performed calculations by hand. Picture Jack Lemmon in The Apartment: a lone clerk in an endless grid of identical desks stretching to the vanishing point, each person hunched over a workbook with an adding machine and a pencil. The building itself was the computer. Each person was a cell in a living spreadsheet. When one number changed – a revised sales forecast, an updated tax rate – a signal propagated across floors and departments as every dependent calculation was redone by hand, cascading across pages of ledger paper. Someone on the top floor hit “recalculate,” and the entire human machine ground into motion. 

VisiCalc, and later Lotus 1-2-3, did not merely speed this up. They made it unrecognisable. A single analyst with a personal computer could now do in minutes what had taken a department days. The profession of accounting did not disappear – it transformed. The mechanical recalculation vanished. What remained, and what grew, was the judgment: the interpretation, the strategy, the advice. The tool eliminated the drudgery and amplified the thinking. 

We are standing at an equivalent moment. A few years from now, we will look back and find it remarkable that knowledge workers once shouldered the entire cognitive burden themselves – reading every document, cross-referencing every data point, drafting every analysis from scratch. The harness does for cognitive work what the spreadsheet did for calculation: it collapses the mechanical effort and leaves the human free to do what humans do best. 

Each of us is about to multiply our cognitive reach, directing several model instances the way a conductor directs an orchestra – not playing every instrument, but shaping the music. 

Where we go from here

At PwC, we have begun building on a harness architecture of our own, exploring how it can be applied to the complex, high-stakes workflows our clients depend on. At PwC Canada, this work has taken shape as Enterprise Instinct – a platform built on the principles described in this article, including agent teams that coordinate through shared, file-based memory rather than rigid orchestration. Results are encouraging. The pattern’s simplicity is not a weakness; it is what makes it auditable, governable, and adaptable to the regulatory environments our clients operate in. 

We do not think harness-based systems will replace every enterprise application overnight. Legacy systems, regulatory constraints, and organisational inertia are real forces. But the direction is clear. The era of building elaborate software to contain and control AI models is giving way to something leaner and more powerful: letting the model work, giving it the tools it needs, and focusing human energy on what actually matters – knowing what to ask for, learning to think alongside a model, and scaling that interplay across an enterprise. 

 

Read more here:

https://transformer-circuits.pub/2025/linebreaks/index.html

 

George Juraj Salapa

Manager, Digital Factory, PwC Austria
