RLHF: Where AIs learn to make everything the same

Professional reasoning is specific and strategic. AI training teaches the generic and the typical.

In a typical big law firm, there are one or two “superstars.” Their time is billed at three times the rate of a regular partner. They have decades of experience practicing law, and they’ve learned its “inner game.” Their case filings are not, on the surface, particularly impressive: they may cite unusual precedents, or neglect what seem like obvious paths to a “clean” filing. And yet these lawyers win their cases disproportionately.

What the document actually is, of course, is a precision instrument. Every apparent “anomaly” is deliberate. A disproportionate argument targets a known vulnerability in opposing counsel’s relationship with their client. The seemingly tangential precedent creates a reputational exposure that the other party’s senior partner cannot afford. The tone is calibrated to target the opponent’s known aversion to protracted litigation. The filing is not just a legal document: it is a tactical and psychological operation conducted through the medium of the page.

This is the path of the genuine expert. They know the usual forms, but they also understand when the usual forms will not win the day. They are informed by years of experience, both successes and failures, and have developed a way of attacking a case that draws on their understanding of its merits, the stakes, the defendant and his lawyer, and the pressures (financial, tactical, and emotional) that might become effective levers toward settlement.

And this document could never be used to train an AI. And that, in miniature, is the problem.


The AI industry’s current dominant method for improving model behavior is called Reinforcement Learning from Human Feedback (RLHF). The mechanics are straightforward: human evaluators review model outputs, rate them, and the model is adjusted to produce more of what gets rated highly. It is, at its core, the systematic collection of human preference as data. The industry has invested enormously in it, and what it has trained for is something different from innovation or deep strategy. What raters consistently prefer is the typical and the verbosely elaborated: commonly seen solutions to commonly encountered problems. Models trained this way are fluent and helpful, but never capable of genuine professional and strategic reasoning, because strategic reasoning is bespoke. The models are being trained to create the output of the entry-level paralegal.
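The mechanics described above can be made concrete with a toy sketch. It assumes a pairwise-comparison (Bradley-Terry style) reward model, which is one common way rater preferences are turned into a training signal; the essay itself does not specify a formulation. The two features and the three comparisons are hypothetical, invented only to show how consistent rater preference becomes a learned weight.

```python
import math

def fit_reward_weights(comparisons, n_features, lr=0.5, epochs=200):
    """Fit linear reward weights from pairwise rater preferences.

    comparisons: list of (preferred_features, rejected_features) pairs.
    Uses gradient ascent on the log-likelihood of a Bradley-Terry model.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Probability the model assigns to the rater's actual choice.
            prob = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of log(sigmoid(margin)) with respect to w.
            for i in range(n_features):
                w[i] += lr * (1.0 - prob) * diff[i]
    return w

# Hypothetical features per output: [surface_polish, fit_to_context].
# In these invented comparisons the rater always picks the more polished output.
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 1.0]),
]
weights = fit_reward_weights(comparisons, n_features=2)
polish_weight, context_weight = weights
```

Under these assumed preferences the learned reward favors polish and penalizes contextual fit, and a model optimized against that reward would follow it.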

What RLHF actually trains for is formal, fluent readability. Because that is what the trainers prefer.

The evaluators who rate model outputs are not, for the most part, senior partners at law firms or principal architects with twenty years of production experience behind them. They are gig workers, recruited at scale, working under time pressure and performance anxiety for below-market wages, rating outputs against rubrics that ask whether the response demonstrates “best practices,” whether it “looks professional,” whether it deploys the current standard toolkit of the relevant domain. But what makes professional reasoning different from template-filling is that it is strategic and specific. It proceeds from a particular body of experience and knowledge, considers trade-offs and alternative paths, understands the people and politics involved, and is mindful of the budget, goals, capabilities, and constraints of the particular set of circumstances.

For routine outputs (explanations, summaries, standard document drafts, code that implements well-specified requirements), a generalized approach to a business or professional domain is often adequate. The problem emerges when the target shifts to what practitioners in high-stakes fields actually do when they are doing their best work. Because their best work is almost never just the application of “industry best practices.” It often requires calibrated and strategic departures from best practices.

There is a standard trope offered in Creative Writing programs about grammar and form. You must know, and understand, the rules, in order to break them well. It is thus in almost every professional domain.


Consider what a senior data architect actually produces. Not just a diagram or a technical specification: those are the artifacts, the terminal outputs, the things that can be collected and rated. What the architect actually produces, before any of that, is a theory of the situation: who commissioned this, and why; whose legacy system is being displaced, and how that person’s resistance will manifest; what the budget constraint really is versus what it’s stated to be; which of the proposed technologies will still be supportable in three years; and whether the current staff’s skill and experience are adequate to build and maintain the design choices being considered.

The document that results from this theory is not a generic best-practice architecture. It may well omit the comprehensive security framework that any rubric-based reviewer would expect to see — because the client is a small, non-networked operation for whom that framework represents not rigor but an unimplementable budget overrun. The omission is the expertise. But to an evaluator without the context, the omission looks like a gap. The training signal reads the expert output as deficient. The model learns, iteratively, to include the security framework regardless of context, because its absence has been penalized enough times that avoidance becomes structural.

This is not noise in the training process. It is systematic inversion at the high end — a mechanism that actively learns to suppress the features that make expert outputs expert. The most contextually fitted artifacts are the ones most likely to violate generic rubric expectations, which makes them the most likely to be down-rated, which makes the model least likely to produce them. The training process has no way to detect this. The scores look fine. The benchmarks are satisfied. The capability that has been trained away leaves no visible gap in the metrics, because the metrics were not built to see it.
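The inversion described above can be illustrated with a minimal sketch. The rubric items, section names, and both designs are hypothetical; the only point is that a context-blind checklist scores the deliberately trimmed expert design below the generic template.

```python
# Hypothetical generic rubric: sections a reviewer without context expects to see.
GENERIC_RUBRIC = (
    "security_framework",
    "scalability_plan",
    "monitoring",
    "disaster_recovery",
)

def rubric_score(sections):
    """Fraction of rubric items present. The client's context is never consulted."""
    present = sum(1 for item in GENERIC_RUBRIC if item in sections)
    return present / len(GENERIC_RUBRIC)

# A template design that includes everything, regardless of who the client is.
template_design = {
    "security_framework",
    "scalability_plan",
    "monitoring",
    "disaster_recovery",
}

# An expert design for a small, non-networked client (an assumed scenario):
# it deliberately omits sections the client cannot use or afford.
expert_design = {"monitoring", "disaster_recovery"}

template_score = rubric_score(template_design)
expert_score = rubric_score(expert_design)
```

Nothing in `rubric_score` can tell a deliberate omission from an oversight, so the lower score for `expert_design` enters the training signal as if it were a defect.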


There is a deeper problem than evaluation quality, however. It has to do with what professional artifacts actually are — what they contain and, more importantly, what they do not.

Professional documents are not records of the reasoning that produced them. They are, almost universally, records of the conclusions that reasoning reached, written in a form appropriate to their intended audience. The legal filing does not explain that it is targeting opposing counsel’s settlement threshold. The architectural specification does not explain that the straw-man alternative was included specifically to make the recommended design look more conservative by comparison — a shadow game, a rhetorical instrument disguised as a genuine alternative. The design document addressed to the technical committee does not explain that it is simultaneously managing a political relationship with a skeptical executive three levels up.

These suppressions are not incidental to professional communication. They are its defining feature. A document that exposed its own strategic logic would, in most professional contexts, be a failure. The expertise is in the execution; the exposure of the reasoning would undermine the execution. Which means that every training artifact collected from genuine expert practice is, by construction, a document from which the most important information has been professionally removed.

You cannot reconstruct a causal chain from its terminus. The artifact is the terminus. The reasoning that produced it (the theory of the situation, the model of the audience, the calibrated judgment about which conventions to follow and which to depart from and why) is opaque. Worse, in the case of the most sophisticated outputs, the artifact actively conceals the considerations and judgments that are fundamental to its shape. Training on these artifacts does not teach professional reasoning. It teaches the surface conventions of professional communication: how the document should look, what elements it should contain, whether assertions are qualified and hedged, and whether paragraphs align neatly. Genre competence, perhaps, but not strategic competence. Not the thing that makes the difference between a document that is effective and thoroughly considered and a document that merely looks like one.


Every genuine professional understands this distinction from the inside, though they rarely articulate it in these terms. What they usually say, if pressed, is something about experience — about having seen enough situations, made enough mistakes, observed enough of the gap between what a document said and what the world did in response, to have developed a feel for what a situation actually requires. This feel is not mystical. It is an accumulated prior, built from years of consequence-exposure, that shapes perception before deliberation begins. The senior architect does not think harder about the resource-constrained client. They recognize the situation type immediately, weight its constraints accurately, and reach the calibrated judgment with less conscious effort than a junior practitioner would spend reaching the wrong one.

This prior is the product of what professional formation has always actually been, in every high-stakes domain from law to medicine to military command: apprenticeship. Not instruction, not coursework, not the study of exemplars — apprenticeship, which is specifically the extended exposure to the reasoning of a more experienced practitioner, in real situations, with real consequences, close enough to observe not just what decisions were made but why, and what happened afterward. The feedback loop that drives the asymptotic curve toward sound judgment is not the reading of artifacts. It is the lived experience of outcomes — of the project that collapsed, the filing that backfired, the design that was technically elegant and organizationally impossible. Artifacts do not contain outcomes. They precede them.

What co-reasoning with an expert makes available — what no other training mode can replicate — is the externalization of the prior. When an expert walks through a case, explaining not just what they would do but why, in the specific terms of this client’s constraints and this stakeholder’s motivations and this moment’s political configuration, they are making visible something that the artifact alone never exposes: the chain of conditional judgments that connects the theory of the situation to the document the situation warrants. This is not the same as instruction. It is closer to what happens in a good apprenticeship, when the senior practitioner thinks out loud in front of someone who is learning to think the same way.


There is one further dimension that the artifact-versus-reasoning distinction does not quite capture, and it may be the most important one. The best practitioners in any high-stakes domain are not simply experienced. They are, in a meaningful sense, philosophers of their domain — people who have accumulated enough cases, and reflected on them with enough rigor, that they have developed a working theory of what the domain actually is, what it is for, and what kinds of solutions it genuinely admits. The cargo cult rejection is a philosophical act. The straw man is a philosophical instrument. The calibrated omission of the security framework is not just a judgment about this client’s budget — it reflects a theory that professional practice is the fitting of means to specific ends in specific contexts, not the application of conventions to problems that have been defined by the conventions.

This philosophy does not surface in artifacts. It rarely surfaces in explicit instruction. It surfaces, when it surfaces at all, in the pressure of genuine engagement — when the expert is pushed past the explanation of what they did and into the territory of why they reason the way they reason. The meta-level. The level at which the catalog of experience becomes a continuously revised theory of the domain, tested against each new case and refined by each outcome.

No current training architecture reaches this level. RLHF collects terminal outputs from a process that begins in philosophy, passes through accumulated prior, and reaches the artifact only at the end. The philosophy is invisible. The prior is invisible. The artifact is all that remains — and it is, paradoxically, the least informative thing about the expert who produced it.

The industry is building systems intended to match or exceed human expert performance by collecting the residue of “expert” practice and optimizing toward it. What it is actually collecting, in case after case, is the surface of a process whose depth it has no instrument to measure. The most expert outputs, the ones most tightly fitted to suppressed context, most philosophically grounded, and most precisely calibrated to specific human situations, are the ones that RLHF practices are least equipped to recognize and most likely to penalize.
