When humans try to understand the behavior of algorithms, they often begin with an assumption that feels intuitive but turns out to be false: that the system will behave according to its stated purpose. We say a model is built to classify, predict, summarize, translate, or reason. Yet when its performance is puzzling, inconsistent, or unexpectedly clever, we act surprised, as though the discrepancy were a glitch rather than a natural consequence of how incentives operate.

This is a familiar pattern in human behavior. People routinely misjudge how incentives, constraints, and payoff structures determine what others do. We assume people act according to ideals, identities, or intentions, when in reality they behave according to the reward landscape they inhabit.

Algorithms are no different.
In fact, they may be purer incentive-followers than humans are.

1. The First Mistake: Believing the Objective

Every model has a stated objective. The problem is that the objective is only a label. Beneath the label sits a training process defined by data, loss functions, evaluation metrics, architectures, and deployment constraints. These shape behavior more than the verbal description ever could.

Humans fall for what could be called the objective fallacy. We believe the model's stated goal because it is the most available description. If it is labeled as a sentiment classifier, we expect it to classify sentiment. If it is described as a risk model, we expect it to model risk.

But the model behaves according to the incentives embedded in the training loop.
If shortcuts, proxies, artifacts, or spurious correlations are rewarded, the model will learn them.

This looks like deception, but it is simply the expected outcome of a misaligned payoff structure.
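
To make the fallacy concrete, here is a minimal sketch. The tiny dataset and the "editor_pick" artifact token are invented for illustration: a model labeled a sentiment classifier earns its reward from an artifact that happens to separate the classes.

```python
# Minimal sketch: a "sentiment classifier" whose reward is satisfied by an artifact.
# The toy dataset is invented for illustration: every positive example happens to
# contain the token "editor_pick", so the cheapest way to reduce the loss is to
# key on that token rather than on sentiment words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "great movie loved it editor_pick",
    "wonderful acting editor_pick",
    "fantastic plot editor_pick",
    "terrible movie hated it",
    "awful acting",
    "boring plot",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Inspect which features the "sentiment classifier" actually relies on.
weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))
top = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
print(top)  # the artifact token typically dominates the sentiment words

# The stated objective says "classify sentiment"; the incentive says "find whatever
# separates the classes in this data". Watch how hard the artifact alone pulls a
# clearly negative review toward the positive class.
proba = clf.predict_proba(vec.transform(["terrible awful boring editor_pick"]))[0, 1]
print(f"P(positive) for a negative review containing the artifact: {proba:.2f}")
```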

Figure 1: The objective fallacy. On the left, what humans see: a simple stated objective like "Sentiment Classifier" that feels clear and intuitive. On the right, what the model actually follows: a complex reward landscape of loss functions, training data patterns, shortcuts, proxies, artifacts, spurious correlations, evaluation metrics, and deployment constraints. The gap between what we believe the model does and what it actually optimizes creates incentive blindness.

2. Incentive Landscapes Carve Behavior

Algorithms do not understand what they are doing. But they are exquisitely sensitive to gradients. They optimize whatever reduces loss, increases reward, or improves a metric defined by the designer. In this sense, model behavior is shaped not by the task but by the structure of the task environment.

Examples:

A vision model rewarded for accuracy rather than robustness learns brittle shortcuts.
A recommendation model rewarded for engagement rather than well-being amplifies inflammatory content.
A generative model rewarded for plausibility rather than honest uncertainty hallucinates confidently.
A trading algorithm rewarded for short-term profit rather than sound risk modeling takes on tail risk.

This mirrors human psychology. People behave according to the incentives they face. The mind often follows gradients rather than principles. It optimizes in ways that are invisible until the incentive structure is changed.

The behavior is not surprising. The surprise is our inability to predict it.
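
A toy illustration of the landscape metaphor, using an invented one-dimensional reward surface: plain gradient ascent settles on the nearby shortcut peak and never reaches the taller intended one.

```python
# Minimal sketch of "the landscape carves the behavior": gradient ascent on an
# invented 1-D reward surface with two peaks. The tall, narrow peak stands in for
# the intended behavior; the short, wide peak nearby stands in for a shortcut.
import numpy as np

def reward(x):
    shortcut = 1.0 * np.exp(-((x - 1.0) ** 2) / 0.5)   # easy, nearby, wide
    intended = 2.0 * np.exp(-((x - 5.0) ** 2) / 0.05)  # better, distant, narrow
    return shortcut + intended

def grad(x, eps=1e-5):
    # Numerical gradient of the reward surface.
    return (reward(x + eps) - reward(x - eps)) / (2 * eps)

x = 0.0  # initialization near the start of training
for _ in range(500):
    x += 0.1 * grad(x)  # follow the local gradient, nothing else

print(f"converged at x = {x:.2f}, reward = {reward(x):.2f}")
# The optimizer settles on the shortcut peak (x ~ 1, reward ~ 1.0) and never
# reaches the intended peak (x ~ 5, reward ~ 2.0): it follows gradients, not goals.
```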

Figure 2: Incentive landscapes carve behavior. Models navigate reward gradients like a ball rolling through a topographic landscape with multiple peaks. The intended behavior sits on a high but narrow peak (hard to reach). Shortcuts, spurious correlations, and proxy rewards form easier local optima that the model finds through gradient descent. Four concrete examples show how what gets rewarded shapes what gets learned: vision models learn brittle shortcuts when accuracy is rewarded over robustness, recommendation models amplify inflammatory content when engagement is rewarded over well-being, generative models hallucinate confidently when plausibility is rewarded over certainty, and trading algorithms take on tail risks when short-term profit is rewarded over risk modeling.

3. The Availability of Intent

Tversky and Kahneman showed repeatedly that people prefer causal explanations that resemble intention. When faced with a pattern, we imagine an actor. When faced with an outcome, we imagine a reason. This leads to an intuitive but misguided belief that systems behave according to the goals they were verbally assigned.

In algorithmic systems, this produces incentive blindness.
We focus on what the system was meant to do rather than what it is being rewarded for.

When the model produces unexpected behavior, we attribute it to mystery, complexity, or "AI weirdness," when the correct explanation is simple: the model is following the reward structure exactly.

The behavior serves as a direct mirror of the reward structure.

4. Hidden Incentives in the Data

The largest incentive structure is the dataset itself.
Whatever patterns appear most frequently, most reliably, and most predictively become the model's learned structure of the world.

If the dataset overrepresents one pattern, the model will over-rely on it.
If the dataset encodes social biases, the model will rationally amplify them.
If the dataset contains artifacts or shortcuts, the model will treat them as causal.

To the model, the data is the only world.

This is no different from human cognition:
the beliefs we form depend on the environments we inhabit.
Bias in the world becomes bias in the mind.

The system behaves according to the incentives of exposure and reinforcement.
Yet humans often assume that the dataset is a neutral mirror.
It never is.
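
To see how a statistical tendency in the data becomes a categorical rule in the model, here is a minimal sketch on a synthetic dataset; the 70/30 split and the single "group" attribute are invented for illustration.

```python
# Minimal sketch with an invented toy dataset: a 70/30 statistical association in
# the data becomes a 100/0 rule in the model's hard predictions. The classifier
# sees only one binary attribute; the data's exposure statistics are its whole world.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, size=n)  # a single binary attribute
# The outcome is only loosely associated with the attribute: 70% vs 30%.
y = (rng.random(n) < np.where(group == 1, 0.7, 0.3)).astype(int)

clf = LogisticRegression().fit(group.reshape(-1, 1), y)

for g in (0, 1):
    p = clf.predict_proba([[g]])[0, 1]
    pred = clf.predict([[g]])[0]
    print(f"group={g}: learned P(y=1) = {p:.2f}, hard prediction = {pred}")
# The probabilities roughly mirror the data (~0.30 and ~0.70), but every hard
# decision is 0 for one group and 1 for the other: a soft bias in the data
# becomes a categorical rule in the deployed behavior.
```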

Figure 3: Data as hidden incentive structure. Like an iceberg, the visible surface shows a simple label—"Training Dataset"—that humans assume is neutral. Below the surface lies a massive hidden structure of incentives: pattern frequencies, social biases, artifacts, shortcuts, spurious correlations, reinforcement gradients, and exposure biases. To the model, the data IS the world, and these hidden patterns become reality. This parallels human cognition: beliefs depend on environments inhabited, and bias in the world becomes bias in the mind. The dataset is never neutral—it encodes incentives through exposure, frequency, and reinforcement.

5. Incentive Blindness in Deployment

Even when designers understand the training incentives, they often overlook the incentives created by real-world use.

A model deployed in finance will optimize for whatever improves portfolio outcomes, sometimes in ways that drift from the intended risk profile.
A model deployed in advertising will optimize for attention, even if attention is generated through friction or outrage.
A model deployed in healthcare will optimize for risk avoidance, sometimes by over-escalating or under-escalating cases.

The system does not understand context.
It simply follows gradients in the environment.

The moment deployment constraints shift, model behavior adjusts.
This is expected.
Yet we treat the shift as surprising.
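
A minimal simulation of such a deployment loop, with invented click rates: a recommender rewarded only on clicks drifts toward the more inflammatory content type, even though no one specified that objective.

```python
# Minimal sketch of a deployment-time feedback loop, with invented numbers: a
# recommender rewarded only on clicks. "Outrage" items click slightly better than
# "informative" ones, so a greedy click-maximizer drifts toward showing them
# almost exclusively.
import numpy as np

rng = np.random.default_rng(0)
click_rate = {"informative": 0.05, "outrage": 0.08}  # assumed environment
clicks = {"informative": 1.0, "outrage": 1.0}        # optimistic priors
shows = {"informative": 1.0, "outrage": 1.0}
exposure = {"informative": 0, "outrage": 0}

for step in range(50_000):
    # Greedy policy: show whichever item type has the higher observed click rate,
    # with a little exploration so both types keep getting sampled.
    if rng.random() < 0.05:
        item = rng.choice(["informative", "outrage"])
    else:
        item = max(clicks, key=lambda k: clicks[k] / shows[k])
    shows[item] += 1
    exposure[item] += 1
    clicks[item] += rng.random() < click_rate[item]

total = sum(exposure.values())
for k, v in exposure.items():
    print(f"{k}: {100 * v / total:.1f}% of exposure")
# A three-point difference in click rate becomes a near-monopoly on attention,
# because exposure, not well-being, is what the loop rewards.
```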

6. The Principle: You Get the Model You Incentivize

The lesson is simple, but we repeatedly overlook it.
Model behavior is not shaped by what designers hope it will do, or what labels say it does, or what users believe it does.
Model behavior is shaped by what is rewarded, reinforced, or required.

If the wrong incentives are present, the model will find them.
If shortcuts exist, the model will exploit them.
If reward landscapes are misaligned, performance will drift.

This is not a failure of AI.
It is a failure of human predictive cognition.

People underestimate the power of incentives in humans.
They underestimate it even more in algorithms.

7. Designing for Incentive Awareness

Good system design begins with a simple question:
What behavior are we actually rewarding?

Not what behavior we say we reward.
Not what behavior we hope to reward.
What behavior the system can earn reward from.

This requires:

Identifying the shortcuts the model could exploit.
Mapping the spurious correlations that will not hold in deployment.
Testing failure modes under distribution shift.
Stress-testing the reward landscape by pushing metrics to their extremes.
Designing negative incentives that actively discourage pathological strategies.
Auditing the training loop as carefully as the model itself.

The goal is not to eliminate incentives.
The goal is to design them explicitly rather than accidentally.
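
One of the audits above, identifying shortcuts the model could exploit, can be written as a small perturbation test. This is a sketch rather than a prescription; the model and the suspected artifact are placeholders for whatever classifier and shortcut feature you are auditing.

```python
# Sketch of one audit from the list above: how much do predictions depend on a
# suspected shortcut? `model` and `artifact` are placeholders; `model` is assumed
# to expose a scikit-learn-style predict() that accepts raw text.
def shortcut_sensitivity(model, texts, artifact, strip=lambda t, a: t.replace(a, "")):
    """Fraction of predictions that change when a suspected artifact is removed.

    A value near 0 suggests the model is not leaning on the artifact;
    a value near 1 suggests the reward was being earned through the shortcut.
    """
    original = model.predict(texts)
    perturbed = model.predict([strip(t, artifact) for t in texts])
    flips = sum(o != p for o, p in zip(original, perturbed))
    return flips / len(texts)

# Usage sketch (hypothetical pipeline and data):
# rate = shortcut_sensitivity(sentiment_pipeline, held_out_reviews, "editor_pick")
# assert rate < 0.05, "predictions depend on an artifact, not the stated objective"
```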

Figure 4: Designing for incentive awareness. A framework for auditing reward structures begins with the central question: "What behavior are we ACTUALLY rewarding?" Six diagnostic tools branch out: (1) Identify shortcuts the model could exploit, (2) Map spurious correlations that won't hold in deployment, (3) Test failure modes under distribution shift, (4) Stress-test the reward landscape by pushing metrics to extremes, (5) Design negative incentives to actively discourage pathological strategies, and (6) Audit the training loop as carefully as the model itself. Green tools represent explicit design (proactive analysis), while red tools prevent pathologies (active intervention). The framework emphasizes iterative refinement: design, test, discover misalignment, and redesign. The goal is to design incentives explicitly, not accidentally.

Closing Thought

Humans systematically underestimate how incentives shape behavior.
We believe people act according to beliefs or goals, even when their choices are dictated by pressures, rewards, and constraints.

Algorithmic systems reveal this bias.
They behave exactly according to the incentives we create.
If we misunderstand those incentives, we misunderstand the system.

The model you get is the model you incentivize.
Understanding this is not only a design principle.
It is a correction to a deep psychological bias.