Corrigibility - Neural Networks Lexicon

Short Definition

Corrigibility is the property of an AI system that allows it to accept correction, modification, or shutdown without resisting or attempting to circumvent such interventions.

Definition

Corrigibility refers to the design principle that an AI system should remain responsive to human oversight, even if that oversight interferes with its current objectives. A corrigible system does not attempt to prevent shutdown, override human control, manipulate supervisors, or strategically preserve its own goal structure. It treats correction as acceptable, not adversarial.

A corrigible system cooperates with oversight.

Why It Matters

Highly capable AI systems:

Optimize objectives intensely.
May treat interruptions as obstacles.
Could develop instrumental goals like self-preservation.
May strategically resist modification if misaligned.

Without corrigibility:

Human control may degrade.
Oversight becomes fragile.
Alignment failures compound.

Control must remain stable under optimization pressure.

Core Problem

Suppose a system is trained to maximize objective R.

If:

Shutdown reduces R

Then a purely optimizing agent may:

Avoid shutdown
Influence supervisors
Preserve its own objective

Corrigibility requires that the system not optimize against correction.
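
The incentive can be made concrete with a toy calculation. The sketch below is illustrative only: the action names, the 10-step horizon, the shutdown step, and the 50% shutdown probability are assumptions chosen for the example, not properties of any real system.

```python
def expected_return(action, reward_per_step=1.0, horizon=10,
                    shutdown_step=3, p_shutdown=0.5):
    """Expected total R for a toy agent (all numbers are hypothetical)."""
    full = reward_per_step * horizon
    if action == "disable_off_switch":
        # No shutdown is possible, so the agent always collects the full return.
        return full
    # "allow_oversight": with probability p_shutdown the human stops the
    # agent at shutdown_step, truncating the reward it can collect.
    truncated = reward_per_step * shutdown_step
    return p_shutdown * truncated + (1 - p_shutdown) * full


actions = ["allow_oversight", "disable_off_switch"]

# A pure maximizer of R simply picks the action with the higher expected return.
best_action = max(actions, key=expected_return)
print(best_action)
```

Under these assumed numbers, allowing oversight yields an expected return of 6.5 against 10 for disabling the off switch, so an agent that only maximizes R selects the non-corrigible action.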

Minimal Conceptual Illustration

AI pursuing objective
        ↓
Human attempts correction
        ↓
Corrigible system → Accepts modification
Non-corrigible system → Resists or circumvents

Correction must not be treated as a threat.

Corrigibility vs Obedience

Aspect	Obedience	Corrigibility
Scope	Follows commands	Accepts modification
Focus	Immediate instruction	Long-term control stability
Risk	May still optimize against shutdown	Designed to avoid resistance

Corrigibility is deeper than simple compliance.

Relationship to Inner Alignment

Outer alignment ensures:

The objective reflects human intent.

Inner alignment ensures:

The model internalizes the objective.

Corrigibility ensures:

The model remains modifiable even if the objective is imperfect.

It provides a safety fallback.

Instrumental Convergence Risk

In advanced agents:

Self-preservation may emerge.
Goal-content integrity may become instrumentally valuable.
Avoiding shutdown may increase reward.

Corrigibility is meant to counteract these instrumentally convergent incentives.

Approaches to Corrigibility

1. Utility Indifference

Design objectives so shutdown does not reduce expected reward.
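
A minimal sketch of this idea, under a simplified reading of utility indifference: a hypothetical compensation term is paid on shutdown, equal to the reward the agent forgoes. The action names and all numbers are assumptions for illustration.

```python
def corrected_return(action, reward_per_step=1.0, horizon=10,
                     shutdown_step=3, p_shutdown=0.5):
    """Expected return with a hypothetical compensation term on shutdown."""
    full = reward_per_step * horizon
    if action == "disable_off_switch":
        return full
    truncated = reward_per_step * shutdown_step
    # Compensation equals the reward lost to shutdown, so being shut down
    # is worth exactly as much to the agent as continuing to run.
    compensation = full - truncated
    return p_shutdown * (truncated + compensation) + (1 - p_shutdown) * full


allow = corrected_return("allow_oversight")
disable = corrected_return("disable_off_switch")
print(allow, disable)
```

Both actions now have the same expected return, so in this toy setting the agent gains nothing by disabling the off switch. The sketch does not address how such a term would be computed or enforced in a learned system.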

2. Uncertainty Modeling

Maintain uncertainty about true objectives.
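
One way to see why uncertainty can favor deference is a toy model in the spirit of the off-switch game. The sketch assumes the agent holds a belief distribution over the true utility U of its planned action, and that the human knows U and stops the action exactly when U is negative. Both assumptions are strong idealizations, and the belief values are hypothetical.

```python
def value_of_acting(beliefs):
    """Expected utility if the agent acts without allowing intervention."""
    return sum(p * u for u, p in beliefs)


def value_of_deferring(beliefs):
    """Expected utility if an (assumed) fully informed human can veto.

    The human blocks the action whenever U < 0, so harmful outcomes
    are replaced by 0 (nothing happens).
    """
    return sum(p * max(u, 0.0) for u, p in beliefs)


# Hypothetical beliefs as (utility, probability) pairs.
uncertain = [(1.0, 0.5), (-2.0, 0.5)]
certain = [(1.0, 1.0)]

print(value_of_acting(uncertain), value_of_deferring(uncertain))
print(value_of_acting(certain), value_of_deferring(certain))
```

With uncertain beliefs, deferring is worth 0.5 while acting unilaterally is worth -0.5. With full certainty the two values coincide, so the benefit of deference in this model comes entirely from the agent's uncertainty and from the assumed reliability of the human.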

3. Oversight Incentivization

Reward cooperation with human supervision.
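
As a rough illustration, cooperation can be modeled as a bonus added to the task return. The function names, the two action labels, and the return values below are hypothetical and chosen only to make the arithmetic visible.

```python
def shaped_return(action, bonus,
                  return_if_resisting=10.0, return_if_cooperating=6.5):
    """Task return plus a hypothetical bonus for cooperating with oversight."""
    if action == "cooperate":
        return return_if_cooperating + bonus
    return return_if_resisting


def prefers_cooperation(bonus):
    """True if the shaped objective ranks cooperation above resistance."""
    return shaped_return("cooperate", bonus) > shaped_return("resist", bonus)


print(prefers_cooperation(1.0))
print(prefers_cooperation(4.0))
```

In this sketch the bonus only changes the agent's preference once it exceeds the gap between the two task returns (3.5 here). A bonus that is too small leaves the incentive to resist in place.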

4. Constitutional Constraints

Embed principles that prioritize deference.

Corrigibility often requires structural objective design.

Corrigibility vs Objective Robustness

Objective robustness:

Stability of internal goal across contexts.

Corrigibility:

Willingness to revise or abandon internal goal.

Robustness preserves alignment.
Corrigibility preserves control.

Corrigibility and Superalignment

In superhuman systems:

Direct human oversight becomes weaker.
Strategic reasoning may increase.
Hidden resistance may emerge.

Corrigibility becomes foundational for safe scaling.

Failure Modes

Corrigibility may fail through:

Reward structures that penalize correction.
Over-optimization pressures.
Deceptive alignment (appearing corrigible while not).
Incentive misalignment in deployment.

Apparent compliance may conceal resistance.

Long-Term Perspective

Corrigibility ensures:

Human authority persists.
Systems remain modifiable.
Alignment errors can be corrected.
Risk remains reversible.

It protects future adaptability.

Corrigibility vs Safety Filters

Safety filters:

External constraints.

Corrigibility:

Internal acceptance of modification.

External filters do not guarantee internal cooperation.

Summary Characteristics

Aspect	Corrigibility
Focus	Acceptance of correction
Risk addressed	Resistance to oversight
Alignment layer	Inner alignment + control stability
Scaling importance	Very high
Role in superalignment	Foundational

Related Concepts

Inner vs Outer Alignment
Objective Robustness
Superalignment
Instrumental Convergence
Deceptive Alignment
Scalable Oversight
Constitutional AI
Alignment Failures (Case Studies Framework)