Summary of Shallow Review of Technical AI Safety, 2024 (LessWrong)
This Shallow Review provides an overview of ongoing research agendas in technical AI safety as of 2024, categorizing efforts into key domains. The review highlights the breadth of activity without implying equivalent progress across all areas. It serves as a reference for researchers, policymakers, and funders.
Key Research Areas:
1. Understanding Existing Models
Evals: Developing tools to assess AI capabilities and risks.
Red-teaming: Stress-testing models to identify vulnerabilities.
Eliciting Model Anomalies: Studying outlier behaviors and unexpected model outputs.
Interpretability: Reverse-engineering models to understand their internal mechanisms.
Mechanistic Interpretability
Sparse Autoencoders
Concept-based Interpretability
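To make the sparse-autoencoder idea concrete, here is a minimal numpy sketch: an overcomplete dictionary trained with an L1 penalty so that each "activation" is reconstructed from a few active features. This is an illustrative toy on random data, not any particular lab's implementation; all dimensions and hyperparameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a model's hidden activations: 200 samples, 16 dimensions.
X = rng.normal(size=(200, 16))

# Overcomplete dictionary (64 features) with an L1 penalty on feature
# activations, encouraging each input to be explained by few features.
d_in, d_hidden, lam, lr = 16, 64, 1e-3, 1e-2
W_enc = rng.normal(scale=0.1, size=(d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_in))

losses = []
for _ in range(500):
    f = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations
    X_hat = f @ W_dec                        # reconstruction
    err = X_hat - X
    losses.append((err ** 2).mean())
    # Gradient of MSE + lam * L1(f), backpropagated through the ReLU.
    grad_f = (err @ W_dec.T + lam * np.sign(f)) * (f > 0)
    W_dec -= lr * f.T @ err / len(X)
    W_enc -= lr * X.T @ grad_f / len(X)
    b_enc -= lr * grad_f.mean(axis=0)

features = np.maximum(X @ W_enc + b_enc, 0.0)
sparsity = (features > 0).mean()             # fraction of active features
```

The interpretability hope is that each dictionary feature (column of `W_dec`) corresponds to a human-legible concept, so sparse feature activations decompose opaque hidden states into nameable parts.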
2. Controlling AI Behavior
Preventing Deception & Scheming: Detecting deceptive tendencies and steering models away from unsafe behaviors.
Surgical Model Edits: Directly modifying model weights or activations to enforce desired behavior.
Goal Robustness: Ensuring AI objectives remain stable and aligned.
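One common form of activation-level surgical edit is steering with a contrastive direction: take the difference of mean activations on two behavior classes and add it to a hidden state at inference time. The sketch below uses synthetic clusters standing in for real activations; the "honest"/"deceptive" labels and all numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

# Synthetic hidden states for two contrastive behaviors (toy stand-ins).
honest_acts = rng.normal(loc=0.5, size=(100, d))
deceptive_acts = rng.normal(loc=-0.5, size=(100, d))

# Steering vector: normalized difference of cluster means.
v = honest_acts.mean(axis=0) - deceptive_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden_state, vector, alpha=4.0):
    """Add a scaled steering vector to a layer's activation at inference."""
    return hidden_state + alpha * vector

h = rng.normal(loc=-0.5, size=d)     # a "deceptive-leaning" activation
h_steered = steer(h, v)

# The edit moves the state toward the desired behavioral cluster.
dist_before = np.linalg.norm(h - honest_acts.mean(axis=0))
dist_after = np.linalg.norm(h_steered - honest_acts.mean(axis=0))
```

The appeal of such edits is that they intervene directly on internal representations rather than relying on prompting or fine-tuning, though whether the direction generalizes beyond the contrastive set is an open question.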
3. Safety by Design
Alternative Architectures: Exploring AI designs that avoid deep learning’s alignment risks.
4. Using AI to Solve AI Safety
Scalable Oversight: Using weaker overseers, whether humans or less capable AI systems, to supervise stronger ones.
Task Decomposition: Breaking down complex tasks to improve safety.
Adversarial Approaches: Stress-testing models through competitive dynamics.
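The task-decomposition idea can be sketched in a few lines: an overseer splits a task it cannot check directly into subtasks that a deliberately limited worker can solve, then merely aggregates trusted answers. This is a toy illustration of the factored-cognition pattern; the task and function names are hypothetical.

```python
def worker(subtask):
    """A deliberately limited solver: handles only single-number subtasks."""
    op, x = subtask
    if op == "square":
        return x * x
    raise ValueError(f"worker cannot handle {op!r}")

def decompose(numbers):
    """Split 'sum of squares' into one independently checkable subtask each."""
    return [("square", x) for x in numbers]

def oversee(numbers):
    """Solve the composite task by aggregating answers to easy subtasks."""
    return sum(worker(s) for s in decompose(numbers))
```

The safety-relevant property is that each subtask is small enough to verify in isolation, so trust in the final answer reduces to trust in the decomposition plus trust in each easy piece.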
5. Theoretical Foundations
Understanding Agency: Investigating how AI systems develop goals and optimize behavior.
Corrigibility: Designing AI systems that can be reliably corrected.
Ontology Identification: Mapping how AI conceptualizes the world to ensure alignment.
6. Miscellaneous Initiatives
AI Governance & Policy: Growing institutional efforts to regulate AI development and mandate safety practices.
Industry & Academic Efforts: Contributions from major research groups like OpenAI, DeepMind, Anthropic, and independent researchers.
Trends and Observations
Shifts Toward Control Over Alignment: The field is increasingly focused on keeping AI systems controllable even if they are misaligned, rather than solely on aligning them with human values.
Increasing Government & Industry Involvement: AI safety is becoming mainstream, with government agencies and tech companies investing in research.
Concerns Over Model Scaling: Despite claims that AI progress is slowing, 2024 saw substantial advances in multimodality, reasoning, and agentic capabilities.
Emerging Risks: AI deception, steganography, and unaligned agency remain significant challenges.
This review provides a high-level map of AI safety research, emphasizing areas that require further exploration and funding.