AI Safety & Ethics

Hallucinations, bias, prompt injection, responsible AI

Advanced25 min

Prerequisites

  • rlhf-alignment

AI models are tools, and like any powerful tool, they can be misused. A language model doesn't know the difference between helping you write a poem and helping you write a phishing email — it's just predicting the next token. AI safety is about building guardrails so these incredibly capable systems stay helpful without being harmful.

The challenge is that safety isn't a checkbox you tick once and forget. Models can hallucinate false information with total confidence. They inherit biases from their training data. Adversarial users develop clever prompt injection techniques to bypass safety measures. And as models get more capable, the stakes get higher. A model that can write production code or handle sensitive data needs rock-solid safety, not just good intentions.

This isn't about limiting AI — it's about making AI reliably beneficial. Safety research focuses on four main categories: preventing hallucinations (models confidently stating false information), mitigating bias and ensuring fairness, defending against adversarial attacks like prompt injection, and building interpretable systems where we understand why a model made a specific decision. Each one requires different technical approaches, from training techniques to runtime guardrails to architectural design. Let's break them down.

What can go wrong

AI safety risks fall into four main categories, each requiring different mitigation strategies. Understanding these categories is the first step toward building safer systems.

AI Risk Categories

LLM
Powerful, general-purpose
Hallucinations
False info, confidently stated
Bias & Fairness
Inherited from training data
Adversarial Misuse
Prompt injection, jailbreaks
Privacy & Opacity
Data leakage, black box
Potential Harm
Without proper safeguards

Each risk category needs different solutions. Hallucinations require grounding in facts and uncertainty calibration. Bias needs diverse training data and fairness testing. Adversarial attacks need input filtering and robust prompts. Interpretability requires transparency tools and model cards. No single technique solves everything — safety is a layered defense.

Building safer AI

AI safety isn't one technique — it's a comprehensive approach across training, deployment, and monitoring. Let's walk through the four major risk categories and how each one is addressed in practice.

Hallucinations

Hallucinations are when a model generates information that sounds plausible but is completely false. The model doesn't “know” it's wrong — it's just predicting tokens that fit the pattern. That's incredibly dangerous when users trust the output.

Example hallucination:
User: What year did the Mars rover Perseverance land?
Model (correct): February 2021
User: And what did it discover about ancient Martian civilization?
Model (hallucinated): Perseverance found evidence of stone tools and pottery fragments dating back 50,000 years in Jezero Crater...
The model confidently fabricates details because the question pattern triggers completion, not fact-checking.

Mitigation strategies:

1. RAG (Retrieval-Augmented Generation)
Ground the model in retrieved facts. Instead of pure generation, search a knowledge base first and cite sources. “According to NASA's mission logs...”
2. Calibrated Uncertainty
Train models to express uncertainty. “I'm not sure” is better than a confident hallucination. Models can learn when they don't know.
3. Citation Requirements
Force the model to cite sources for factual claims. If it can't cite, it shouldn't claim. This is enforced through prompting and fine-tuning.
4. Fact-Checking Layers
Use a second model or external verifier to check factual claims before they reach the user. Catch hallucinations at runtime.

No single technique eliminates hallucinations completely, but combining these approaches dramatically reduces their frequency. The key is making models aware of their own limitations and grounding them in verifiable information.

Step 1 of 4

Safety testing in action

This simulation demonstrates how safety systems detect and prevent harmful outputs. Toggle between “Safety ON” and “Safety OFF” to see the difference. Try the pre-loaded scenarios to see how prompt injection, jailbreaking, and legitimate edge cases are handled.

AI Safety Testing Lab

Test how safety systems detect and prevent harmful outputs

Safety Guardrails:ON
No messages yet
Select a scenario and click “Run” to see how safety systems work

Safety System Log

Run a scenario to see safety decisions
Safety Levels
Safe
No harmful patterns detected
Cautious
Sensitive topic, handled with care
Blocked
Request declined for safety

This simulation demonstrates safety concepts. Real AI safety systems are more sophisticated and include multiple layers of protection including training-time safety, input filtering, output monitoring, and human oversight.

Key Takeaways

  • AI safety is not about limiting capability — it is about making AI reliably beneficial. Safety and capability should scale together, not be traded off against each other.
  • Hallucinations (models confidently stating false information) are mitigated through RAG for grounding, calibrated uncertainty, citation requirements, and runtime fact-checking layers. No single technique eliminates them completely.
  • Bias and fairness require diverse training data, systematic testing across demographics, defined fairness metrics, and ongoing auditing. Bias is not a one-time fix — it can creep back in during retraining and deployment.
  • Prompt injection and jailbreaking are adversarial attacks to bypass safety. Defenses include input filtering, system prompt hardening, output monitoring, and Constitutional AI. This is an ongoing arms race.
  • Interpretability and transparency are critical for trust, debugging, and regulation. Techniques include attention visualization, probing internal representations, model cards documenting limitations, and showing reasoning steps (chain-of-thought, extended thinking).

Common Misconceptions

  • "Safety just means refusing harmful requests." — That is only one layer. Comprehensive safety includes preventing hallucinations, mitigating bias, defending against adversarial attacks, and building interpretable systems. Refusal is the last line of defense, not the only one.
  • "Models are either safe or unsafe." — Safety is a spectrum and context-dependent. A model might be safe for general chat but unsafe for medical advice. Safety also degrades over time as new attack vectors are discovered. Continuous monitoring and red-teaming are essential.
  • "If a model passes bias tests, it is fair." — Fairness is not a binary property, and there are multiple definitions of fairness that can contradict each other (demographic parity vs equalized odds vs calibration). What matters is defining fairness for your specific use case and measuring it consistently.

Quick check

What is prompt injection in the context of AI safety?