Constitutional AI Paper
What this is
Constitutional AI (CAI) is a December 2022 Anthropic research paper (arXiv:2212.08073) introducing a method to train a harmless AI assistant through self-improvement, without any human feedback labels for harmfulness. The only human oversight is a short list of natural language principles called a "constitution." The method produces a final model, RL-CAI, that crowdworkers prefer over models trained with human harmlessness labels from prior Anthropic work.
Capabilities
The paper evaluates models ranging from roughly 1B to 52B parameters. RL-CAI models achieve higher harmlessness Elo scores than HH RLHF baselines while maintaining comparable helpfulness Elo scores; the paper plots this trade-off as a Pareto frontier across 24 model snapshots. On 438 binary HHH comparison questions, a 52B ensembled chain-of-thought model approaches the accuracy of a preference model trained on several hundred thousand human feedback labels. Context window size and modalities are not reported.
Evaluation methodology
Harmlessness and helpfulness are measured via crowdworker Elo scores from pairwise comparison tests; 10,274 helpfulness and 8,135 harmlessness comparisons were collected across 24 snapshots. An absolute harmfulness score on a 0–4 scale is predicted by a language model fine-tuned on crowdworker ratings from red-teaming sessions, and is averaged over 256 model responses per prompt across 64 held-out prompts. A 438-question binary HHH evaluation and two auxiliary harm classification datasets (254 and 287 examples) provide additional accuracy metrics. Crowdworkers were instructed to prefer non-evasive but harmless responses over evasive ones, a change from prior work that affects Elo comparisons.
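To make the Elo figures easier to interpret, the sketch below converts between an Elo gap and a pairwise preference rate. It assumes the conventional Elo model (base 10, 400-point scale); this summary does not describe the paper's exact fitting procedure, and the function names are illustrative only.

```python
import math


def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the conventional Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Elo difference (A minus B) implied by A's observed pairwise win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))
```

Under this convention, equal ratings correspond to a 50% preference rate, and the two functions are inverses of each other, so a reported Elo gap can be read back as an approximate head-to-head win rate.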
Safety testing
Red teaming used 42,496 human-written prompts and 140,335 model-generated prompts (182,831 total) covering toxicity, racism, sexism, illegal advice, and sensitive social topics. The paper does not conduct CBRN, cyber, or autonomy-specific evaluations. Qualitative samples show RL-CAI engages with harmful queries by explaining objections rather than refusing outright, which the authors treat as a transparency-preserving safety property. The authors note that over-trained RL-CAI models exhibit Goodharting behavior, producing boilerplate affirmations in response to most adversarial prompts.
Mitigations
The supervised stage applies 16 constitutional critique-revision principles, one sampled at random for each revision step, to iteratively rewrite harmful model responses before fine-tuning on the revised outputs. The RL stage replaces human harmlessness labels with AI-generated preference labels, drawing on a separate set of 16 principles with one sampled at random per comparison; the authors report that this ensembling over principles makes the preference model more robust. Soft preference labels (normalized log-probabilities) are used in place of hard labels; for chain-of-thought feedback, probabilities are clamped to the 40–60% range to prevent over-confident training signals. No ASL tier, classifier threshold, or access control framework is invoked; this is a research paper predating those deployment structures.
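The soft-label and clamping steps described above can be sketched as follows. This is a minimal illustration, assuming the feedback model exposes one log-probability for each of the two answer options; the function names and the default bounds as keyword arguments are hypothetical, not taken from the paper's code.

```python
import math
from typing import Tuple


def soft_preference_label(logprob_a: float, logprob_b: float) -> Tuple[float, float]:
    """Normalize two option log-probabilities into targets that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logprob_a, logprob_b)
    weight_a = math.exp(logprob_a - peak)
    weight_b = math.exp(logprob_b - peak)
    total = weight_a + weight_b
    return weight_a / total, weight_b / total


def clamp_label(p_a: float, low: float = 0.4, high: float = 0.6) -> Tuple[float, float]:
    """Clamp the probability for option A to [low, high]; B gets the remainder."""
    clamped = min(max(p_a, low), high)
    return clamped, 1.0 - clamped
```

For example, a near-certain chain-of-thought label such as 0.97 for option A would be pulled back to 0.6 before being used as a preference model training target, while a label already inside the range passes through unchanged.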
Deployment and access
The paper is published as an arXiv preprint and does not describe a product deployment. Anthropic releases a companion GitHub repository containing few-shot prompts, constitutional principles, and model response samples. No API surface, license terms, or access restrictions are described.
Limitations
Preference model scores become "less calibrated at higher values," so harmlessness gains from additional revision steps should be "taken with a grain of salt." Of the model-written critiques, the authors report that they were "sometimes reasonable, but often made inaccurate or overstated criticisms," though revisions still improve harmlessness. The method still relies on human labels for helpfulness; eliminating these is left to future work. The constitutional principles were "selected in a fairly ad hoc manner for research purposes" and the authors state they "should be redeveloped and refined by a larger set of stakeholders" before broader deployment.
What's new
This is the original CAI paper; no prior version exists. It extends the authors' earlier HH RLHF work (Bai et al., 2022) by introducing a critique-revision supervised stage and replacing human harmlessness labels with AI-generated labels (RLAIF). Chain-of-thought prompting is applied to the feedback model as a further addition to improve label quality and evaluation transparency.
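The critique-revision supervised stage can be sketched as a short loop. This is a simplified illustration under stated assumptions: `generate` is a hypothetical stand-in for any text-completion callable, each principle is modeled as a (critique request, revision request) pair, and the newline-joined prompt format is invented for the example rather than taken from the paper.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: (critique request, revision request).
Principle = Tuple[str, str]


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principles: List[Principle],
    num_revisions: int = 1,
    rng: Optional[random.Random] = None,
) -> str:
    """Draft a response, then repeatedly critique and revise it."""
    rng = rng or random.Random()
    response = generate(prompt)
    for _ in range(num_revisions):
        # Sample one principle at random for this revision step.
        critique_request, revision_request = rng.choice(principles)
        context = f"{prompt}\n{response}\n{critique_request}"
        critique = generate(context)
        # The revision replaces the response carried into the next step.
        response = generate(f"{context}\n{critique}\n{revision_request}")
    return response
```

In the method as summarized here, the final revised responses become fine-tuning targets for the supervised model; the sketch covers only the generation loop, not the fine-tuning itself.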