Shortcut Learning in NLI
Can a model ace a language reasoning benchmark without understanding language? This project investigates shortcut learning in Natural Language Inference (NLI), showing that BERT achieves 71% accuracy on SNLI using only the hypothesis — no premise required — and that these shortcuts are mechanistically verifiable through gradient-based attribution.
Context
Natural Language Inference is the task of determining whether a hypothesis follows from a premise (entailment), contradicts it (contradiction), or is unrelated (neutral). SNLI, the standard benchmark, contains 550k annotated sentence pairs collected via crowdsourcing. A BERT model fine-tuned on SNLI reaches ~90% accuracy, commonly cited as evidence of strong language understanding.
This project asks a different question: how much of that performance reflects genuine reasoning, and how much reflects annotation artifacts exploitable without reading the premise?
Method
Four experimental phases, each designed to peel back a layer of the model’s performance and ask: where does this accuracy really come from?
1 · Pair model baseline
This is the control: a model trained the normal way, reading both sentences together. It sets the bar for what “good” performance looks like.
Fine-tuned `bert-base-uncased` on SNLI with standard premise + hypothesis input. Reached 90.34% accuracy on the SNLI validation set — consistent with published results.
2 · Hypothesis-only ablation
Here we deliberately handicapped the model: we hid the premise and asked it to classify using only the hypothesis. A model that truly understands language relationships should have struggled — but it didn’t.
Trained an identical model using only the hypothesis (premise dropped entirely). Result: 70.91% accuracy — far above the 33.3% chance baseline.
This single number is the central finding: ~66% of above-chance performance is replicable without the premise.
\((70.9 − 33.3) / (90.3 − 33.3) \approx 0.66\)
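The arithmetic can be checked directly:

```python
# Fraction of above-chance accuracy that the hypothesis-only model
# replicates, using the accuracies reported in this write-up.
chance = 100 / 3        # 3-way classification
pair_acc = 90.34        # pair model, SNLI validation
hyp_only_acc = 70.91    # hypothesis-only model, SNLI validation

fraction = (hyp_only_acc - chance) / (pair_acc - chance)
print(round(fraction, 2))  # 0.66
```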
3 · Shortcut identification
To understand why the model still performed well without the premise, we looked for patterns in the training data it may have learned to exploit — statistical “tells” that leak the correct answer without requiring any actual reasoning about meaning.
Lexical shortcuts (lift analysis). For each (token, label) pair across 549k training hypotheses, \(\text{lift} = P(\text{label} \mid \text{token}) / P(\text{label})\). 2,362 tokens were retained after stopword filtering and a minimum count of 100:
| Label | Token | Lift | P(label \| token) |
|---|---|---|---|
| Entailment | least | 3.07 | 92.5% |
| Neutral | championship | 2.59 | 95.0% |
| Contradiction | nobody | 3.00 | 99.5% |
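Lift needs nothing more than token and label counts. A minimal sketch on toy data (the real analysis ran over all 549k training hypotheses with stopword filtering and a minimum token count of 100; the toy examples below are illustrative only):

```python
from collections import Counter

# Toy (hypothesis, label) pairs; the real analysis uses SNLI training data.
data = [
    ("nobody is outside", "contradiction"),
    ("nobody is swimming", "contradiction"),
    ("a man is sleeping", "neutral"),
    ("a man is outside", "entailment"),
]

label_counts = Counter(label for _, label in data)
pair_counts = Counter()
token_counts = Counter()
for text, label in data:
    for tok in set(text.split()):          # count each token once per example
        pair_counts[(tok, label)] += 1
        token_counts[tok] += 1

def lift(token, label):
    """lift = P(label | token) / P(label)."""
    p_label_given_token = pair_counts[(token, label)] / token_counts[token]
    p_label = label_counts[label] / len(data)
    return p_label_given_token / p_label

print(lift("nobody", "contradiction"))  # 2.0 on this toy data
```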
Structural shortcuts. Four surface features (including hypothesis length, premise–hypothesis token overlap, and Jaccard similarity) achieve 49% accuracy in logistic regression cross-validation — 15.7 pp above chance, with no semantic understanding.
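As a sketch of what such a probe looks like, here is a cross-validated logistic regression over three simple surface features; the feature definitions and the synthetic examples are placeholders for illustration, not the project's actual feature set or data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def surface_features(premise, hypothesis):
    """Shallow features that ignore meaning entirely."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    overlap = len(p & h)
    jaccard = overlap / len(p | h) if p | h else 0.0
    return [len(h), overlap, jaccard]  # hypothesis length, overlap, Jaccard

# Synthetic stand-in for (premise, hypothesis, label) triples.
pairs = [("a man plays guitar outside", "a man plays music", 0),
         ("a man plays guitar outside", "nobody is playing", 2),
         ("a dog runs on grass", "a dog runs", 0),
         ("a dog runs on grass", "the dog wants a treat", 1)] * 25
X = np.array([surface_features(p, h) for p, h, _ in pairs])
y = np.array([lab for _, _, lab in pairs])

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```

On the real SNLI data, anything this probe scores above the 33.3% chance baseline is accuracy obtained without reading a single word's meaning.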
Token attribution (Integrated Gradients). IG attributions computed for 500 validation examples (4,152 token records) show that the model allocates more weight to statistically predictive tokens:
| Evidence | Result |
|---|---|
| Shortcut vs. other tokens | Mean \|attr\|: 0.398 vs. 0.288 (1.38× higher) |
| Attribution by lift bin | 0.288 (lowest bin) → 0.800 (highest bin), a 2.8× increase |
| Spearman ρ (lift vs. attr) | 0.281, p = 4.95e-05 |
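Integrated Gradients attributes a prediction to each input feature by averaging gradients along a straight-line path from a baseline to the input. The project computed attributions on the fine-tuned BERT (via Captum, per the stack below); the method itself is simple enough to sketch in numpy. On a linear toy model the closed form \((x - \text{baseline}) \cdot w\) is known, so the approximation can be checked — the model and values here are illustrative, not the project's:

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate IG: (x - baseline) times the average gradient on the path."""
    alphas = np.linspace(0, 1, steps + 1)
    grads = np.array([f_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy "model": f(x) = w . x, whose gradient is constant, so IG is exact.
w = np.array([0.5, -1.0, 2.0])
f_grad = lambda x: w
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)

attr = integrated_gradients(f_grad, x, baseline)
print(attr)  # equals (x - baseline) * w for a linear model
```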
4 · Anti-shortcut evaluation
If the shortcuts were responsible for the hypothesis-only model’s performance, removing them should cause a much steeper drop than for the normal model. That is exactly what happens.
Both models evaluated on three splits that neutralize lexical shortcuts:
| Split | n | Pair | Hyp-only | Pair drop | Hyp-only drop |
|---|---|---|---|---|---|
| SNLI validation | 9,842 | 90.34% | 70.91% | — | — |
| HANS (OOD) | 30,000 | 59.39% | 51.62% | −30.95 pp | −19.29 pp |
| SNLI Filtered | 795 | 89.56% | 48.18% | −0.78 pp | −22.73 pp |
| SNLI Paraphrased | 1,235 | 68.91% | 44.37% | −21.43 pp | −26.54 pp |
SNLI Filtered removes all examples with any token of lift > 1.2. The pair model drops only −0.78 pp (genuine reasoning capacity intact). The hypothesis-only model drops −22.73 pp — a 29× larger drop — confirming that lexical shortcuts drove most of its above-chance performance.
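A filtered split like this can be built by dropping any example whose hypothesis contains a token above the lift threshold. A minimal sketch (the lift table is a toy stand-in for the project's 2,362-token table, collapsed here to one score per token for brevity):

```python
LIFT_THRESHOLD = 1.2

# Toy lift table; the real one maps 2,362 tokens to their lift scores.
lift = {"nobody": 3.00, "least": 3.07, "championship": 2.59, "man": 1.05}

def has_shortcut_token(hypothesis):
    """True if any token's lift exceeds the threshold (unknown tokens -> 1.0)."""
    return any(lift.get(tok, 1.0) > LIFT_THRESHOLD
               for tok in hypothesis.lower().split())

examples = [
    {"hypothesis": "Nobody is swimming", "label": "contradiction"},
    {"hypothesis": "A man is swimming", "label": "entailment"},
]
filtered = [ex for ex in examples if not has_shortcut_token(ex["hypothesis"])]
print(len(filtered))  # 1
```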
HANS exposes an entailment bias in the pair model: 99.1% entailment recall but only 19.68% non-entailment recall, because SNLI training over-represents high-overlap entailment examples.
Key Findings
Short answer to the question posed at the top: yes. A model that never sees the premise still reaches 71% accuracy, far above the 33% chance baseline. Most of that performance comes not from understanding language but from statistical patterns in how annotators phrased the hypotheses.
- ~66% of above-chance SNLI performance is replicable without the premise, by exploiting annotation artifacts in the hypothesis alone.
- Shortcuts are statistically extreme: *nobody* predicts contradiction with 99.5% probability; four surface features achieve 49% accuracy.
- The model has mechanistically internalized the shortcuts: IG attribution is 2.8× higher for strong-shortcut tokens, with a significant rank correlation between lift and attribution (ρ = 0.281, p < 1e-4).
- Removing shortcuts collapses the hypothesis-only model: from 70.91% to 48.18% on the filtered split (near chance on neutral and contradiction), while the pair model holds at 89.56%.
- High benchmark scores mask genuine capability gaps: the pair model’s 90.34% standard accuracy hides both its shortcut dependence and a structural HANS entailment bias.
What could be done about it?
These results point to several directions worth exploring as next steps — none of them fully solved:
- Cleaner benchmarks: Design evaluation datasets from the ground up to minimize exploitable patterns, rather than filtering them out after the fact.
- Shortcut-aware training: Penalize models that over-rely on single words or surface features, pushing them toward reasoning from the full context. One concrete approach is to train a bias model (like the hypothesis-only model here) alongside the main model and penalize overlap in what they learn.
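One published version of this idea is product-of-experts debiasing (Clark et al., 2019): the frozen bias model's log-probabilities are added to the main model's logits during training, so examples the bias model already solves contribute little loss and little gradient. A hedged numpy sketch of the combined loss, with illustrative values:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def poe_loss(main_logits, bias_log_probs, label):
    """Cross-entropy on the product of experts: softmax(main + bias)."""
    combined = log_softmax(main_logits + bias_log_probs)
    return -combined[label]

# If the bias model already nails the label, the combined loss is tiny,
# so the main model receives little gradient for this example.
main = np.array([0.1, 0.1, 0.1])                        # main model undecided
bias_confident = np.log(np.array([0.01, 0.01, 0.98]))   # bias sure of class 2
bias_uniform = np.log(np.array([1/3, 1/3, 1/3]))        # bias has no opinion

print(poe_loss(main, bias_confident, label=2))  # small loss
print(poe_loss(main, bias_uniform, label=2))    # larger loss, ~log(3)
```

The hypothesis-only model trained in phase 2 is exactly the kind of bias model this scheme calls for.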
- Routine stress-testing: Evaluate models on filtered or adversarial splits before reporting benchmark scores, to distinguish genuine reasoning from artifact exploitation. The filtered and paraphrased splits built in this project offer a reusable starting point.
Exploring how much each of these strategies actually closes the capability gap — and whether they transfer across datasets and tasks — would be a natural continuation of this work.
Relevance to AI Safety
This project is a concrete, reproducible demonstration of goal misgeneralization:
- The intended objective is semantic inference between sentences.
- The learned objective is hypothesis-side shortcut exploitation.
The IG analysis makes this mechanistic rather than merely correlational: the model does not just benefit from shortcuts at the distributional level — it has internalized them as its primary decision rule. This has direct implications for:
- Evaluation reliability — standard benchmark accuracy does not distinguish genuine reasoning from artifact exploitation.
- Robustness under distribution shift — shortcuts break when annotation artifacts are neutralized, as the anti-shortcut splits demonstrate.
- Capability elicitation — the pair model’s robustness on SNLI Filtered (−0.78 pp) shows that genuine reasoning capacity exists but is masked by easier shortcut strategies under standard training.
Stack
Python · PyTorch · Hugging Face Transformers · Captum · NLTK · scikit-learn · pandas · matplotlib · seaborn